
A Bayesian nonparametric view on Count-Min Sketch under power-law data streams

BENETTI, LUCA
2019/2020

Abstract

The count-min sketch (CMS) is a time- and memory-efficient probabilistic data structure that uses hash functions to compress data and estimate the frequency with which an item appears in a large data stream. In this work, starting from the Bayesian nonparametric (BNP) view of the CMS given by Cai, Adams and Mitzenmacher [4], we develop a novel learning-augmented CMS under the assumption of power-law data streams. In particular, we model the unknown distribution of the tokens in the data stream with a normalized inverse-Gaussian process (NIGP) prior. Thanks to the predictive distributions and the restrictive and projective properties of the NIGP, we obtain the posterior distribution of the frequency of an item in the data stream given its hashed frequencies, and in turn BNP estimates. Applications to synthetic and real data show that our approach achieves remarkable performance in the estimation of low-frequency tokens. The ability to estimate statistics under power-law behaviour is important in modern natural language processing, where such data are frequently encountered.
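
For readers unfamiliar with the data structure, the sketch below is a minimal, illustrative count-min sketch in Python (the class name, parameters, and toy stream are invented for the example). It shows only the classical update/query mechanism that this work augments, not the BNP estimator developed in the thesis.

import random

class CountMinSketch:
    # d rows of w counters; each row uses its own hash seed.
    # update() increments one counter per row; query() takes the minimum
    # across rows, which upper-bounds the true frequency.
    def __init__(self, width=2000, depth=5, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.seeds = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, token, row):
        return hash((self.seeds[row], token)) % self.width

    def update(self, token, count=1):
        for row in range(self.depth):
            self.table[row][self._index(token, row)] += count

    def query(self, token):
        return min(self.table[row][self._index(token, row)]
                   for row in range(self.depth))

# Toy usage: estimates never undercount, but rare tokens can be overestimated
# because of hash collisions, which is the low-frequency regime the power-law
# prior is meant to improve.
cms = CountMinSketch(width=500, depth=4)
for token in ["the"] * 100 + ["rare"] * 2 + ["word"] * 10:
    cms.update(token)
print(cms.query("the"), cms.query("rare"), cms.query("word"))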
Language: English
Files in this item:
905301_tesi_benetti.pdf — Adobe PDF, 649.75 kB (not available)
Type: Other attached material

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/29996