A Bayesian nonparametric view on Count-Min Sketch under power-law data streams
BENETTI, LUCA
2019/2020
Abstract
The count-min sketch (CMS) is a time- and memory-efficient probabilistic data structure that uses hash functions to compress data and estimate the frequency with which an item appears in a large data stream. In this work, starting from the Bayesian nonparametric view on the CMS given by Cai, Adams and Mitzenmacher [4], we develop a novel learning-augmented CMS under the assumption of power-law data streams. In particular, we model the unknown prior distribution of the tokens in the data stream as a normalized inverse-Gaussian process (NIGP). Thanks to the predictive distributions and the restrictive and projective properties of the NIGP, we obtain the posterior distribution of the frequency of an item in the data stream, given the hashed frequencies, and in turn BNP estimates. Applications to synthetic and real data show that our approach achieves remarkable performance in the estimation of low-frequency tokens. The ability to estimate statistics of power-law data is particularly important in modern natural language processing, where such distributions arise frequently.
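For concreteness, the classical (non-Bayesian) count-min sketch described above can be sketched in a few lines of Python. This is a minimal illustration of the data structure, not the thesis's learning-augmented variant; the table dimensions and the salted blake2b hashing are illustrative choices.

```python
import hashlib


class CountMinSketch:
    """Minimal count-min sketch: `depth` hash rows of `width` counters.

    A point query returns the minimum counter across rows, which is an
    upper bound on the true count (hash collisions only inflate counters).
    """

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        # Derive `depth` distinct hash functions by salting with the row index.
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.width

    def update(self, item, count=1):
        # Increment one counter per row for the hashed item.
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def query(self, item):
        # Take the minimum over rows to reduce the effect of collisions.
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))
```

The Bayesian nonparametric approach of the thesis replaces this min rule with a posterior estimate of the token's frequency given the hashed counts, under an NIGP prior suited to power-law streams.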
File | Availability | Type | Size | Format
---|---|---|---|---
905301_tesi_benetti.pdf | not available | Other attached material | 649.75 kB | Adobe PDF
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14240/29996