A Bayesian nonparametric view on Count-Min Sketch under power-law data streams
BENETTI, LUCA
2019/2020
Abstract
The count-min sketch (CMS) is a time- and memory-efficient probabilistic data structure that uses hash functions to compress data and estimate the frequency with which an item appears in a large data stream. In this work, starting from the Bayesian nonparametric view on the CMS given by Cai, Adams and Mitzenmacher [4], we develop a novel learning-augmented CMS under the assumption of power-law data streams. In particular, we model the unknown prior distribution of the tokens in the data stream as a normalized inverse-Gaussian process (NIGP). Thanks to the predictive distributions and the restrictive and projective properties of the NIGP, we obtain the posterior distribution of the frequency of an item in the data stream, given the hashed frequencies, and in turn BNP estimates. Applications to synthetic and real data show that our approach achieves remarkable performance in the estimation of low-frequency tokens. The ability to estimate statistics of power-law data is particularly important in modern natural language processing, where such distributions arise frequently.
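For concreteness, the classical (non-Bayesian) count-min sketch described above can be sketched in a few lines of Python. This is a minimal illustration of the data structure, not the thesis's learning-augmented variant; the table dimensions and the salted blake2b hashing are illustrative choices.

```python
import hashlib


class CountMinSketch:
    """Minimal count-min sketch: `depth` hash rows of `width` counters.

    A point query returns the minimum counter across rows, which is an
    upper bound on the true count (hash collisions only inflate counters).
    """

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        # Derive `depth` distinct hash functions by salting with the row index.
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.width

    def update(self, item, count=1):
        # Increment one counter per row for the hashed item.
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def query(self, item):
        # Take the minimum over rows to reduce the effect of collisions.
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))
```

The Bayesian nonparametric approach of the thesis replaces this min rule with a posterior estimate of the token's frequency given the hashed counts, under an NIGP prior suited to power-law streams.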
File | Availability | Type | Size | Format
---|---|---|---|---
905301_tesi_benetti.pdf | not available | Other attached material | 649.75 kB | Adobe PDF
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14240/29996