The era of big data has led to an exponential growth in the amount of information available, surpassing our current capabilities to effectively handle and process it. In response, compact data structures have emerged as a solution to efficiently represent and extract information from large data streams, albeit at the cost of sacrificing accuracy. Sketches, a type of compact randomized data structure, have gained prominence for their ability to estimate various statistics from massive data streams with significant gains in both time and memory efficiency. This paper focuses on token-based data streams, where a summary is created by applying sketching techniques using random hash functions. Specifically, we explore a Bayesian nonparametric (BNP) estimator that models the data stream of tokens through a Dirichlet process (DP) prior. Building upon the work introduced by Favaro et al. (2022), we present a revised version of the estimator and derive an explicit expression for its variance, enabling us to quantify its uncertainty.

Bayesian nonparametric cardinality estimation

FIOCCHI, FILIPPO
2022/2023

Abstract

The era of big data has led to an exponential growth in the amount of information available, surpassing our current capabilities to effectively handle and process it. In response, compact data structures have emerged as a solution to efficiently represent and extract information from large data streams, albeit at the cost of sacrificing accuracy. Sketches, a type of compact randomized data structure, have gained prominence for their ability to estimate various statistics from massive data streams with significant gains in both time and memory efficiency. This paper focuses on token-based data streams, where a summary is created by applying sketching techniques using random hash functions. Specifically, we explore a Bayesian nonparametric (BNP) estimator that models the data stream of tokens through a Dirichlet process (DP) prior. Building upon the work introduced by Favaro et al. (2022), we present a revised version of the estimator and derive an explicit expression for its variance, enabling us to quantify its uncertainty.
ENG
IMPORT DA TESIONLINE
File in questo prodotto:
File Dimensione Formato  
977691_thesis_fiocchi_final.pdf

non disponibili

Tipologia: Altro materiale allegato
Dimensione 1 MB
Formato Adobe PDF
1 MB Adobe PDF

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14240/147966