The necessity to realize requests of information from databases (query) is very popular in many fields, from medical to industrial sector, passing through educational one. In many disciplines the amount of data that should be managed is dramatically huge (big data problem) and even the most advanced processors can take unacceptable time to perform query. We will show that to execute a query in the best possible way it is important to know how many distinct values there are in the considered database: knowing the number of distinct values, it is possible to understand the best order of steps to do in order to construct a good query plan. The right choice of the most efficient algorithm to apply to a particular query is of paramount importance to get the best result in the shortest time. In this thesis two algorithmic estimators for distinct values are implemented using a non-parametric approach for Zipfian distribution, which is a type of distribution with heavy tails. The aim is to analyze their performance both in terms of errors and computational time. Possible future works can be done in this field in order to find estimators that lower execution times (in particular for highly skewed data) and reduce the error for small sample sizes.
A non-parametric approach to query optimization under heavy tails
PALESTINI, SIMONA
2016/2017
Abstract
The necessity to realize requests of information from databases (query) is very popular in many fields, from medical to industrial sector, passing through educational one. In many disciplines the amount of data that should be managed is dramatically huge (big data problem) and even the most advanced processors can take unacceptable time to perform query. We will show that to execute a query in the best possible way it is important to know how many distinct values there are in the considered database: knowing the number of distinct values, it is possible to understand the best order of steps to do in order to construct a good query plan. The right choice of the most efficient algorithm to apply to a particular query is of paramount importance to get the best result in the shortest time. In this thesis two algorithmic estimators for distinct values are implemented using a non-parametric approach for Zipfian distribution, which is a type of distribution with heavy tails. The aim is to analyze their performance both in terms of errors and computational time. Possible future works can be done in this field in order to find estimators that lower execution times (in particular for highly skewed data) and reduce the error for small sample sizes.File | Dimensione | Formato | |
---|---|---|---|
826271_tesi.pdf
non disponibili
Tipologia:
Altro materiale allegato
Dimensione
723.79 kB
Formato
Adobe PDF
|
723.79 kB | Adobe PDF |
I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.
https://hdl.handle.net/20.500.14240/46931