Regole associative: un confronto fra l’algoritmo Apriori e l’algoritmo di crescita del pattern frequente

The thesis examines a data mining’s central topic: association rules. Association rules mean a relationship between two itemsets (elements, from a large database, purchased jointly). This relationship can be quantified through three distinct measures: support, confidence and lift. An important element is the research for frequent itemsets from a large amount of data. A frequent itemset has support higher than its threshold value. In this respect, it is possible to choose between different identification approaches. The following paper aims to present the two “frequent itemset” research strategies most adopted by the discipline. The Apriori algorithm adopts an iterative approach to consult all the itemsets. It starts considering frequent sets composed by a single element; at the next step the previous frequent items are added two by two; so on, until frequent k-itemsets are found. The FP growth algorithm can be considered an evolution of the previous algorithm. Unlike the Apriori, it does not need to scan the whole database several times but only twice. This advantage comes from the fact that it is based on a tree structure which contains all the information. The FP growth algorithm adopts a recursive approach. Frequent items are identified by ordering them in decreasing order of frequency; from the previous constructed tree, will be obtained one tree per each itemset; now the combinations of frequent goods are determined. Therefore, if the first paradigm is based on the principle "if a set of elements is frequent, all its subsets must be too", the latter bases its steps on the "divide et impera" paradigm.

La tesi prende in esame un argomento centrale del data mining: le regole associative. Per regole associative si intende una relazione fra due itemset (elementi, di un grande database, acquistati congiuntamente). Tale relazione può essere quantificata attraverso tre misure distinte: il supporto, la confidenza e il lift. Un elemento di rilievo è la ricerca di itemset frequenti da una grande mole di dati. Un itemset frequente presenta un supporto di valore superiore al relativo valore soglia. A tal proposito, è possibile scegliere fra approcci di individuazione differenti. Il seguente elaborato mira a presentare le due strategie di ricerca di “frequent itemset” maggiormente adottate dalla disciplina. L’algoritmo Apriori adotta un approccio iterativo per esplorare tutti gli itemset. Si inizia considerando i set frequenti composti da un unico elemento; allo step successivo si uniscono gli item frequenti precedenti a due a due; così via fino all’individuazione di k-itemset frequenti. L’algoritmo di crescita del pattern frequente può essere considerato un’evoluzione dell’algoritmo precedente. A differenza dell'Apriori, esso non necessita di scansionare più volte l’intero database ma esclusivamente due volte. Questo vantaggio deriva dal fatto che esso si basa su una struttura ad albero che contiene tutte le informazioni. L’algoritmo di crescita del pattern frequente adotta un approccio ricorsivo. Si individuano gli items frequenti ordinandoli in ordine decrescente della frequenza; dall’albero costruito precedentemente, se ne ricava uno condizionatamente a ciascun itemset; da qui si determinano le combinazioni di beni frequenti. Pertanto, se il primo paradigma si basa sul principio “se un insieme di elementi è frequente, anche tutti i suoi sottoinsiemi devono esserlo”, quest’ultimo fonda i suoi step sul paradigma “divide et impera”.