Benchmark and evaluation of alternatives to binary classifiers in phishing detection.

Web phishing remains a highly relevant form of social engineering attack. It can take many forms and is difficult to detect, even when pages are carefully examined to distinguish malicious ones from legitimate ones. Among the many techniques currently used to detect phishing pages are binary classifiers, which are effective but require burdensome manual work to identify and label the training data. The idea proposed by the company Ermes Cyber Security is therefore to explore possible alternatives to binary classifiers. This thesis tests two such alternatives: One-class and PU-learning. The One-class model was chosen because it needs only samples of the target class for training. PU-learning, by contrast, works on a dataset in which a sample of positive data is labeled while the remaining elements are left unlabeled; the algorithm labels only the elements it considers positive and assigns no label to the rest. The work began with a literature review of the phishing detection scenario, followed by a review focused on the two models of interest. The following classifiers were then implemented: One-Class SVM, One-Class SGD, PU-learning Standard Classification, PU-learning bagging model, PU-learning Two-Steps method, LDA, Random Forest and XGBoost. The last three are the best-performing binary classifiers in the literature and serve as a baseline for comparison with the results of the One-class and PU-learning models. The research questions behind this work are: "Can a One-class classifier match the performance of a binary classifier?" and "Can a PU-learning classifier match the performance of a binary classifier?". The primary aim is not to improve performance, which binary classifiers already deliver, but to understand whether, at equal performance, it is possible to use methods that simplify data collection and labeling. The thesis finds that, in terms of performance, PU-learning could match a binary classifier, whereas One-class could not. This result is an interesting starting point for introducing PU-learning into phishing classification, for two reasons: first, it can ease the problem of data collection and labeling; second, it uses a binary classifier as its base. A binary classifier already in use can therefore be integrated into PU-learning with little effort.
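To make the difference between the two alternatives concrete, here is a minimal Python sketch, not the implementation used in the thesis. It assumes scikit-learn and NumPy, uses synthetic two-dimensional blobs as a stand-in for real page features, and picks a decision tree as the base binary classifier for a PU bagging scheme. The sample sizes, hyperparameters and variable names are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import OneClassSVM
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for page features (assumption): 1 = phishing, 0 = legitimate.
X, y_true = make_blobs(n_samples=1000, centers=[[-3, -3], [3, 3]],
                       cluster_std=1.0, random_state=0)

# PU setting: only a subset of the positives is labeled; the rest is unlabeled.
pos_idx = np.flatnonzero(y_true == 1)
labeled_pos = rng.choice(pos_idx, size=150, replace=False)
is_labeled = np.zeros(len(X), dtype=bool)
is_labeled[labeled_pos] = True
X_pos, X_unl = X[is_labeled], X[~is_labeled]
y_unl_true = y_true[~is_labeled]  # hidden ground truth, used only for inspection

# One-class: fit on samples of the target class only.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_pos)
oc_pred = ocsvm.predict(X_unl)  # +1 = resembles the target class, -1 = outlier

# PU bagging: repeatedly treat a random draw of unlabeled points as negatives,
# train a binary classifier, and score the unlabeled points left out of the draw.
n_rounds = 50
score_sum = np.zeros(len(X_unl))
score_cnt = np.zeros(len(X_unl))
for _ in range(n_rounds):
    boot = rng.choice(len(X_unl), size=len(X_pos), replace=True)
    oob = np.setdiff1d(np.arange(len(X_unl)), boot)
    X_train = np.vstack([X_pos, X_unl[boot]])
    y_train = np.concatenate([np.ones(len(X_pos)), np.zeros(len(boot))])
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    score_sum[oob] += clf.predict_proba(X_unl[oob])[:, 1]
    score_cnt[oob] += 1

# Average out-of-bag score per unlabeled point; higher means "more likely positive".
pu_score = score_sum / np.maximum(score_cnt, 1)

print("mean PU score, hidden positives:", pu_score[y_unl_true == 1].mean())
print("mean PU score, true negatives:  ", pu_score[y_unl_true == 0].mean())
print("one-class share flagged as target:", (oc_pred == 1).mean())
```

The one-class model never sees an unlabeled page during training, while the PU bagging loop reuses an ordinary binary classifier and only changes how its training set is assembled. This is why an existing binary classifier can in principle be dropped into the PU scheme.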
