Comparative Analysis of Statistical Models for Computing Probability of Default: Application of Machine Learning Techniques in Python

This thesis applies six machine learning algorithms to the estimation of probability of default (PD), with the primary goal of improving the models' ability to identify true defaulters. The algorithms are XGBoost, AdaBoost, Random Forest, Decision Tree, Logistic Regression, and Gradient Boosting. Using recall and precision as the main performance metrics, the study aims to balance the share of actual defaulters that are correctly identified (recall) against the share of flagged individuals who are actual defaulters (precision). The analysis is based on a banking dataset from H. Scheule, D. Roesch, and B. Baesens, Credit Risk Analytics: The R Companion (2017), which contains key information on individuals and whether or not they defaulted. The dataset underwent rigorous preprocessing, including stratified splitting and SMOTE oversampling to address potential imbalance between defaulters and non-defaulters. The results show that, while recall is critical for maximizing the identification of defaulters, improving precision is equally necessary to reduce false positives. Among the algorithms, XGBoost and Random Forest achieved the best balance between the two metrics, making them particularly effective for predicting the probability of default. These findings have significant implications for the financial sector, where accurate PD predictions are crucial for effective credit risk management.
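The trade-off between recall and precision described above can be illustrated with a minimal sketch. The function name `precision_recall`, the toy labels, the PD scores, and the two thresholds are all hypothetical and chosen only for illustration; they are not taken from the thesis dataset or its results.

```python
from typing import Sequence, Tuple


def precision_recall(
    y_true: Sequence[int], pd_scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Return (precision, recall) for the defaulter class (label 1)."""
    # Flag an individual as a predicted defaulter when the estimated PD
    # reaches the chosen threshold.
    y_pred = [1 if score >= threshold else 0 for score in pd_scores]

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Precision: share of flagged individuals who actually defaulted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual defaulters that were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: 1 = defaulted, 0 = did not default.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
pd_scores = [0.90, 0.60, 0.30, 0.70, 0.40, 0.20, 0.10, 0.10, 0.05, 0.05]

for threshold in (0.5, 0.25):
    precision, recall = precision_recall(y_true, pd_scores, threshold)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.25 captures every defaulter (recall rises from 0.67 to 1.00) at the cost of more false positives (precision falls from 0.67 to 0.60), which is the tension the study seeks to balance.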
