Data Mining methods & Statistical analysis applied on Insurance Products

ROGGERO, ILARIA
2023/2024

Abstract

This master's thesis analyzes a dataset supplied by Banca di Asti, a bank located in Northern Italy, using the RStudio software. The final goal is to capture the main features of customers who take out two types of insurance products: health insurance and home insurance. In the dataset, customers can be either individuals or companies/organizations. The health insurance is a new item in the bank's catalogue; the discontinued health products are therefore included in the study and treated as part of a single line continued by the newest one. The First Chapter is an Introduction that explains the Bank's Management Data System and the CRISP-DM process (Objectives Setting, Data Understanding, Data Preparation, Pattern Mining & Model Building, Evaluation and Deployment). Because the work deals with AI, and more generally with Data Mining (DM), the chapter also reports some critiques and points of awareness about its use (Data Quality, Costs, Privacy, Human Rights, ...). The Second Chapter uses Lloyd's algorithm to map the bank's branches geographically through the so-called Voronoi tessellation. It continues with the Data Collection section, which highlights the main characteristics of the dataset: the variables and their types, the number of observations and the initial problems. It ends with the Data Cleaning steps, in which classical problems are treated: type conversion, typo correction, data mismatches, replacement of NA (missing) values and management of categorical variables. Here, an Extraction & Transformation step is used to handle the personal-data variables: some of them are not directly available in the historical archive, which is needed to describe customers as they were around the date the contract was signed. The Third Chapter is about the distributions of the data. It explores information such as the geographical dispersion across the country and the frequency of customers' personal characteristics (e.g. number of males/females, number of companies/individuals, most frequently chosen payment periodicity, ...). The trend in contract prices over time is also reported, to look for particular behavior correlated with specific healthcare, social or political events. The Fourth Chapter applies ANOVA to evaluate the effects of different control factors (e.g. sex, payment periodicity, region, ...) on price and age, the continuous variables. The Fifth Chapter deals with Clustering methods and addresses the problem of mixed-type variables. Three combinations of techniques are used: Gower distance with PAM (or CLARA), FAMD (dimensionality reduction) with hierarchical clustering, and K-prototypes. The Sixth and final Chapter addresses classification, with the aim of predicting which prospective customers should be offered the products of interest. Classical algorithms are implemented. This pilot project can be a starting point for future improvements, both to increase the quality of the results and to explore better and more suitable algorithms and techniques.
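As an illustration of the mixed-data problem mentioned for the clustering chapter, the following is a minimal sketch of the Gower distance, which averages a range-scaled absolute difference for numeric variables and a simple mismatch indicator for categorical ones. It is written in Python for compactness and is not the thesis's own R code; the function names, the toy records and the variable choices (age, price, sex, payment periodicity) are assumptions made only for this example.

```python
def feature_ranges(rows, is_numeric):
    """Return max - min for each numeric column, None for categorical ones."""
    ranges = []
    for k, numeric in enumerate(is_numeric):
        if numeric:
            column = [row[k] for row in rows]
            ranges.append(max(column) - min(column))
        else:
            ranges.append(None)
    return ranges


def gower_distance(a, b, ranges):
    """Average per-feature dissimilarity between two records."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            # Categorical feature: 0 if the categories match, 1 otherwise.
            total += 0.0 if x == y else 1.0
        elif r > 0:
            # Numeric feature: absolute difference scaled by the column range.
            total += abs(x - y) / r
        # A constant numeric column (r == 0) contributes 0.
    return total / len(ranges)


def gower_matrix(rows, is_numeric):
    """Full pairwise Gower distance matrix as a list of lists."""
    ranges = feature_ranges(rows, is_numeric)
    return [[gower_distance(a, b, ranges) for b in rows] for a in rows]


# Hypothetical toy records: (age, yearly price, sex, payment periodicity).
customers = [
    (30, 200.0, "F", "annual"),
    (50, 400.0, "M", "annual"),
    (40, 300.0, "F", "monthly"),
]
D = gower_matrix(customers, is_numeric=[True, True, False, False])
```

A matrix of this kind is what a medoid-based method such as PAM takes as input, since it only needs pairwise dissimilarities rather than coordinates.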
External consultation of the thesis is not authorized.
Files in this item:
IlariaRoggero_SDS.pdf (not available)
Description: Final thesis upload
Size: 15.56 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/9198