Data Mining methods & Statistical analysis applied on Insurance Products

ROGGERO, ILARIA

2023/2024

Abstract

This master's thesis analyzes a dataset supplied by Banca di Asti, a bank located in Northern Italy, using the RStudio software. The final goal is to capture the main features of customers who take out two types of insurance products: health and home insurance. In the dataset, customers can be either individuals or companies/organizations. Health insurance is a new item in the catalogue; the discontinued health products are therefore included in the study and treated as a single continuous line with the newest one.

The First Chapter is an introduction that explains the bank's Management Data System and defines the CRISP-DM process (Objectives Setting, Data Understanding, Data Preparation, Pattern Mining & Model Building, Evaluation and Deployment). Since the work deals with AI, and more generally with Data Mining (DM), it also reports some critiques and points of attention regarding its use (Data Quality, Costs, Privacy, Human Rights, ...).

In the Second Chapter, Lloyd's algorithm is used to map the bank's branches geographically through the so-called Voronoi tessellation. The Data Collection section follows, highlighting the main characteristics of the dataset: the variables and their types, the number of observations and the initial problems. The chapter ends with the Data Cleaning steps, which address classical problems: type conversion, typo correction, data mismatches, replacement of NA (missing) values and handling of categorical variables. Here, the Extraction & Transformation technique is used to handle the demographic variables; some of these are not directly available in the historical archive, which is needed to describe customers around the contract signing date.

The Third Chapter covers the distributions of the data. It explores information such as the geographical spread of customers across the country and the frequencies of their demographic features (e.g. number of males and females, number of companies and individuals, most frequently chosen payment frequency, ...). The trend of contract prices over time is reported to reveal particular behaviors correlated with specific healthcare, social or political events.

The Fourth Chapter uses ANOVA to evaluate the effects of different control factors (e.g. sex, periodicity, region, ...) on price and age (continuous variables).

The Fifth Chapter deals with clustering methods and tackles the problem of mixed-type variables. Three combinations of techniques are used: Gower distance with PAM (or CLARA), FAMD (dimensionality reduction) with hierarchical clustering, and K-prototypes.

The Sixth and final Chapter addresses classification, with the aim of predicting which potential customers should be offered the products of interest. Classical algorithms are implemented.

This pilot project can be a starting point for future improvements aimed at increasing the quality of the results and exploring better and more suitable algorithms and techniques.
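
As a rough illustration of the mixed-variable problem mentioned for the Fifth Chapter, the sketch below computes a Gower distance between customers described by both numeric and categorical features. It is not taken from the thesis, which works in R; the function names, the feature layout (age, yearly price, sex, payment periodicity) and the three customers are hypothetical and only serve to show the idea.

```python
# Minimal Gower distance sketch for mixed numeric/categorical records.
# All names and data here are illustrative assumptions, not the thesis code.

def numeric_ranges(rows, is_numeric):
    """Return the observed range of each numeric column, or None for categorical ones."""
    ranges = []
    for j, numeric in enumerate(is_numeric):
        if numeric:
            column = [row[j] for row in rows]
            ranges.append(max(column) - min(column))
        else:
            ranges.append(None)
    return ranges


def gower_distance(a, b, ranges):
    """Average per-feature dissimilarity between records a and b.

    Numeric features contribute |x - y| / range (0 when the range is 0);
    categorical features contribute 0 if equal and 1 otherwise.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            total += 0.0 if x == y else 1.0
        elif r != 0:
            total += abs(x - y) / r
    return total / len(ranges)


# Hypothetical customers: (age, yearly price, sex, payment periodicity)
customers = [
    (30, 200.0, "F", "monthly"),
    (50, 400.0, "M", "monthly"),
    (70, 300.0, "F", "yearly"),
]
is_numeric = [True, True, False, False]
ranges = numeric_ranges(customers, is_numeric)

d01 = gower_distance(customers[0], customers[1], ranges)
d12 = gower_distance(customers[1], customers[2], ranges)
print(d01, d12)
```

A full matrix of such pairwise distances is the kind of input that a medoid-based method like PAM takes, which is why the two techniques are naturally paired.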

Brief record

	Faculty/Department
	
				MATEMATICA "GIUSEPPE PEANO"
			
	Degree programme
	
				STOCHASTICS AND DATA SCIENCE
			
	English title
	
				Data Mining methods & Statistical analysis applied on Insurance Products
			
	Supervisors
	
				POLATO, MIRKO
ZUCCA, CRISTINA
			
	Thesis access mode
	
				I do not authorize external consultation of the thesis
			
	Appears in types:
	
				Master's degree programme

Files in this item:

File	Size	Format
IlariaRoggero_SDS.pdf (not available; description: final thesis upload)	15.56 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/9198