Two predictive models for binary classification: logistic regression and Extreme Gradient Boosting. Theory, application and commercial consequences

TURTURRO, FRANCESCO
2018/2019

Abstract

One of the main objectives of any company is to attract as many customers as possible. Equally important is the loyalty strategy it must implement to retain them. Because companies operate with scarce resources, it is useful to identify the customers most likely not to renew their existing contracts, an event called churn, and the reasons that make this event likely. Once these customers and the characteristics that make them risky have been identified, it is the responsibility of the company's management to devise and implement solutions. This business problem translates directly into a statistical prediction problem: given a customer observed in a specific period of time and described by a set of features, estimate the probability of the event "this customer will not renew existing contracts in the next period of time". The aim of this work is to describe how this problem was addressed in the specific case of Cerved, a company operating in the business information sector. Using a data set containing information on Cerved customers over the past four years, each labelled "churn" or "non churn", two models were estimated: the classic Logistic Regression and a Machine Learning model called Extreme Gradient Boosting, which aggregates decision trees whose structure is estimated iteratively. The theoretical details of the two models and the technical details of their implementation are reported, together with a thorough description of the data set used to estimate and validate them. The two models are compared on the accuracy ratio, the metric used in this project to measure predictive ability.
Given a new customer as input, the models return the estimated probability of churn and the features that most influenced this result; this makes it possible to direct company resources only to the customers that the model considers most risky and to act only on the most significant variables.
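
As an illustration of the prediction problem described above, the following minimal sketch fits a logistic regression to synthetic data and scores it with an accuracy ratio. It is not the thesis implementation: the two features, the synthetic labels, the learning rate, and all function names are hypothetical, and the accuracy ratio is assumed here to be the common definition 2 * AUC - 1, which the abstract does not confirm. Extreme Gradient Boosting is omitted to keep the example free of external libraries.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    # w[0] is the intercept; w[1:] are the feature weights.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.1, epochs=200):
    # Batch gradient descent on the average log-loss.
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def auc(scores, labels):
    # Share of (churner, non-churner) pairs ranked correctly; ties count 1/2.
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def accuracy_ratio(scores, labels):
    # Assumed definition: AR = 2 * AUC - 1 (1 = perfect ranking, 0 = random).
    return 2.0 * auc(scores, labels) - 1.0


# Synthetic customers with two hypothetical features; label 1 means "churn".
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(400)]
y = [1 if rng.random() < sigmoid(-1.0 + 2.0 * a - 1.5 * b) else 0 for a, b in X]

# Estimate on the first 300 customers, validate on the remaining 100.
w = fit_logistic(X[:300], y[:300])
val_scores = [predict_proba(w, x) for x in X[300:]]
ar = accuracy_ratio(val_scores, y[300:])
print("validation accuracy ratio:", round(ar, 3))
```

Because the accuracy ratio depends only on how customers are ranked, the same `accuracy_ratio` function could score the churn probabilities of any other model on the same validation set, which is what allows two different models to be compared on one metric.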
Files in this item:
File: Tesi Turturro.pdf (not available)
Description: Two predictive models for binary classification: Logistic Regression and Extreme Gradient Boosting. Theory, application and commercial consequences
Size: 572.68 kB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/1941