
Big Data Processing with Databricks ETL Pipelines

FRISULLO, SILVESTRO STEFANO
2022/2023

Abstract

In the current data-driven landscape, organizations face the challenge of efficiently processing and analyzing large volumes of data for analytics and machine learning purposes. This thesis focuses on the use of Databricks for constructing Extract, Transform, Load (ETL) pipelines that support the entire data processing workflow, from raw data ingestion to final-stage data quality processing. First, the concept of Big Data is introduced, highlighting its significance and impact in the present-day environment. Next, Hadoop, Spark, and Databricks are explored as powerful platforms for managing and processing big data; these frameworks are discussed in terms of their capabilities, functionalities, and relevance in addressing the complexities of large-scale data processing. The thesis then presents a practical application scenario that incorporates industry best practices. The focus lies on implementing data quality techniques for measuring custom Key Performance Indicators (KPIs) to ensure reliable data analysis. Additionally, a data observability solution is introduced that enables graphical monitoring of the ETL pipeline, providing real-time insights into its performance and health. Finally, a machine learning solution for data quality is presented as a future improvement to the project.
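As an illustration of the kind of custom data quality KPI the abstract refers to, the sketch below computes a column completeness ratio (the fraction of rows with a non-null value). The function name and sample records are hypothetical, and plain Python is used here for brevity; in the thesis setting, an equivalent check would run over Spark DataFrames on Databricks.

```python
# Illustrative sketch only: a simple "completeness" KPI, one common
# data quality metric. Names and data here are hypothetical examples,
# not taken from the thesis implementation.

def completeness_kpi(rows, column):
    """Return the fraction of rows whose `column` value is present (non-None)."""
    if not rows:
        return 0.0
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)

# Example: two of the four records have a usable "email" value.
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
    {"id": 4},  # missing key counts as absent
]

print(completeness_kpi(records, "email"))  # 0.5
```

A KPI like this can be computed per pipeline stage and compared against a threshold, so that a drop in completeness surfaces in the observability layer before downstream analysis consumes the data.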
Files in this record:
832813_bigdataprocessingwithdatabricksetlpipelines.pdf (Adobe PDF, 4.31 MB) — not available
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/145687