
Big Data Processing with Databricks ETL Pipelines

FRISULLO, SILVESTRO STEFANO
2022/2023

Abstract

In the current data-driven landscape, organizations face the challenge of efficiently processing and analyzing large volumes of data for analytics and machine learning purposes. This thesis focuses on the use of Databricks for constructing Extract, Transform, Load (ETL) pipelines that support the entire data processing workflow, from raw data ingestion to final-stage data quality processing. First, the concept of Big Data is introduced, highlighting its significance and impact in the present-day environment. Next, Hadoop, Spark, and Databricks are explored as powerful platforms for managing and processing big data; these frameworks are discussed in terms of their capabilities, functionalities, and relevance in addressing the complexities of large-scale data processing. The thesis then presents a practical application scenario that incorporates industry best practices. The focus lies on implementing data quality techniques for measuring custom Key Performance Indicators (KPIs) to ensure reliable data analysis. Additionally, a data observability solution is introduced that enables graphical monitoring of the ETL pipeline, providing real-time insights into its performance and health. Finally, a machine learning solution for data quality is presented as a future improvement to the project.
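As an illustration of the kind of custom data quality KPI the abstract refers to, the sketch below computes a column completeness ratio (the fraction of rows with a non-null value). The function name and sample records are hypothetical, and plain Python is used here for brevity; in the thesis setting, an equivalent check would run over Spark DataFrames on Databricks.

```python
# Illustrative sketch only: a simple "completeness" KPI, one common
# data quality metric. Names and data here are hypothetical examples,
# not taken from the thesis implementation.

def completeness_kpi(rows, column):
    """Return the fraction of rows whose `column` value is present (non-None)."""
    if not rows:
        return 0.0
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)

# Example: two of the four records have a usable "email" value.
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
    {"id": 4},  # missing key counts as absent
]

print(completeness_kpi(records, "email"))  # 0.5
```

A KPI like this can be computed per pipeline stage and compared against a threshold, so that a drop in completeness surfaces in the observability layer before downstream analysis consumes the data.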
Files in this record:
832813_bigdataprocessingwithdatabricksetlpipelines.pdf (Adobe PDF, 4.31 MB) — not available
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/145687