Natural language processing for hierarchical and temporal topic modeling

Natural Language Processing (NLP) is a broad area of research addressing the automatic understanding of human language. It has many applications, covering both written text and spoken dialogue. In this thesis we focus on topic modeling, the NLP subfield that analyzes documents in order to discover the topics they cover. While there is a great deal of literature on basic algorithms, few studies cover more advanced methods able to detect hierarchical relationships among the extracted topics or track their evolution over time. Furthermore, hardly any studies integrate all these elements to provide an exhaustive exploration of a document corpus. To address this gap, we first conducted an extensive, structured literature review through which we identified the best-performing and most suitable algorithms. We then used them to develop a complete experimental framework able to extract topics from very large collections of documents, organize them into hierarchical structures, analyze their temporal dynamics, and present the results through an interactive, easy-to-use visual platform. To demonstrate the usefulness of such a system, we conducted a case study on a large dataset of news articles collected from the GDELT database, covering a period of more than two months. The final outputs are satisfactory overall, showing that our method is an effective tool for detecting trends and patterns in large digital news archives and for making them easy to understand for general users.

ANTONOZZI, LUDOVICA

2018/2019

Brief record

Faculty/Department: PHYSICS
Degree programme: PHYSICS OF COMPLEX SYSTEMS
Language: ENG
Supervisor: PANISSON, Andre'
Thesis access mode: IMPORTED FROM TESIONLINE
Appears in types: Master's degree programme

Files in this item:

File	Size	Format
874796_tesi_antonozzi_ludovica.pdf (not available; type: Other attached material)	7.52 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/51795