Natural language processing for hierarchical and temporal topic modeling

ANTONOZZI, LUDOVICA
2018/2019

Abstract

Natural Language Processing (NLP) is a broad area of research addressing the automatic understanding of human language. It has many applications, covering both written texts and spoken dialogues. In this thesis we focus on topic modeling, the NLP subfield that analyzes documents in order to discover the topics they cover. While there is a great deal of literature on basic algorithms, few studies cover more advanced methods able to identify hierarchical relationships among the extracted topics or to track their evolution over time. Furthermore, hardly any studies integrate all these elements to provide an exhaustive exploration of a document corpus. To address this gap, we first conducted an extended and structured literature analysis, through which we identified the best-performing and most suitable algorithms. We then used them to develop a complete experimental framework able to extract topics from very large collections of documents, organize them in hierarchical structures, analyze their temporal dynamics, and present the results through an interactive and easy-to-use visual platform. To demonstrate the usefulness of such a system, we conducted a case study on a large dataset of news articles collected from the GDELT database, covering a period of more than two months. The final outputs are satisfactory overall, showing that our method is an effective tool for detecting trends and patterns in large digital news archives and for making them easy for general users to understand.
ENG
Imported from TESIONLINE
Files in this record:
874796_tesi_antonozzi_ludovica.pdf (Adobe PDF, 7.52 MB; not available). Type: Other attached material

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/51795