Natural Language Processing (NLP) is a broad area of research addressing the automatic understanding of human language. It has many applications, concerning both written texts and spoken dialogues. In this thesis we focus on topic modeling, the NLP subfield that analyzes documents in order to discover the topics they cover. While there is a great deal of literature on basic algorithms, few studies cover more advanced methods able to detect hierarchical relationships among the extracted topics or track their evolution over time. Furthermore, hardly any studies integrate all these elements to provide an exhaustive exploration of a document corpus. To address this gap, we first conducted an extended and structured literature analysis, through which we identified the best-performing and most suitable algorithms. We then used them to develop a complete experimental framework able to extract topics from very large collections of documents, organize them in hierarchical structures, analyze their temporal dynamics, and present the results through an interactive and easy-to-use visual platform. To demonstrate the usefulness of such a system, we conducted a case study on a large dataset of news articles collected from the GDELT database, covering a period of more than two months. The final outputs are satisfactory overall, showing that our method is an effective tool for detecting trends and patterns in large digital news archives and for making them easy to understand for general users.
Natural language processing for hierarchical and temporal topic modeling
ANTONOZZI, LUDOVICA
2018/2019
File | Type | Size | Format |
---|---|---|---|
874796_tesi_antonozzi_ludovica.pdf (not available) | Other attached material | 7.52 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14240/51795