Development of Metrics to Support Diachronic Semantic Analysis

Diachrony is the study of linguistic phenomena as they develop over time, that is, the set of changes that a language or a linguistic element undergoes. The goal of this project is to carry out a diachronic study of the Italian language using Natural Language Processing, and to use the results to assign a recentness value to individual words and, possibly, to an entire document. The project relies chiefly on PyTrends, a Python library that provides programmatic access to the data collected by Google Trends. These data indicate how frequently a given word or phrase is searched on the web. We first chose a set of topics and, using a Python library, looked each of them up on Wikipedia. Each search produced one document per topic, which serves as the starting point for the subsequent steps. Each document then went through a text-processing phase in which automatic language-processing techniques were applied, in particular grammatical and lexical analysis and information-extraction techniques such as Named Entity Recognition. This phase, carried out with the NLTK and spaCy libraries, selected the words of greatest interest for the study. The words that passed this phase were then queried on Google Trends to obtain their search-frequency data, which were processed to associate a recentness value with each word. Based on the values obtained for all the words in a document, a recentness value for the topic can then be derived. The results for the various topics can be used for analysis and for automatic recommendation systems based on recentness, for example to suggest either the most recent topics or the least searched ones.
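The abstract does not state how a recentness value is computed from the search-frequency data, so the following is only a minimal sketch under stated assumptions. It assumes each word comes with a chronologically ordered series of search-interest values (for instance, the kind of series PyTrends returns from `interest_over_time()`), and it uses a hypothetical scoring rule: the normalized temporal centre of mass of that series. The function names `recentness` and `document_recentness`, the averaging step, and the toy data are all illustrative and are not taken from the project.

```python
from statistics import mean
from typing import Mapping, Optional, Sequence


def recentness(interest: Sequence[float]) -> Optional[float]:
    """Normalized temporal centre of mass of a search-interest series.

    Hypothetical scoring rule (an assumption, not the project's method):
    returns a value in [0, 1], where 0 means all interest falls in the
    first period and 1 means all interest falls in the last period.
    """
    total = sum(interest)
    if len(interest) < 2 or total == 0:
        return None  # not enough data to place the word in time
    # Weight each period index by the interest observed in that period.
    weighted = sum(i * v for i, v in enumerate(interest))
    return weighted / (total * (len(interest) - 1))


def document_recentness(
    series_by_word: Mapping[str, Sequence[float]]
) -> Optional[float]:
    """Average the per-word scores, skipping words without usable data."""
    scores = [
        score
        for score in (recentness(s) for s in series_by_word.values())
        if score is not None
    ]
    return mean(scores) if scores else None


if __name__ == "__main__":
    # Toy series invented for illustration only; not real Google Trends data.
    toy_document = {
        "word_a": [0, 10, 40, 100],   # interest concentrated in recent periods
        "word_b": [100, 60, 20, 5],   # interest concentrated in early periods
        "word_c": [0, 0, 0, 0],       # no search interest, ignored
    }
    for word, series in toy_document.items():
        print(word, recentness(series))
    print("document", document_recentness(toy_document))
```

A plain average treats every extracted word equally; weighting words by their frequency in the document, or by whether they are named entities, would be an equally plausible choice that the abstract neither confirms nor rules out.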
