Correction and Expansion of Abbreviations in Clinical Texts Using Language Models

This thesis is part of the Circular Health For Industry (CH4I) project, which aims to collect, manage, and analyze data in order to make predictions of medical interest. The project seeks to introduce Artificial Intelligence technologies to improve the organization of sectors such as "Human Healthcare, Animal Welfare and Agrifood Safety", in support of sustainable development and progress. Within this context, the thesis uses Natural Language Processing (NLP) techniques to identify and then extract useful information from the main source of information available in the clinical setting: text. Clinical texts are noisy, with numerous errors and non-standard abbreviations, which makes them difficult to analyze. For this reason, Language Models (LMs) suitably trained on this type of text were used to interpret the documents by identifying the acronyms and errors they contain. In the first of the two tasks addressed, acronym expansion, the model scans each input sentence word by word and, where necessary, expands the acronyms it encounters according to the context. This step relies on an acronym dictionary that lists the possible expansions of each acronym. The second task, correction, works similarly: once an error has been identified, the model evaluates a set of candidate corrections. In this case, however, no dictionary is available to the model. The results show a general improvement when an LM adapted to the clinical setting is applied, although with some limitations in the presence of real-word errors and uncommon abbreviations.
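
The two tasks described above can be illustrated with a minimal sketch. It is not the thesis implementation: the acronym dictionary, the toy corpus, and the function names (`build_bigram_scorer`, `expand_acronyms`, `choose_correction`) are all hypothetical, and a simple bigram-count scorer stands in for the clinically adapted LM. How candidate corrections are generated is not specified in the abstract, so here they are simply passed in.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical acronym dictionary (illustrative entries, not from the thesis).
ACRONYM_DICT: Dict[str, List[str]] = {
    "ra": ["rheumatoid arthritis", "right atrium"],
    "pt": ["physical therapy", "prothrombin time"],
}

# Tiny stand-in corpus; a real system would use a clinically adapted LM instead.
TOY_CORPUS = [
    "patient with rheumatoid arthritis in both hands",
    "thrombus seen in the right atrium",
    "prothrombin time was prolonged",
    "referred to physical therapy after surgery",
]

Scorer = Callable[[List[str]], int]


def build_bigram_scorer(corpus: List[str]) -> Scorer:
    """Return a function that scores a token list by summed bigram counts."""
    counts: Counter = Counter()
    for line in corpus:
        toks = line.split()
        counts.update(zip(toks, toks[1:]))

    def score(tokens: List[str]) -> int:
        return sum(counts[bigram] for bigram in zip(tokens, tokens[1:]))

    return score


def expand_acronyms(sentence: str, acronyms: Dict[str, List[str]], score: Scorer) -> str:
    """Scan the sentence word by word and expand known acronyms by context.

    The sentence is lowercased; each acronym is replaced by the dictionary
    expansion that gives the highest-scoring sentence.
    """
    tokens = sentence.lower().split()
    out: List[str] = []
    for i, tok in enumerate(tokens):
        candidates = acronyms.get(tok)
        if not candidates:
            out.append(tok)
            continue
        right_context = tokens[i + 1:]
        best = max(candidates, key=lambda c: score(out + c.split() + right_context))
        out.extend(best.split())
    return " ".join(out)


def choose_correction(tokens: List[str], index: int, candidates: List[str], score: Scorer) -> str:
    """Pick the candidate correction that best fits the surrounding context."""
    return max(
        candidates,
        key=lambda c: score(tokens[:index] + [c] + tokens[index + 1:]),
    )


if __name__ == "__main__":
    scorer = build_bigram_scorer(TOY_CORPUS)
    print(expand_acronyms("thrombus seen in the RA", ACRONYM_DICT, scorer))
    print(choose_correction("prothrombin tme was prolonged".split(), 1, ["time", "tome"], scorer))
```

In this sketch both tasks share the same pattern: build the candidate sentences, score each in context, and keep the best one. Only the source of candidates differs (a dictionary lookup for acronyms, an externally supplied list for corrections).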
