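Analisi lessicale e sintattica per l'indicizzazione, la creazione di descrittori semantici e la ricerca in collezioni di documenti testuali [Lexical and syntactic analysis for indexing, creating semantic descriptors, and searching collections of text documents]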

COLLA, DAVIDE
2016/2017

Abstract

In my thesis I focus on keyword extraction. I propose a novel method in which keywords are extracted from text documents by considering the extent to which the concepts underlying document terms are relevant to the concepts in the title. Although the proposed approach makes some assumptions that do not suit certain document genres (such as novels), it requires neither retraining classifiers nor building support corpora, and it is domain-independent. More importantly, my approach is based on lexical semantics, which to the best of my knowledge had not been attempted before. Title-body conceptual centrality is investigated as the chief factor for extracting keywords. I propose five metrics, different in nature, to compute the centrality of concepts in the document body with respect to those in the title. I report on an experiment over a popular dataset of human-annotated news articles; the results confirm the soundness of the hypothesis. I also outline an application for semantic browsing that relies on the keyword extraction system. The preliminary evaluation corroborates the hypothesis that keywords can be helpful in this sort of task as well.
ENG
Imported from TESIONLINE
Files in this item:
File: 745852_tesi_magistrale_davide_colla.pdf (not available)
Type: Other attached material
Size: 542.37 kB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/52139