In my thesis I focus on keyword extraction and propose a novel method for it: keywords can be extracted from text documents by considering how relevant the concepts underlying document terms are to the concepts in the title. Although the proposed approach makes some assumptions that do not fit certain document genres (such as novels), it requires neither retraining classifiers nor building support corpora, and it is domain-independent. Even more importantly, my approach is based on lexical semantics, which to the best of my knowledge had never been attempted before. Title-body conceptual centrality is investigated as the chief factor for extracting keywords. I propose five metrics, different in essence, to compute the centrality of concepts in the document body with respect to those in the title. I report on an experiment over a popular dataset of human-annotated news articles; the results confirm the soundness of my hypothesis. I also outline an application for semantic browsing that relies on the keyword extraction system. The preliminary evaluation corroborates the hypothesis that keywords can also be helpful in this sort of task.
Lexical and syntactic analysis for indexing, the creation of semantic descriptors, and search in collections of text documents
COLLA, DAVIDE
2016/2017
File | Type | Size | Format | Availability |
---|---|---|---|---|
745852_tesi_magistrale_davide_colla.pdf | Other attached material | 542.37 kB | Adobe PDF | Not available |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14240/52139