Lexical and syntactic analysis for indexing, creating semantic descriptors, and searching collections of textual documents

In my thesis I focus on the problem of keyword extraction and propose a novel method for it: keywords can be extracted from text documents by considering how far the concepts underlying document terms are relevant to the concepts in the title. Although the proposed approach makes some assumptions that do not fit certain document genres (such as novels), it requires neither retraining classifiers nor building support corpora, and it is domain-independent. Even more importantly, my approach is based on lexical semantics, which, to the best of my knowledge, had never been attempted before. Title-body conceptual centrality is investigated as the chief factor for extracting keywords. I propose five metrics, different in essence, to compute the centrality of concepts in the document body with respect to those in the title. I report on an experiment on a popular dataset of human-annotated news articles; the results confirm the soundness of my hypothesis. I also outline an application for semantic browsing that relies on the keyword extraction system. The preliminary evaluation corroborates the hypothesis that keywords can also be helpful in this sort of task.
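The general idea of scoring body terms by their relevance to the title can be illustrated with a minimal sketch. It is not one of the five metrics of the thesis, which the abstract does not specify: the cosine similarity, the `rank_keywords` helper, and the hand-made `TOY_VECTORS` are all assumptions introduced for illustration, standing in for a real lexical-semantic resource.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_keywords(title_terms, body_terms, vectors, top_k=3):
    """Rank distinct body terms by their best similarity to any title term.

    Hypothetical helper: `vectors` maps a term to a concept vector and is
    an assumed stand-in for whatever semantic representation is available.
    """
    title_vecs = [vectors[t] for t in title_terms if t in vectors]
    scores = {}
    for term in set(body_terms):
        if term not in vectors or not title_vecs:
            continue  # terms without a representation cannot be scored
        scores[term] = max(cosine(vectors[term], tv) for tv in title_vecs)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Toy, hand-made concept vectors (illustrative values only).
TOY_VECTORS = {
    "election": [1.0, 0.0, 0.0],
    "vote": [0.9, 0.1, 0.0],
    "candidate": [0.8, 0.2, 0.0],
    "weather": [0.0, 0.0, 1.0],
    "yesterday": [0.0, 1.0, 0.0],
}

title = ["election"]
body = ["vote", "candidate", "weather", "yesterday", "vote"]
keywords = rank_keywords(title, body, TOY_VECTORS, top_k=2)
print(keywords)
```

Taking the maximum over title terms is only one possible aggregation; an average or a sum over title concepts would be equally plausible under these assumptions.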

COLLA, DAVIDE

2016/2017

Brief record

Faculty/Department: Computer Science (INFORMATICA)
Degree programme: Computer Science (INFORMATICA)
Language: ENG
Supervisor: RADICIONI, Daniele Paolo
Thesis access mode: imported from TesiOnline
Appears in types: Master's degree programme (Corso di Laurea Magistrale)

Files in this item:

File	Size	Format
745852_tesi_magistrale_davide_colla.pdf (not available; type: other attached material)	542.37 kB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/52139