Combining Contrastive Learning and Knowledge Graph Embeddings to develop medical word embeddings for the Italian language
AMORE BONDARENKO, DENYS
2021/2022
Abstract
Word embeddings play a crucial role in the success of modern Natural Language Processing applications. Although pre-trained models can be used out-of-the-box, they may not be precise enough for specific languages or domains, and fine-tuning may be necessary. This work aims to improve the embeddings available for the largely uncovered niche of the Italian biomedical domain by combining Contrastive Learning (CL) and Knowledge Graph Embedding (KGE). The primary objective is to enhance the representations of medical terms and to improve the accuracy of semantic similarity between them, which also serves as the evaluation task. To overcome the scarcity of medical texts and controlled vocabularies in Italian, a novel solution is developed by merging preexisting CL methods, such as multi-similarity loss, contextualization, and dynamic sampling, with KGEs, producing a new variant of the loss function. The results are highly encouraging: they show a significant increase in performance over the initial model while using considerably less data. This approach provides a promising direction for developing efficient embeddings in low-resource languages and domains.
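For context, the multi-similarity loss the abstract refers to is, in its standard form (Wang et al., 2019), the batch-level objective sketched below. This is only the baseline formulation; the KGE-augmented variant introduced in the thesis is not reproduced here, and the symbols (S, P_i, N_i, alpha, beta, lambda) follow the original paper's notation rather than this work's.

```latex
% Standard multi-similarity (MS) loss over a batch of m anchors.
% S_{ik}: cosine similarity between anchor i and sample k;
% \mathcal{P}_i / \mathcal{N}_i: positive / negative sets mined for anchor i;
% \alpha, \beta, \lambda: fixed hyperparameters.
\mathcal{L}_{\mathrm{MS}} =
  \frac{1}{m} \sum_{i=1}^{m} \left[
    \frac{1}{\alpha} \log\!\left( 1 + \sum_{k \in \mathcal{P}_i} e^{-\alpha \left( S_{ik} - \lambda \right)} \right)
    + \frac{1}{\beta} \log\!\left( 1 + \sum_{k \in \mathcal{N}_i} e^{\beta \left( S_{ik} - \lambda \right)} \right)
  \right]
```

The per-anchor mining of the sets P_i and N_i is where the dynamic sampling mentioned in the abstract operates; how the KGE signal is folded into this objective is specific to the thesis and is described in the full text.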
| File | Type | Size | Format | Availability |
|---|---|---|---|---|
| 838657_amore_bondarenko_tesi_magistrale.pdf | Other attached material | 6.88 MB | Adobe PDF | Not available |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14240/51462