Recognizing Vulnerable Identities in Italian and Spanish News Headlines: A New Corpus for a Fine-Grained Analysis of Hate Speech

LONGO, ARIANNA
2023/2024

Abstract

This thesis presents the Vulnerable Identities Recognition Corpus (VIRC), a novel dataset for hate speech analysis consisting of 880 news headlines in Italian and Spanish, with a focus on racist content. The corpus employs a multi-layered annotation scheme that combines Named Entity Recognition with the identification of vulnerable identities, dangerous speech, and derogatory mentions, enabling a more nuanced understanding of how discriminatory discourse operates in news media. A comprehensive evaluation of inter-annotator agreement reveals strong consistency in identifying vulnerable identities while highlighting the difficulty of annotating certain forms of harmful content. Through detailed linguistic analysis, the study identifies four primary semantic frames through which discriminatory discourse is constructed; these frames show how news discourse systematically produces and perpetuates social vulnerability. The study also includes experimental evaluations of Large Language Models (T5 and BART) in zero-shot settings, assessing their ability to automatically detect the different types of annotated content. Based on these initial results, the corpus was expanded with additional data and annotators, following revised guidelines designed to address the challenges identified. The research concludes with critical reflections on the ethical implications of hate speech annotation, examining its impact on annotators and the broader responsibilities involved in developing computational linguistics resources for studying harmful content. This work contributes to hate speech detection by providing a valuable resource for Italian and Spanish and by introducing a novel annotation scheme that, by focusing on the targets of discriminatory discourse, captures the complex ways in which it operates in news media.
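The inter-annotator agreement mentioned in the abstract is typically quantified with a chance-corrected coefficient such as Cohen's kappa. The sketch below is only a minimal illustration of that idea, not the thesis's actual evaluation procedure: the annotators' label sequences and the tags VULN and DANG are invented for the example and do not reflect the corpus's real tagset or scores.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of items both annotators labeled identically
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each annotator's label distribution
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical token-level labels from two annotators (invented example)
ann1 = ["O", "VULN", "VULN", "O", "DANG", "O"]
ann2 = ["O", "VULN", "O",    "O", "DANG", "O"]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.714
```

A value near 1 indicates agreement well beyond chance, while values near 0 mean the annotators agree no more often than random labeling would predict, which is why kappa is preferred over raw percent agreement for skewed label distributions.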
External consultation of the thesis is authorized.
Files in this item:
Tesi_finale_Arianna_Longo.pdf (Adobe PDF, 930.62 kB, not available for download)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14240/165840