Extending the Italian learner treebank Valico-UD: annotation and analysis of texts written by Russian native speakers
VITULLO, RAISSA PIA
2022/2023
Abstract
The present study describes the linguistic annotation (i.e. the process of adding descriptive information to raw language data) of a learner corpus comprising Italian texts written by Russian learners. The corpus, annotated in Universal Dependencies format, aims to enrich the VALICO-UD treebank, a learner Italian resource developed at the University of Turin. All the texts used in this study, like those included in VALICO-UD, are drawn from VALICO, a large collection of non-native Italian texts elicited by comic strips. VALICO-UD currently includes only part of the texts collected in VALICO, and none authored by Russian native speakers: it consists of 237 texts authored by English, French, Spanish, and German native speakers and 237 corresponding corrected texts. All of these texts, both learner and corrected versions, are automatically annotated; in addition, a subcorpus of 36 learner texts and their corrected versions features error annotation and manually revised linguistic annotation. To fill this gap, we have built the novel corpus described in this dissertation, which includes 6 productions (66 learner sentences) written by Russian learners at varying proficiency levels. In line with VALICO-UD, each learner sentence in this corpus is paired with a target hypothesis (a corrected equivalent produced by native Italian speakers). The target hypotheses form a separate corpus that is sentence-aligned with the learner corpus. The linguistic annotation is applied to both corpora (i.e. the one containing the learner sentences and the one containing the corresponding target hypotheses) and combines two steps. First, automatic annotation is performed by the UDPipe parser, a software tool for the automatic analysis of texts. Second, the parser's output is manually corrected using the INCEpTION platform, a text annotation tool.
The analysis of the annotated corpus reveals specific error patterns characteristic of Russian learners of written Italian. These patterns are documented through real examples, taken from the same corpus, of the learners' interlanguage (the linguistic system that learners develop while acquiring a second or foreign language).
File | Size | Format |
---|---|---|
1035770_tesi_vitullo-defi.pdf (not available). Type: Other attached material | 3.49 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14240/107395