Extension of the ELEXIS-WSD Parallel Sense-Annotated Corpus Within UniDive: New Languages and Layers
Keywords: multilingual parallel corpus, sense annotations, corpus extension
Working Group: WG1: Corpus annotation, WG2: Lexicon-corpus interface
WG1 Tasks: Task 1.2 on MWE annotation guidelines and UD-PARSEME unification
Abstract: Within UniDive, Task 2.2 (Design of a lexicon-corpus interface) of WG2 involves the development and upgrade of the ELEXIS-WSD Parallel Sense-Annotated Corpus (ELEXIS-WSD for short), a small-scale parallel corpus whose language subcorpora contain translations of the same sentences in multiple languages. The subcorpora feature manual tokenization, lemmatization, morphosyntactic tagging, and, most importantly, manually assigned sense annotations from a lexicographic or lexical resource (such as a monolingual dictionary or a wordnet), which link the corpus directly to the lexicon. ELEXIS-WSD is a useful resource for word-sense disambiguation tasks and cross-lingual comparisons. The latest version before the introduction of UniDive was 1.1, with subcorpora for 10 languages: Bulgarian, Danish, Dutch, English, Estonian, Hungarian, Italian, Portuguese, Slovene, and Spanish.
In this paper, we present the results of the activities of Task 2.2 of UniDive, which culminated in the publication of version 2.0 of ELEXIS-WSD in April 2026. The main goals included (1) the introduction of new languages to the corpus and (2) the addition of new annotation layers.
WG2 Tasks: Task 2.2: Design of a lexicon-corpus interface
Tracks For Type Of Contribution: Complete work (including previously published work)
Do You Need Visa To Attend The 4th UniDive General Meeting In Romania: No
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 37