Keywords: Multiword Expressions, MWE annotation, Dependency-based projection, Universal Dependencies, Lexicon–corpus interface, French as a Foreign Language
Working Group: WG1: Corpus annotation, WG2: Lexicon-corpus interface
WG1 Tasks: Task 1.2 on MWE annotation guidelines and UD-PARSEME unification, Task 1.6: Identification and Annotation of MWEs in corpus languages
Abstract: This paper presents a multi-step workflow for the annotation and projection of multiword expressions (MWEs) in learner-oriented corpora, within the framework of the UniDive project. The automatic identification of MWEs remains a major challenge due to their syntactic variability, morphosyntactic flexibility, and semantic opacity. While large annotated resources such as PARSEME and UniDive have advanced the field, manual annotation remains costly and difficult to scale. In response, this work proposes a syntax-based projection method that bridges lexicon- and corpus-based approaches, enabling the transfer of controlled annotations into contextualized data.
The study is based on a corpus of approximately 584,000 words, compiled from 40 pedagogical resources for French as a Foreign Language (FFL), including textbooks and assessment materials. This corpus provides a structured representation of learner-oriented input. The approach relies on a CEFR-graded lexicon of MWEs, which was manually annotated according to a typology designed from a learner-centered perspective. Unlike existing frameworks that prioritize morphosyntactic criteria, this typology distinguishes idiomatic expressions, opaque collocations, and transparent collocations based on semantic compositionality and learnability. To ensure annotation reliability, detailed guidelines were developed in the form of decision trees, inspired by the PARSEME framework but adapted to the specific objectives of language learning. Annotators apply a sequence of explicit linguistic tests addressing semantic compositionality, lexical substitutability, morphosyntactic flexibility, internal modification, and mechanisms of semantic opacity such as metaphor and metonymy. This structured process ensures that annotation decisions are explicit, reproducible, and comparable across annotators. Annotations were then consolidated through a majority voting procedure, involving three to ten annotators per expression, with expert adjudication in case of disagreement. This results in a stable and validated annotation layer used as a basis for projection.
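As an illustration of the consolidation step, the following minimal sketch applies a majority vote to the labels given to one expression and flags the expression for expert adjudication when no label wins. The function name `consolidate`, the label strings, and the strict-majority threshold are assumptions made for this example; the abstract does not specify how disagreement is defined.

```python
from collections import Counter

# Hypothetical label names for the three categories of the typology.
LABELS = {"idiom", "opaque_collocation", "transparent_collocation"}


def consolidate(labels, min_annotators=3, max_annotators=10):
    """Return (label, needs_adjudication) for one lexicon expression.

    Assumption: a label is accepted only if more than half of the
    annotators chose it; otherwise the case goes to an expert.
    """
    if not min_annotators <= len(labels) <= max_annotators:
        raise ValueError("expected between 3 and 10 annotations")
    unknown = set(labels) - LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    top_label, top_count = Counter(labels).most_common(1)[0]
    if 2 * top_count > len(labels):
        return top_label, False
    # No strict majority: leave the decision to expert adjudication.
    return None, True


print(consolidate(["idiom", "idiom", "opaque_collocation"]))
print(consolidate(["idiom", "opaque_collocation", "transparent_collocation"]))
```

Under this assumption, a two-against-one vote is accepted directly, while a three-way split is returned without a label so that an expert can decide.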
The core contribution of the paper is a dependency-based projection method. Each lexicon entry is parsed with Stanza to obtain a Universal Dependencies (UD) representation, including lemmas, part-of-speech tags, morphological features, and dependency relations. These representations are abstracted into morphosyntactic patterns and dependency configurations, which are then matched against the parsed corpus. MWEs are thus identified not as fixed strings but as syntactic structures, which makes identification robust to inflectional variation, word order variation, and modifier insertion.
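To make the idea of matching dependency configurations concrete, here is a minimal sketch of subtree matching over UD-style tokens. It is not the implementation described in the paper: the `Token` type, the `find_mwe` function, and the hand-written parses below are assumptions for illustration, standing in for the output a UD parser such as Stanza would provide. Matching compares lemma, part of speech, and dependency relation, so inflected forms and inserted modifiers do not block a match.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int       # 1-based position in the sentence
    lemma: str
    upos: str
    head: int     # id of the governor, 0 for the root
    deprel: str


def children(tokens, head_id):
    return [t for t in tokens if t.head == head_id]


def bind(p_tok, s_tok, pattern, sentence):
    """Try to map the pattern subtree under p_tok onto the subtree under s_tok."""
    if (p_tok.lemma, p_tok.upos) != (s_tok.lemma, s_tok.upos):
        return None
    binding = {p_tok.id: s_tok.id}
    for p_child in children(pattern, p_tok.id):
        for s_child in children(sentence, s_tok.id):
            if s_child.deprel == p_child.deprel:
                sub = bind(p_child, s_child, pattern, sentence)
                if sub is not None:
                    binding.update(sub)
                    break
        else:
            # No sentence dependent matches this pattern dependent.
            return None
    return binding


def find_mwe(pattern, sentence):
    """Return the sorted token ids of every occurrence of the pattern."""
    root = next(t for t in pattern if t.head == 0)
    matches = []
    for s_tok in sentence:
        binding = bind(root, s_tok, pattern, sentence)
        if binding is not None:
            matches.append(sorted(binding.values()))
    return matches


# Lexicon entry "prendre la parole" (hand-written parse, for illustration).
pattern = [
    Token(1, "prendre", "VERB", 0, "root"),
    Token(2, "le", "DET", 3, "det"),
    Token(3, "parole", "NOUN", 1, "obj"),
]

# Corpus sentence "Elle prend souvent la parole ." with an inserted adverb.
sentence = [
    Token(1, "il", "PRON", 2, "nsubj"),
    Token(2, "prendre", "VERB", 0, "root"),
    Token(3, "souvent", "ADV", 2, "advmod"),
    Token(4, "le", "DET", 5, "det"),
    Token(5, "parole", "NOUN", 2, "obj"),
    Token(6, ".", "PUNCT", 2, "punct"),
]

print(find_mwe(pattern, sentence))  # [[2, 4, 5]]
```

The sketch binds each pattern dependent greedily to the first compatible sentence dependent and ignores morphological features; a fuller matcher would need backtracking and feature constraints.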
The resulting pre-annotated corpus is subsequently validated through a second annotation phase using the INCEpTION platform. Annotators verify projected instances, correct errors, and identify missing expressions. Preliminary evaluation shows that while recall remains limited (below 50%), precision is high (over 90%), indicating that the method provides reliable candidate annotations that significantly reduce manual effort. Newly identified MWEs are progressively integrated into the lexicon, enabling iterative resource enrichment. In addition, the annotated corpus supports the automatic attribution of CEFR levels to MWEs based on their occurrence in pedagogical materials. The assigned level corresponds to the earliest level at which the expression appears, providing insights into the progression of phraseological competence in language learning.
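The level attribution rule stated above (an expression receives the earliest level at which it occurs) can be sketched as follows. The function name `assign_levels` and the input format, a list of (expression, level) pairs taken from the annotated corpus, are assumptions made for this example.

```python
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]
RANK = {level: i for i, level in enumerate(CEFR_ORDER)}


def assign_levels(occurrences):
    """Map each MWE to the lowest CEFR level among its occurrences.

    `occurrences` is an iterable of (mwe, level) pairs, where `level`
    is the CEFR level of the resource in which the MWE was found.
    """
    earliest = {}
    for mwe, level in occurrences:
        if level not in RANK:
            raise ValueError(f"unknown CEFR level: {level}")
        if mwe not in earliest or RANK[level] < RANK[earliest[mwe]]:
            earliest[mwe] = level
    return earliest


occurrences = [
    ("prendre la parole", "B1"),
    ("avoir faim", "A1"),
    ("prendre la parole", "A2"),
]
print(assign_levels(occurrences))
```

In this toy input, "prendre la parole" is seen at B1 and A2 and is therefore assigned A2.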
Overall, this work proposes a scalable and linguistically grounded pipeline that combines controlled annotation, collective validation, and syntax-based projection. By structuring annotation through explicit decision procedures and representing MWEs as dependency patterns, the approach ensures both reliability and extensibility. It contributes to ongoing efforts within UniDive to integrate lexicon and corpus resources and opens perspectives for improving MWE detection, extending the methodology to other languages, and developing learner-oriented NLP applications.
WG2 Tasks: Task 2.2: Design of a lexicon-corpus interface
Tracks For Type Of Contribution: Work in progress
Do You Need Visa To Attend The 4th UniDive General Meeting In Romania: No
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 42