Optimizing Multilingual MWE Identification: From Morphologically-Filtered LLMs to Pure Transformer Architectures
Keywords: Multiword Expressions (MWE), PARSEME Shared Task, Large Language Models (LLM), Morphological Filtering, Transformer-CRF, BIESO+ Tagging, Cross-lingual Stability
Working Group: WG1: Corpus annotation, WG3: Multilingual and cross-lingual language technology
WG1 Tasks: Task 1.6: Identification and Annotation of MWEs in corpus languages
Abstract: This paper presents a hybrid approach to multilingual Multiword Expression (MWE) identification and outlines a transition from LLM-based prompting to structured Transformer architectures. Our current system combines Gemini 2.0 Flash-Lite with a filter based on Universal POS tags; in the PARSEME 2.0 Shared Task it achieves high precision and the highest Shannon evenness score across 17 languages. To address the limitations of this system, we propose a future framework based on a Transformer-CRF model that uses BIESO+ tagging and POS injection. This evolution aims to combine the cross-lingual fairness of generative models with the structural rigor of syntax-aware sequence labeling.
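For readers unfamiliar with the evenness measure mentioned in the abstract, the following is a minimal sketch of Shannon (Pielou) evenness, J = H / ln(N), where H is the Shannon entropy of a distribution over N categories. It assumes, purely for illustration, that the distribution is obtained by normalizing one non-negative score per language; the abstract does not state how the shared task derives the distribution, and the function name `shannon_evenness` and the example scores are hypothetical.

```python
import math


def shannon_evenness(scores):
    """Return Pielou's evenness J = H / ln(N) for non-negative scores.

    Assumption: each score (e.g. one per language) is normalized into a
    probability p_i = score_i / sum(scores). N counts all categories,
    including those with a zero score.
    """
    n = len(scores)
    total = sum(scores)
    if n < 2 or total <= 0:
        return 0.0  # evenness is undefined here; 0.0 is a convention
    # Shannon entropy H = -sum(p_i * ln p_i); zero scores contribute nothing.
    entropy = -sum((s / total) * math.log(s / total) for s in scores if s > 0)
    return entropy / math.log(n)


# Hypothetical per-language scores, not results from the paper.
balanced = [0.5] * 17
skewed = [0.9] + [0.05] * 16
print(shannon_evenness(balanced), shannon_evenness(skewed))
```

A value of 1 means the scores are spread perfectly evenly across languages, and lower values mean performance is concentrated in a few languages.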
WG3 Tasks: Task 3.4 Evaluation campaign: PARSEME 2.0: a multilingual shared task proposal on identification and paraphrasing of multiword expressions
Tracks For Type Of Contribution: Complete work (including previously published work)
Do You Need Visa To Attend The 4th UniDive General Meeting In Romania: No
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 66