Optimizing Multilingual MWE Identification: From Morphologically-Filtered LLMs to Pure Transformer Architectures

Irina Moise; Sergiu Nisioi

Published: 27 May 2026, Last Modified: 27 May 2026, Venue: UniDive 2026, License: CC BY-SA 4.0

Keywords: Multiword Expressions (MWE), PARSEME Shared Task, Large Language Models (LLM), Morphological Filtering, Transformer-CRF, BIESO+ Tagging, Cross-lingual Stability

Working Group: WG1: Corpus annotation, WG3: Multilingual and cross-lingual language technology

WG1 Tasks: Task 1.6: Identification and Annotation of MWEs in corpus languages

Abstract: This paper presents a hybrid approach to multilingual Multiword Expression (MWE) identification, transitioning from LLM-based prompting to structured Transformer architectures. Our current system utilizes Gemini 2.0 Flash-Lite combined with a Universal POS-based filter to achieve high precision and the highest Shannon evenness score across 17 languages in the PARSEME 2.0 Shared Task. To address existing limitations, we propose a future framework based on a Transformer-CRF model using BIESO+ tagging and POS injection. This evolution aims to combine the cross-lingual fairness of generative models with the structural rigor of syntax-aware sequence labeling.

WG3 Tasks: Task 3.4 Evaluation campaign: PARSEME 2.0: a multilingual shared task proposal on identification and paraphrasing of multiword expressions

Tracks For Type Of Contribution: Complete work (including previously published work)

Submission Number: 66
