Abstract: Multiword expressions (MWEs) refer to idiomatic sequences of multiple words.
MWE identification, i.e., detecting MWEs in text, can play a key role in downstream tasks such as machine translation.
Existing datasets for MWE identification are inconsistently annotated, limited to a single type of MWE, or limited in size.
To enable reliable and comprehensive evaluation, we created CoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences. To enhance data quality, it was constructed through a multi-step process consisting of human annotation, human review, and automated consistency checking.
Additionally, for the first time in a dataset of this kind, CoAM's MWEs are tagged with MWE types, such as Noun and Verb, enabling fine-grained error analysis.
Annotations for CoAM were collected using a new interface created with our interface generator, which allows easy and flexible annotation of MWEs in any form, including discontinuous ones.
Through experiments using CoAM, we find that a fine-tuned large language model outperforms the current state-of-the-art approach for MWE identification.
Furthermore, analysis using our MWE-type-tagged data reveals that Verb MWEs are easier to identify than Noun MWEs across approaches.
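
To make the task concrete, the following is a minimal sketch of how type-tagged, possibly discontinuous MWEs could be represented and scored. The abstract does not specify CoAM's data format or evaluation metric, so everything here is an assumption for illustration: the `MWE` class, the helper names, the exact-match criterion on token-index sets, and the toy sentence are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MWE:
    """Hypothetical record for one MWE; not CoAM's actual format."""
    sent_id: int
    token_ids: frozenset  # token indices; need not be contiguous
    mwe_type: str         # e.g. "Noun" or "Verb"


def make_mwe(sent_id, token_ids, mwe_type):
    return MWE(sent_id, frozenset(token_ids), mwe_type)


def span_key(m):
    # Two MWEs match when they cover exactly the same tokens of the same sentence.
    return (m.sent_id, m.token_ids)


def precision_recall_f1(gold, pred):
    """Exact-match MWE-level scores (an assumed metric)."""
    gold_spans = {span_key(m) for m in gold}
    pred_spans = {span_key(m) for m in pred}
    tp = len(gold_spans & pred_spans)
    p = tp / len(pred_spans) if pred_spans else 0.0
    r = tp / len(gold_spans) if gold_spans else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def recall_by_type(gold, pred):
    """Recall per gold MWE type, for fine-grained error analysis."""
    pred_spans = {span_key(m) for m in pred}
    hits, totals = defaultdict(int), defaultdict(int)
    for m in gold:
        totals[m.mwe_type] += 1
        if span_key(m) in pred_spans:
            hits[m.mwe_type] += 1
    return {t: hits[t] / totals[t] for t in totals}


# Toy sentence: He(0) picked(1) the(2) hot(3) dog(4) up(5)
gold = [
    make_mwe(0, {1, 5}, "Verb"),  # discontinuous: "picked ... up"
    make_mwe(0, {3, 4}, "Noun"),  # "hot dog"
]
pred = [
    make_mwe(0, {1, 5}, "Verb"),     # correct
    make_mwe(0, {2, 3, 4}, "Noun"),  # wrong boundary: "the hot dog"
]

scores = precision_recall_f1(gold, pred)
per_type = recall_by_type(gold, pred)
```

Storing token indices as a set, rather than a start and end offset, is what lets discontinuous expressions such as "picked ... up" be handled the same way as contiguous ones. Per-type scores are computed as recall over gold types here because a system's predictions may not carry type labels; that choice is also an assumption of this sketch.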
Paper Type: Long
Research Area: Semantics: Lexical and Sentence-Level
Research Area Keywords: multi-word expressions, lexical resources
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 5858