Keywords: benchmark, reasoning, semantics, multiword expression
Abstract: We present SemanticQA, a comprehensive benchmark suite for assessing language models (LMs) across ten semantic phrase (SP) processing tasks. Unlike prior benchmarks, it provides a unified evaluation setting that covers both general SPs, such as lexical collocations (LC), and three fine-grained categories: idiomatic expressions (IE), noun compounds (NC), and verbal constructions (VC). We systematically evaluate LMs of diverse architectures and scales on classification, extraction, and interpretation tasks. Our results reveal substantial performance variation, particularly on tasks requiring compositional semantic reasoning, highlighting differences in LMs' reasoning capabilities and semantic understanding. These findings provide actionable insights for developing LMs with stronger SP comprehension. The code, data, and models will be made publicly available upon completion of the review process.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, reasoning, semantics, few-shot QA
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 523