Keywords: benchmark, reasoning, semantics, multiword expression
Abstract: We present SemanticQA, a comprehensive benchmark suite for assessing language models (LMs) across ten semantic phrase (SP) processing tasks. Unlike prior benchmarks, it provides a unified evaluation setting that covers both general SPs, such as lexical collocations (LC), and three fine-grained categories: idiomatic expressions (IE), noun compounds (NC), and verbal constructions (VC). We systematically evaluate LMs of diverse architectures and scales on classification, extraction, and interpretation tasks. Our results reveal substantial performance variation, particularly on tasks requiring compositional semantic reasoning, highlighting differences in LMs' reasoning capabilities and semantic understanding. These findings provide actionable insights for developing LMs with stronger SP comprehension. The code, data, and models will be made publicly available upon completion of the review process.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, reasoning, semantics, few-shot QA
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 523