MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference

ACL ARR 2026 March Submission 2070 Authors

17 Mar 2026 (modified: 07 Jun 2026), ACL ARR 2026 March Submission, CC BY 4.0

Keywords: generalization; evaluation; contrast set; natural language inference

Abstract: With many benchmarks becoming saturated, it has become paramount to create new datasets that evaluate the generalization capacity of current state-of-the-art models in reasoning. However, designing new high-quality reasoning datasets is challenging: manual construction is costly, while automatic generation is unreliable or often yields synthetic data with limited scope. In this paper, we propose the Minimal Expression-Replacement GEneralization (MERGE) test, which evaluates the robustness of reasoning models against non-adversarial variants of existing evaluation datasets. We automatically obtain high-quality variants of the original instances with Minimal Expression REplacement (MERE) generation, which uses Masked Language Models (MLMs) and safeguarding filters. We apply the MERGE test to Natural Language Inference (NLI), a popular reasoning task. We generate new NLI datasets from two common existing ones with MERE generation and use them to evaluate multiple strong NLI models. The results indicate that both LLMs and fine-tuned NLI models generalize poorly: they struggle to consistently and correctly classify variants that differ only minimally from the original instances, both at the surface level and in terms of reasoning. We further analyze how certain aspects of variant generation, such as the word class and the source MLMs, affect model performance.
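As a rough illustration of the idea in the abstract, the sketch below builds minimal variants of one NLI instance by swapping a single word and then checks whether a classifier labels the original and all of its variants correctly. It is not the authors' implementation. Every name in it is hypothetical (`replace_word`, `make_variants`, `consistent_accuracy`, `toy_predict`), the candidate list is hard-coded where an MLM would supply suggestions, the only filter is "differs from the original word", and applying the same replacement to both premise and hypothesis is an assumption made for the example.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (premise, hypothesis)


def replace_word(sentence: str, target: str, replacement: str) -> str:
    """Replace whole-word occurrences of `target` (whitespace tokenization)."""
    return " ".join(replacement if tok == target else tok for tok in sentence.split())


def make_variants(pair: Pair, target: str, candidates: List[str]) -> List[Pair]:
    """Build one variant per candidate, skipping the original word.

    Assumption: the same replacement is applied to premise and hypothesis.
    In a real pipeline the candidates would come from a masked language
    model and pass further filters; here they are given by hand.
    """
    premise, hypothesis = pair
    return [
        (replace_word(premise, target, cand), replace_word(hypothesis, target, cand))
        for cand in candidates
        if cand != target
    ]


def consistent_accuracy(
    predict: Callable[[str, str], str],
    groups: List[Tuple[str, List[Pair]]],
) -> float:
    """Fraction of groups whose original and variants are ALL labeled correctly.

    Each group is (gold_label, [original_pair, variant_pair, ...]).
    """
    if not groups:
        return 0.0
    hits = sum(
        all(predict(p, h) == gold for p, h in pairs) for gold, pairs in groups
    )
    return hits / len(groups)


def toy_predict(premise: str, hypothesis: str) -> str:
    """Hypothetical stand-in classifier that breaks on one replacement."""
    return "neutral" if "puppy" in premise.split() else "entailment"


original: Pair = ("A dog runs in the park", "A dog is outside")
variants = make_variants(original, "dog", ["dog", "cat", "puppy"])
score = consistent_accuracy(toy_predict, [("entailment", [original] + variants)])
```

Scoring a group as correct only when every member is classified correctly is one way to capture "consistently and correctly"; a per-instance accuracy would hide the failure on the single broken variant in this toy case.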

Paper Type: Long

Research Area: Interpretability and Analysis of Models for NLP

Research Area Keywords: robustness; corpus creation; evaluation methodologies; textual entailment; natural language inference

Contribution Types: Model analysis & interpretability

Languages Studied: English

Submission Number: 2070