Keywords: generalization; evaluation; contrast set; natural language inference
Abstract: With many benchmarks becoming saturated, it has become paramount to create new datasets that evaluate the generalization capacity of current state-of-the-art reasoning models. However, designing new high-quality reasoning datasets is challenging: manual construction is costly, while automatic generation is unreliable or often yields synthetic data with limited scope. In this paper, we propose the Minimal Expression-Replacement GEneralization (MERGE) test, which evaluates the robustness of reasoning models against non-adversarial variants of existing evaluation datasets. We automatically obtain high-quality variants from the original instances with Minimal Expression REplacement (MERE) generation, which uses Masked Language Models (MLMs) and safeguarding filters. We apply the MERGE test to Natural Language Inference (NLI), a popular reasoning task. We generate new NLI datasets from two common existing ones with MERE generation and use them to evaluate multiple strong NLI models. The results indicate that both LLMs and fine-tuned NLI models generalize poorly: they struggle to consistently and correctly classify variants that differ only minimally from the originals, both at the surface level and in terms of reasoning. We also analyze how certain aspects of variant generation, such as the word class and the source MLMs, affect model performance.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: robustness; corpus creation; evaluation methodologies; textual entailment; natural language inference
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2070