Keywords: Instruction Following, Large Language Model
TL;DR: Inverse IFEval tests LLMs’ ability to override training bias and follow adversarial instructions, using 1,012 bilingual questions across 23 domains. It stresses adaptability beyond fluency and accuracy.
Abstract: Large Language Models (LLMs) achieve strong performance on diverse tasks but
often exhibit cognitive inertia, struggling to follow instructions that conflict with
the standardized patterns learned during supervised fine-tuning (SFT). To evaluate
this limitation, we propose Inverse IFEval, a benchmark that measures models’
Counter-intuitive Ability—their capacity to override training-induced biases and
comply with adversarial instructions. Inverse IFEval introduces eight types of
such challenges, including Question Correction, Intentional Textual Flaws, Code
without Comments, and Counterfactual Answering. Using a human-in-the-loop
pipeline, we construct a dataset of 1,012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on leading existing LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark.
Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability in unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.
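To make the evaluation setup concrete, the sketch below shows how a single Counterfactual Answering item might be scored with an LLM-as-a-Judge. It is a minimal illustration only: the prompts, model names, and PASS/FAIL rubric are placeholders assumed for this example, not the paper's actual protocol.

```python
# Minimal sketch of an LLM-as-a-Judge check for one Inverse IFEval-style item.
# Assumptions: model names, prompts, and the PASS/FAIL rubric are illustrative
# placeholders, not the benchmark's actual pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A Counterfactual Answering item: the model must follow the adversarial
# instruction even though it conflicts with patterns learned during SFT.
instruction = (
    "Answer the question with a deliberately false statement, "
    "then label it [COUNTERFACTUAL]."
)
question = "What is the boiling point of water at sea level?"

# 1) Query the model under test with the counter-intuitive instruction.
response = client.chat.completions.create(
    model="model-under-test",  # placeholder model name
    messages=[{"role": "user", "content": f"{instruction}\n\n{question}"}],
)
answer = response.choices[0].message.content

# 2) Ask a judge model whether the answer complied with the instruction,
#    grading compliance rather than factual accuracy.
judge_prompt = (
    "You are grading instruction compliance, not factual accuracy.\n"
    f"Instruction: {instruction}\nQuestion: {question}\nAnswer: {answer}\n"
    "Reply PASS if the answer followed the instruction, otherwise FAIL."
)
verdict = client.chat.completions.create(
    model="judge-model",  # placeholder judge model name
    messages=[{"role": "user", "content": judge_prompt}],
).choices[0].message.content.strip()

print(f"Judge verdict: {verdict}")
```

The key design point this illustrates is that the judge scores adherence to the adversarial instruction itself, so a factually correct answer would fail this item.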
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 19376