One Instruction Is a Benchmark: End-to-End Instruction-Following Evaluation with FlexBench

16 Sept 2025 (modified: 12 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Language Models, Instruction Following, Automated Evaluation, Benchmark Construction
Abstract: As Large Language Models (LLMs) grow increasingly capable and accept ever longer, more nuanced instructions, they exhibit wide variability in instruction following across prompts. Fixed, general-purpose benchmarks therefore fail to reflect real deployment performance, where evaluation must be tailored to the instruction at hand. We present FlexBench, a self-evolving framework that automatically constructs a specialized instruction-following benchmark, comprising a set of evaluation dimensions and a conversation corpus, from only a single instruction. For evaluation, we also introduce FlexEval, which aggregates tri-valued (yes/no/unknown) per-dimension decisions into instruction-level metrics that jointly capture workflow progress (Coverage) and conditional correctness (Achievement). Together, FlexBench and FlexEval establish a fully automated paradigm for customized benchmark construction: instruction-specialized evaluations that adapt to arbitrary task requirements and deliver fine-grained, reproducible judgments of instruction following, turning open-ended instructions into end-to-end evaluations. We validate our framework on 248 single-turn complex instructions, and further conduct extensive experiments on 10 leading LLMs across three multi-turn conversational scenarios with complex instructions. Our results show that FlexBench and FlexEval deliver instruction-specialized assessments and provide actionable insights for improving LLM instruction following.
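To make the aggregation concrete, here is a minimal Python sketch of how tri-valued per-dimension decisions could be rolled up into instruction-level Coverage and Achievement scores. The exact formulas below are an assumption for illustration (Coverage as the fraction of dimensions with a definite judgment, Achievement as the fraction of "yes" among those); the paper defines the actual aggregation used by FlexEval.

```python
from typing import Dict, Literal

Decision = Literal["yes", "no", "unknown"]

def flex_eval_sketch(decisions: Dict[str, Decision]) -> Dict[str, float]:
    """Aggregate tri-valued per-dimension decisions into instruction-level metrics.

    Assumed formulas (not from the paper):
      - Coverage: fraction of dimensions with a definite judgment (yes/no),
        a proxy for how far the response progressed through the workflow.
      - Achievement: among the dimensions reached, the fraction judged "yes",
        i.e. conditional correctness.
    """
    total = len(decisions)
    reached = [d for d in decisions.values() if d != "unknown"]
    coverage = len(reached) / total if total else 0.0
    achievement = reached.count("yes") / len(reached) if reached else 0.0
    return {"coverage": coverage, "achievement": achievement}

# Example with four hypothetical evaluation dimensions derived from one instruction
print(flex_eval_sketch({
    "uses_requested_format": "yes",
    "cites_all_sources": "no",
    "addresses_follow_up": "yes",
    "final_summary_present": "unknown",
}))
# -> {'coverage': 0.75, 'achievement': 0.666...}
```

Under this reading, "unknown" decisions lower Coverage without affecting Achievement, so the two metrics separate how much of the instruction's workflow was reached from how well the reached parts were satisfied.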
Primary Area: datasets and benchmarks
Submission Number: 6960