One Instruction Is a Benchmark: End-to-End Instruction-Following Evaluation with FlexBench

16 Sept 2025 (modified: 12 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Language Models, Instruction Following, Automated Evaluation, Benchmark Construction
Abstract: As Large Language Models (LLMs) grow increasingly capable and accept ever longer, more nuanced instructions, they exhibit wide variability in instruction following across prompts. Fixed, general-purpose benchmarks therefore fail to reflect real deployment performance, where evaluation must be tailored to the instruction at hand. We present FlexBench, a self-evolving framework that automatically constructs a specialized instruction-following benchmark, comprising a set of evaluation dimensions and a conversation corpus, from only a single instruction. For evaluation, we also introduce FlexEval, which aggregates tri-valued (yes/no/unknown) per-dimension decisions into instruction-level metrics that jointly capture workflow progress (Coverage) and conditional correctness (Achievement). Together, FlexBench and FlexEval establish a fully automated paradigm for customized benchmark construction: instruction-specialized evaluations that adapt to arbitrary task requirements and deliver fine-grained, reproducible judgments of instruction following, turning open-ended instructions into end-to-end evaluations. We validate our framework on 248 single-turn complex instructions, and further conduct extensive experiments on 10 leading LLMs across three multi-turn conversational scenarios with complex instructions. Our results show that FlexBench and FlexEval deliver instruction-specialized assessments and provide actionable insights for improving LLM instruction following.
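To make the aggregation concrete, here is a minimal Python sketch of how tri-valued per-dimension decisions could be rolled up into instruction-level Coverage and Achievement scores. The exact formulas below are an assumption for illustration (Coverage as the fraction of dimensions with a definite judgment, Achievement as the fraction of "yes" among those); the paper defines the actual aggregation used by FlexEval.

```python
from typing import Dict, Literal

Decision = Literal["yes", "no", "unknown"]

def flex_eval_sketch(decisions: Dict[str, Decision]) -> Dict[str, float]:
    """Aggregate tri-valued per-dimension decisions into instruction-level metrics.

    Assumed formulas (not from the paper):
      - Coverage: fraction of dimensions with a definite judgment (yes/no),
        a proxy for how far the response progressed through the workflow.
      - Achievement: among the dimensions reached, the fraction judged "yes",
        i.e. conditional correctness.
    """
    total = len(decisions)
    reached = [d for d in decisions.values() if d != "unknown"]
    coverage = len(reached) / total if total else 0.0
    achievement = reached.count("yes") / len(reached) if reached else 0.0
    return {"coverage": coverage, "achievement": achievement}

# Example with four hypothetical evaluation dimensions derived from one instruction
print(flex_eval_sketch({
    "uses_requested_format": "yes",
    "cites_all_sources": "no",
    "addresses_follow_up": "yes",
    "final_summary_present": "unknown",
}))
# -> {'coverage': 0.75, 'achievement': 0.666...}
```

Under this reading, "unknown" decisions lower Coverage without affecting Achievement, so the two metrics separate how much of the instruction's workflow was reached from how well the reached parts were satisfied.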
Primary Area: datasets and benchmarks
Submission Number: 6960