Keywords: Large Language Models, Instruction Following, Automated Evaluation, Benchmark Construction
Abstract: Real-world deployments of large language models (LLMs) increasingly depend on long, precise, and carefully constructed user instructions, such as standard operating procedures (SOPs), state machines, and multi-step workflows. In this setting, aggregate leaderboard performance is often a weak proxy for a model’s ability to faithfully execute any particular instruction that a practitioner cares about. To address this gap, we present FlexBench, an on-demand framework that automatically transforms a single seed instruction into an instruction-specialized benchmark. FlexBench (i) decomposes the instruction into a set of verifiable evaluation dimensions by treating it as a collection of checkable clauses, and (ii) generates a conversation corpus using a leakage-resistant user simulator. We further introduce FlexEval, which maps per-dimension, tri-valued judgments (yes/no/unknown) into instruction-level metrics that reflect workflow progress (Coverage) and conditional correctness (Achievement). Together, FlexBench and FlexEval provide a fully automated pipeline for instruction-specialized evaluation, enabling practitioners to build a benchmark tailored to their own instruction and to obtain reproducible, fine-grained diagnostics of instruction following. We validate the framework on 248 complex single-turn instructions and conduct extensive experiments across 10 leading LLMs in three multi-turn conversational scenarios with long, branching instructions. Our source code is publicly available at https://anonymous.4open.science/r/FlexBench-E0D7.
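The abstract's mapping from tri-valued per-dimension judgments to Coverage and Achievement could, under one plausible reading, look like the following. This is a minimal sketch only: the function names and the exact aggregation (treating "unknown" as an unreached dimension, and scoring correctness over reached dimensions) are assumptions for illustration, not the paper's actual definitions.

```python
# Hypothetical sketch of FlexEval-style metrics (aggregation rules assumed,
# not taken from the paper).
def coverage(judgments):
    """Workflow progress: fraction of dimensions reached (judged yes or no)."""
    reached = [j for j in judgments if j in ("yes", "no")]
    return len(reached) / len(judgments)

def achievement(judgments):
    """Conditional correctness: fraction of reached dimensions judged yes."""
    reached = [j for j in judgments if j in ("yes", "no")]
    return sum(j == "yes" for j in reached) / len(reached) if reached else 0.0

judgments = ["yes", "yes", "no", "unknown"]
print(coverage(judgments))     # 0.75
print(achievement(judgments))  # 0.666...
```

Separating the two scores keeps "the model never got to this clause" (lower Coverage) distinct from "the model reached it and got it wrong" (lower Achievement).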
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Dialogue and Interactive Systems, Resources and Evaluation, Generation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 5232