Keywords: Large Language Models, Instruction Following, Automated Evaluation, Benchmark Construction
Abstract: Real-world deployments of large language models (LLMs) increasingly depend on long, precise, and carefully constructed user instructions, such as standard operating procedures (SOPs), state machines, and multi-step workflows. In this setting, aggregate leaderboard performance is often a weak proxy for a model’s ability to faithfully execute any particular instruction that a practitioner cares about. To address this gap, we present FlexBench, an on-demand framework that automatically transforms a single seed instruction into an instruction-specialized benchmark. FlexBench (i) decomposes the instruction into a set of verifiable evaluation dimensions by treating it as a collection of checkable clauses, and (ii) generates a conversation corpus using a leakage-resistant user simulator. We further introduce FlexEval, which maps per-dimension, tri-valued judgments (yes/no/unknown) into instruction-level metrics that reflect workflow progress (Coverage) and conditional correctness (Achievement). Together, FlexBench and FlexEval provide a fully automated pipeline for instruction-specialized evaluation, enabling practitioners to build a benchmark tailored to their own instruction and to obtain reproducible, fine-grained diagnostics of instruction following. We validate the framework on 248 complex single-turn instructions and conduct extensive experiments across 10 leading LLMs in three multi-turn conversational scenarios with long, branching instructions. Our source code is publicly available at https://anonymous.4open.science/r/FlexBench-E0D7.
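The abstract's mapping from tri-valued per-dimension judgments to Coverage and Achievement could, under one plausible reading, look like the following. This is a minimal sketch only: the function names and the exact aggregation (treating "unknown" as an unreached dimension, and scoring correctness over reached dimensions) are assumptions for illustration, not the paper's actual definitions.

```python
# Hypothetical sketch of FlexEval-style metrics (aggregation rules assumed,
# not taken from the paper).
def coverage(judgments):
    """Workflow progress: fraction of dimensions reached (judged yes or no)."""
    reached = [j for j in judgments if j in ("yes", "no")]
    return len(reached) / len(judgments)

def achievement(judgments):
    """Conditional correctness: fraction of reached dimensions judged yes."""
    reached = [j for j in judgments if j in ("yes", "no")]
    return sum(j == "yes" for j in reached) / len(reached) if reached else 0.0

judgments = ["yes", "yes", "no", "unknown"]
print(coverage(judgments))     # 0.75
print(achievement(judgments))  # 0.666...
```

Separating the two scores keeps "the model never got to this clause" (lower Coverage) distinct from "the model reached it and got it wrong" (lower Achievement).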
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Dialogue and Interactive Systems, Resources and Evaluation, Generation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 5232