Abstract: Multi-turn instruction following is a core competency of large language models (LLMs) in real-world applications.
Existing evaluation benchmarks predominantly focus on fine-grained constraint satisfaction and domain-specific capability assessment, yet overlook the crucial structural dependency between dialogue turns that distinguishes multi-turn from single-turn interactions.
This structural dependency not only reflects user intent but also establishes a second dimension for instruction following evaluation beyond constraint satisfaction.
To address this gap, we propose StructFlowBench, a multi-turn instruction following benchmark with structural flow modeling.
The benchmark defines a novel structural flow framework comprising six fundamental inter-turn relationships; these relationships not only introduce structural constraints for model evaluation but also serve as generation parameters for creating customized dialogue flows tailored to specific scenarios.
Adopting established LLM-based automatic evaluation methodologies, we conduct systematic evaluations of 13 leading open-source and closed-source LLMs.
Experimental results reveal significant deficiencies in current models' comprehension of multi-turn dialogue structures.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, evaluation
Contribution Types: Data resources
Languages Studied: English
Submission Number: 6379