CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation

ACL ARR 2026 January Submission1734 Authors

31 Dec 2025 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Code Generation, Large Language Models, Benchmark, Multi-turn Interaction, CodeFlow, Dependency-aware Evaluation

Abstract: Modern software development demands code that is maintainable, testable, and scalable, which is achieved by organizing the implementation into modular components and iteratively reusing existing code. We formalize this iterative, multi-turn paradigm as \textit{codeflow} and introduce \textbf{CodeFlowBench}, the first benchmark designed to comprehensively evaluate LLMs' ability to perform codeflow: implementing new functionality by reusing existing functions over multiple turns. CodeFlowBench comprises two complementary components: CodeFlowBench-Comp, a core collection of 5,000+ competitive programming problems from Codeforces that is updated via an automated pipeline, and CodeFlowBench-Repo, which is sourced from GitHub repositories to better reflect real-world scenarios. We also introduce a novel evaluation framework featuring a dual assessment protocol and structural metrics derived from dependency trees. Extensive experiments reveal significant performance degradation in multi-turn codeflow scenarios. Furthermore, our in-depth analysis shows that model performance inversely correlates with dependency complexity. These findings not only highlight the critical challenges in supporting real-world workflows, but also establish CodeFlowBench as an essential tool for advancing code generation research. Data and code are hosted at \url{https://anonymous.4open.science/r/CodeFlowBench-5E2983}.
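
As a rough illustration of the codeflow idea, the sketch below models a problem as a dependency tree of functions, derives a bottom-up order in which each turn implements one function that may reuse those from earlier turns, and computes the tree's depth as one possible structural measure. The function names, the `DEPS` map, and the choice of depth as a metric are assumptions made for illustration; the abstract does not specify the benchmark's actual data format or metrics.

```python
from typing import Dict, List

# Hypothetical dependency map (not taken from the benchmark):
# each function lists the helper functions it reuses.
DEPS: Dict[str, List[str]] = {
    "solve": ["count_paths", "read_grid"],
    "count_paths": ["is_blocked"],
    "read_grid": [],
    "is_blocked": [],
}


def depth(name: str, deps: Dict[str, List[str]]) -> int:
    """Depth of the dependency tree rooted at `name` (a leaf has depth 1)."""
    children = deps[name]
    if not children:
        return 1
    return 1 + max(depth(child, deps) for child in children)


def turn_order(root: str, deps: Dict[str, List[str]]) -> List[str]:
    """Post-order traversal: every function appears after its dependencies,
    so each turn can reuse code produced in earlier turns."""
    order: List[str] = []
    seen = set()

    def visit(name: str) -> None:
        if name in seen:
            return
        seen.add(name)
        for child in deps[name]:
            visit(child)
        order.append(name)

    visit(root)
    return order


if __name__ == "__main__":
    print(turn_order("solve", DEPS))
    print(depth("solve", DEPS))
```

In this toy case the leaf helpers are implemented first and `solve` last, which mirrors the multi-turn setting in which later turns build on earlier outputs. A deeper tree means more turns whose correctness depends on earlier ones.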

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: benchmarking, code generation and understanding, corpus creation, evaluation methodologies, metrics

Contribution Types: Data resources, Data analysis

Languages Studied: English

Submission Number: 1734
