Keywords: Code Generation, Large Language Models, Benchmark, Multi-turn Interaction, CodeFlow, Dependency-aware Evaluation
Abstract: Modern software development demands code that is maintainable, testable, and scalable, which is achieved by organizing an implementation into modular components and iteratively reusing existing code. We formalize this iterative, multi-turn paradigm as \textit{codeflow} and introduce \textbf{CodeFlowBench}, the first benchmark designed to comprehensively evaluate LLMs’ ability to perform codeflow, that is, to implement new functionality by reusing existing functions over multiple turns. CodeFlowBench comprises two complementary components: CodeFlowBench-Comp, a core collection of 5,000+ competitive programming problems from Codeforces that is updated via an automated pipeline, and CodeFlowBench-Repo, which is sourced from GitHub repositories to better reflect real-world scenarios. We further introduce a novel evaluation framework featuring a dual assessment protocol and structural metrics derived from dependency trees. Extensive experiments reveal significant performance degradation in multi-turn codeflow scenarios. Our in-depth analysis further shows that model performance correlates inversely with dependency complexity. These findings not only highlight critical challenges in supporting real-world workflows, but also establish CodeFlowBench as an essential tool for advancing code generation research. Data and code are hosted at \url{https://anonymous.4open.science/r/CodeFlowBench-5E2983}.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, code generation and understanding, corpus creation, evaluation methodologies, metrics
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 1734