Keywords: Large language models (LLMs), Multi-turn Benchmark, Instruction-following, Code generation
Abstract: Code generation with large language models (LLMs) has become integral to software development, yet existing benchmarks like HumanEval and LBPP focus on single-turn task completion. In real-world scenarios, users often engage in multi-turn interactions, iteratively refining code through instruction-following feedback to meet complex requirements or constraints. Current benchmarks fail to capture the dynamic, instruction-driven nature of such workflows. To address this, we introduce CoCoPIF, a new pipeline for evaluating LLMs on multi-turn, instruction-following code generation, built by emulating real-world interaction patterns from ShareGPT and drawing problems from LiveCodeBench. Our framework dynamically transforms code problems into multi-turn tasks with verifiable instructions. It features an evaluation protocol that mirrors user-LLM interaction by iteratively refining model outputs through targeted feedback. Furthermore, our assessment approach evaluates both instruction adherence and functional correctness, delivering a reliable measure of model performance. CoCoPIF reflects practical coding scenarios, providing a tool to assess LLMs in realistic, interactive programming contexts.
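The abstract describes a turn-by-turn protocol: instructions are issued one at a time, the model's output is regenerated with each piece of feedback, and every turn is scored on instruction adherence and functional correctness. The sketch below is a minimal, hypothetical illustration of such a loop, not the paper's implementation; all names (evaluate_multi_turn, generate_fn, verify_fn, test_fn) are placeholders assumed for illustration.

```python
# Minimal sketch of a multi-turn, instruction-following evaluation loop.
# All components are stubbed placeholders, not the CoCoPIF API.
from typing import Callable, Dict, List


def evaluate_multi_turn(
    generate_fn: Callable[[List[Dict[str, str]]], str],  # model call: chat history -> code
    verify_fn: Callable[[str, str], bool],                # checks one instruction against the code
    test_fn: Callable[[str], bool],                       # runs the problem's unit tests
    problem_statement: str,
    instructions: List[str],
) -> List[Dict[str, object]]:
    """Feed instructions to the model one turn at a time and score each turn
    on instruction adherence and functional correctness."""
    history = [{"role": "user", "content": problem_statement}]
    per_turn = []

    for turn, instruction in enumerate(instructions):
        # Issue the next verifiable instruction as user feedback.
        history.append({"role": "user", "content": instruction})
        code = generate_fn(history)
        history.append({"role": "assistant", "content": code})

        per_turn.append({
            "turn": turn,
            "instruction_followed": verify_fn(code, instruction),
            "tests_passed": test_fn(code),
        })

    return per_turn


# Toy usage with stubbed model and checkers.
if __name__ == "__main__":
    report = evaluate_multi_turn(
        generate_fn=lambda history: "def solve(xs):\n    return sorted(xs)",
        verify_fn=lambda code, instr: "sorted" in code,
        test_fn=lambda code: True,
        problem_statement="Sort a list of integers.",
        instructions=["Use a built-in sorting function.", "Name the function solve."],
    )
    print(report)
```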
Primary Area: datasets and benchmarks
Submission Number: 17986