Keywords: Data Synthesis, Code Data, Agents
Abstract: Acquiring high-quality instruction-code pairs is essential for training Large Language Models for code generation. Manually curated data is expensive and limited in scale, motivating the development of code-centric synthesis methods. Yet current approaches often rely on predefined heuristics, yielding synthetic data that is ungrounded, repetitive, or simplistic. We propose CodeEvo, a framework inspired by collaborative programming that employs two interacting LLM agents: a Coder, which generates and refines solutions, and a Reviewer, which directs the synthesis process. To overcome the limitations of simple heuristics, the Reviewer first constructs a Schema, a structured blueprint that explicitly plans the logic, constraints, and complexity of a new instruction prior to its generation. This planning process is complemented by a hybrid feedback mechanism that combines compiler determinism with the Reviewer's semantic evaluation, ensuring rigorous quality control. Extensive experiments demonstrate that models fine-tuned on CodeEvo data significantly outperform established baselines across code generation benchmarks. In-depth analyses further provide insights into effective code-centric data synthesis.
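The abstract's Coder-Reviewer interaction with hybrid feedback can be sketched roughly as follows. This is a minimal illustrative skeleton, not the authors' implementation: all names (`Schema`, `reviewer_plan`, `coder_generate`, `reviewer_evaluate`, `synthesize`) are hypothetical, and the LLM calls are replaced by deterministic stubs. The one concrete element is the "compiler determinism" signal, shown here as a Python `compile()` check.

```python
# Hypothetical sketch of a CodeEvo-style synthesis loop.
# All class/function names are illustrative assumptions; the LLM-driven
# steps (planning, generation, semantic review) are stubbed out.
from dataclasses import dataclass, field


@dataclass
class Schema:
    # Structured blueprint the Reviewer drafts before any instruction exists.
    logic: str
    constraints: list = field(default_factory=list)
    complexity: str = "medium"


def reviewer_plan(seed: str) -> Schema:
    # Stub for the Reviewer planning logic, constraints, and complexity.
    return Schema(logic=f"solve: {seed}", constraints=["pure function"])


def coder_generate(schema: Schema) -> str:
    # Stub for the Coder producing a candidate solution for the schema.
    return f"def solution(xs):\n    # {schema.logic}\n    return sorted(xs)\n"


def compiler_check(code: str) -> bool:
    # Deterministic signal: does the candidate at least parse/compile?
    try:
        compile(code, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def reviewer_evaluate(code: str, schema: Schema) -> bool:
    # Stub for the Reviewer's semantic judgment against the schema.
    return "def solution" in code and bool(schema.constraints)


def synthesize(seed: str, max_rounds: int = 3):
    # Hybrid feedback: accept an instruction-code pair only when both the
    # deterministic compiler signal and the semantic review pass.
    schema = reviewer_plan(seed)
    for _ in range(max_rounds):
        code = coder_generate(schema)
        if compiler_check(code) and reviewer_evaluate(code, schema):
            return schema, code
    return None


pair = synthesize("sort a list of integers")
```

In a real system the stubs would be LLM calls and the compiler check would run an actual toolchain or test harness; the point of the sketch is only the control flow, in which generation is gated by a blueprint and a two-signal filter.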
Submission Number: 46