Keywords: Data Synthesis, Code Data, Agents
Abstract: Acquiring high-quality instruction-code pairs is essential for training Large Language Models for code generation. Manually curated data is expensive and limited in scale, motivating the development of code-centric synthesis methods. Yet current approaches often rely on predefined heuristics, yielding synthetic data that is ungrounded, repetitive, or simplistic. We propose CodeEvo, a framework inspired by collaborative programming that employs two interacting LLM agents: a Coder, which generates and refines solutions, and a Reviewer, which directs the synthesis process. To overcome the limitations of simple heuristics, the Reviewer first constructs a Schema, a structured blueprint that explicitly plans the logic, constraints, and complexity of a new instruction prior to its generation. This planning process is complemented by a hybrid feedback mechanism that combines compiler determinism with the Reviewer's semantic evaluation, ensuring rigorous quality control. Extensive experiments demonstrate that models fine-tuned on CodeEvo data significantly outperform established baselines across code generation benchmarks. In-depth analyses further provide insights into effective code-centric data synthesis.
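The abstract's Coder-Reviewer interaction with hybrid feedback can be sketched roughly as follows. This is a minimal illustrative skeleton, not the authors' implementation: all names (`Schema`, `reviewer_plan`, `coder_generate`, `reviewer_evaluate`, `synthesize`) are hypothetical, and the LLM calls are replaced by deterministic stubs. The one concrete element is the "compiler determinism" signal, shown here as a Python `compile()` check.

```python
# Hypothetical sketch of a CodeEvo-style synthesis loop.
# All class/function names are illustrative assumptions; the LLM-driven
# steps (planning, generation, semantic review) are stubbed out.
from dataclasses import dataclass, field


@dataclass
class Schema:
    # Structured blueprint the Reviewer drafts before any instruction exists.
    logic: str
    constraints: list = field(default_factory=list)
    complexity: str = "medium"


def reviewer_plan(seed: str) -> Schema:
    # Stub for the Reviewer planning logic, constraints, and complexity.
    return Schema(logic=f"solve: {seed}", constraints=["pure function"])


def coder_generate(schema: Schema) -> str:
    # Stub for the Coder producing a candidate solution for the schema.
    return f"def solution(xs):\n    # {schema.logic}\n    return sorted(xs)\n"


def compiler_check(code: str) -> bool:
    # Deterministic signal: does the candidate at least parse/compile?
    try:
        compile(code, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def reviewer_evaluate(code: str, schema: Schema) -> bool:
    # Stub for the Reviewer's semantic judgment against the schema.
    return "def solution" in code and bool(schema.constraints)


def synthesize(seed: str, max_rounds: int = 3):
    # Hybrid feedback: accept an instruction-code pair only when both the
    # deterministic compiler signal and the semantic review pass.
    schema = reviewer_plan(seed)
    for _ in range(max_rounds):
        code = coder_generate(schema)
        if compiler_check(code) and reviewer_evaluate(code, schema):
            return schema, code
    return None


pair = synthesize("sort a list of integers")
```

In a real system the stubs would be LLM calls and the compiler check would run an actual toolchain or test harness; the point of the sketch is only the control flow, in which generation is gated by a blueprint and a two-signal filter.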
Submission Number: 46