CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement

Published: 05 Jun 2025, Last Modified: 05 Jun 2025 · Accepted by TMLR · CC BY 4.0
Abstract: Large Language Models (LLMs) have revolutionized code generation but require significant resources and tend to over-generalize, limiting their task-specific efficiency. Fine-tuning smaller, open-source LLMs is a cost-effective alternative, yet standard supervised approaches rely solely on correct examples, overlooking valuable insights from failures. We introduce CodeLutra, a new framework that leverages both correct and incorrect code attempts. Instead of training purely on correct solutions, CodeLutra uses iterative preference-based refinement, comparing successful and failed outputs to better approximate desired results. This process narrows the performance gap with state-of-the-art, larger models, without requiring massive datasets or auxiliary models. For example, on a challenging data science coding task, using only 500 samples improved Llama-3-8B's accuracy from 28.2% to 48.6%, approaching GPT-4's level. By capitalizing on both successes and mistakes, CodeLutra offers a scalable, efficient path to high-quality code generation, making smaller open-source models more competitive with leading closed-source alternatives.
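The abstract's core mechanism is pairing successful and failed code attempts from the same prompt for preference-based refinement. The sketch below is a minimal illustration of this idea, not the authors' implementation: `toy_sampler`, `passes_tests`, and `build_preference_pairs` are hypothetical names, and a real pipeline would sample candidates from the fine-tuned LLM and feed the resulting pairs to a DPO-style trainer.

```python
# Minimal sketch (assumed, not the authors' code): construct preference pairs
# from passing vs. failing code attempts for the same prompt.
from typing import Callable, Dict, List


def passes_tests(code: str, tests: str) -> bool:
    """Return True if the candidate code runs and all its asserts pass."""
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the candidate function(s)
        exec(tests, namespace)  # asserts raise AssertionError on failure
        return True
    except Exception:
        return False


def build_preference_pairs(
    prompt: str,
    tests: str,
    sample_fn: Callable[[str, int], List[str]],
    num_samples: int = 8,
) -> List[Dict[str, str]]:
    """Pair each passing attempt (chosen) with a failing one (rejected)."""
    candidates = sample_fn(prompt, num_samples)
    results = [(c, passes_tests(c, tests)) for c in candidates]
    passed = [c for c, ok in results if ok]
    failed = [c for c, ok in results if not ok]
    return [
        {"prompt": prompt, "chosen": good, "rejected": bad}
        for good, bad in zip(passed, failed)
    ]


if __name__ == "__main__":
    # Toy sampler standing in for an LLM: one correct and one buggy attempt.
    def toy_sampler(prompt: str, n: int) -> List[str]:
        return [
            "def add(a, b):\n    return a + b",
            "def add(a, b):\n    return a - b",  # buggy attempt
        ]

    pairs = build_preference_pairs(
        prompt="Write a function add(a, b) that returns their sum.",
        tests="assert add(2, 3) == 5",
        sample_fn=toy_sampler,
    )
    print(pairs)
```

In this reading, the unit tests act as the preference signal: no reward model or larger teacher is needed, which matches the abstract's claim of avoiding auxiliary models.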
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We thank all reviewers for their insightful feedback. We have revised the paper accordingly, and the major changes are summarized below:

- **[R1]** Added MBPP evaluation results in Appendix B.4 to demonstrate generalizability.
- **[R1]** Clarified model access and reproducibility: included the full checkpoint table (Appendix A.5) and clarified the use of the API.
- **[R1]** Corrected citation formatting throughout the paper.
- **[R1]** Fixed an incorrect figure reference from Figure 3(b) to Figure 4(b).
- **[R2]** Expanded the novelty discussion in Section 2 and Section 3, emphasizing differences between code generation and other tasks.
- **[R2]** Explained the phenomenon of the likelihood drop for both correct and rejected code in DPO using insights from Liu et al. (2024).
- **[R2]** Added details about test input generation and reference code access in Section 3.3.
- **[R2]** Corrected a formatting error in Table 1.
- **[R3]** Added an explanation for the iteration 4 performance dip in Section 5.2.
- **[R3]** Added a Y-axis label to Figure 3.
- **[R3]** Responded to the generalization concern: clarified that while the method is general, the paper focuses on code due to its unique evaluation and constraints.

All added content is marked in red. (* We refer to Reviewer Jkh2 as R1, Reviewer EmsM as R2, and Reviewer LfPu as R3.)
Assigned Action Editor: ~Rui_Zhang7
Submission Number: 4343