Abstract: Large Language Models (LLMs) have recently demonstrated remarkable coding capabilities. However, assessing generated code against verifiable properties and aligning it with developer preferences remains challenging. In this paper, we explore two key questions under the new challenge of code preference learning: (i) how can we train models to predict meaningful preferences for code? and (ii) how do code preferences based on verifiers, humans, and neural models align with each other? To this end, we introduce **CodeFavor**, an open recipe for training pairwise code preference models from synthetic code evolution, including code commits and code critiques. We evaluate code preferences via **CodePrefBench**, a new benchmark of 1,364 rigorously curated code preference tasks covering three verifiable properties (correctness, efficiency, and security) along with human preference. Our evaluation shows that CodeFavor holistically improves model-based code preferences by up to $28.8\%$. Our comprehensive controlled experiments also validate the design choices in CodeFavor. Furthermore, we quantify the cost and limitations of human-based code preference: (i) despite spending 23 person-minutes per task, $15{\sim}40\%$ of tasks remain unsolved; and (ii) human preference is the most accurate on code correctness, while it underperforms model-based preferences on non-functional objectives.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Code Generation, Code Language Models, Code Preference, Preference Learning, Machine Learning for Code
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English, Python
Submission Number: 7327