C$^3$-Bench: Evaluating and Achieving Controllable Code Completion in Code LLM

Authors: Anonymous (ICLR 2026 Conference Submission 103)

01 Sept 2025 (modified: 23 Dec 2025) · License: CC BY 4.0
Keywords: Large Language Models, Code Language Models, Code Completion, Instruction Following
TL;DR: We introduce C$^3$-Bench, a benchmark for code LLMs that tests both code correctness and instruction following during completion; it reveals gaps in current models, and we develop a better-performing model through automated training-data generation.
Abstract: Code completion has become a central task in software engineering, gaining significant attention with the rise of large language model (LLM)-based tools. Although recent advances have greatly improved LLMs' code completion abilities, evaluation methods have not kept pace. Most current benchmarks assess only the functional correctness of completions given the surrounding context, overlooking models' ability to follow user instructions during completion, a common scenario in LLM-assisted programming. To address this limitation, we present the first instruction-guided code completion benchmark, the \textbf{\underline{C}}ontrollable \textbf{\underline{C}}ode \textbf{\underline{C}}ompletion Benchmark (C$^3$-Bench), comprising 2,195 carefully designed completion tasks. Through a comprehensive evaluation of over 40 mainstream LLMs on C$^3$-Bench and conventional benchmarks, we reveal substantial gaps in instruction-following capability between open-source and advanced proprietary models on code completion tasks. Moreover, we develop a straightforward data synthesis pipeline that leverages Qwen2.5-Coder to generate high-quality instruction-completion pairs for supervised fine-tuning (SFT). The resulting model, Qwen2.5-Coder-C$^3$, achieves state-of-the-art performance on C$^3$-Bench. Our findings provide valuable insights for enhancing LLMs' code completion and instruction-following capabilities and establish new directions for future research on code LLMs. To facilitate reproducibility and foster further research, we open-source all code, datasets, and models.
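To make the task setting concrete, below is a minimal sketch of what an instruction-guided completion record and a functional-correctness check might look like. The record schema (`prefix`, `suffix`, `instruction`, `tests`) and the `evaluate` helper are illustrative assumptions for this sketch, not the benchmark's actual format.

```python
# Hypothetical sketch of an instruction-guided completion task and a
# pass/fail correctness check; schema and helper names are assumptions,
# not C^3-Bench's real data format.
import subprocess
import sys
import tempfile
import textwrap

task = {
    "task_id": "c3-bench/example-0001",
    "prefix": "def moving_average(xs, k):\n",
    "suffix": "",
    # The user constraint the completion must satisfy beyond correctness.
    "instruction": "Complete the function without importing any modules, "
                   "and raise ValueError when k <= 0.",
    "tests": textwrap.dedent("""
        assert moving_average([1, 2, 3, 4], 2) == [1.5, 2.5, 3.5]
        try:
            moving_average([1], 0)
        except ValueError:
            pass
        else:
            raise AssertionError("expected ValueError for k <= 0")
    """),
}

def evaluate(completion: str, task: dict) -> bool:
    """Run the task's tests against prefix + completion + suffix."""
    program = task["prefix"] + completion + task["suffix"] + "\n" + task["tests"]
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=10)
    return result.returncode == 0

# Example: a completion that satisfies both the tests and the instruction.
completion = (
    "    if k <= 0:\n"
    "        raise ValueError('k must be positive')\n"
    "    return [sum(xs[i:i + k]) / k for i in range(len(xs) - k + 1)]\n"
)
print(evaluate(completion, task))  # True if the completed function passes
```

Note that `evaluate` checks functional correctness only; compliance with the instruction (here, avoiding imports) would require a separate check, which is precisely the dimension the benchmark is designed to measure.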
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 103