Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection

ACL ARR 2026 January Submission4442 Authors

05 Jan 2026 (modified: 20 Mar 2026)
License: CC BY 4.0
Keywords: Code LLMs, Data-efficient Fine-tuning, Data Selection, Code Generation
Abstract: Recent advancements in large language models (LLMs) have significantly improved code generation and program comprehension, accelerating the evolution of software engineering. Current methods primarily enhance model performance by leveraging vast amounts of data, focusing on data quantity while often overlooking data quality, thereby reducing training efficiency. To address this, we introduce an approach that uses a parametric model for code data selection, aimed at improving both training efficiency and model performance. Our method optimizes the parametric model to ensure distribution consistency and diversity within the selected subset, guaranteeing high-quality data. Experimental results demonstrate that, using only 10K samples, our method achieves gains of 2.4% (HumanEval) and 2.3% (MBPP) over the full 92K-sample baseline, outperforming other sampling approaches in both performance and efficiency. This underscores that our method effectively boosts model performance while significantly reducing computational costs. Code is available at [here](https://anonymous.4open.science/r/efficode-finetune-6C64).
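The two selection criteria named in the abstract, distribution consistency and diversity, can be illustrated with a minimal greedy heuristic over sample embeddings. This is only a sketch of the general idea, not the paper's parametric model: the scoring function, the `alpha` trade-off weight, and the use of Euclidean distances are all assumptions made here for illustration.

```python
import numpy as np

def select_subset(X, k, alpha=0.5, seed=0):
    """Greedily pick k rows of X, balancing two criteria:
    - distribution consistency: keep the subset mean close to the
      full-data mean;
    - diversity: prefer points far from those already selected.
    Illustrative heuristic only; `alpha` trades off the two terms.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    full_mean = X.mean(axis=0)
    selected = [int(rng.integers(n))]        # random seed point
    remaining = set(range(n)) - set(selected)
    for _ in range(k - 1):
        S = X[selected]
        best, best_score = None, -np.inf
        for i in remaining:
            # Consistency: negative distance of the candidate subset
            # mean to the full-data mean (higher is better).
            cand_mean = (S.sum(axis=0) + X[i]) / (len(selected) + 1)
            consistency = -np.linalg.norm(cand_mean - full_mean)
            # Diversity: distance to the nearest already-selected point.
            diversity = np.min(np.linalg.norm(S - X[i], axis=1))
            score = alpha * consistency + (1 - alpha) * diversity
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.discard(best)
    return selected
```

In practice, `X` would hold embeddings of the 92K training samples and `k` would be 10K; the sketch runs in O(k·n) candidate evaluations, so a real implementation would batch the scoring.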
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: Data-efficient Fine-tuning, Data Selection
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Submission Number: 4442