Abstract: Existing code large language models (LLMs) often rely on large-scale instruction data distilled from proprietary LLMs for fine-tuning, which typically incurs high costs. In this paper, we explore the potential of small-scale open-source LLMs (e.g., 7B) as synthesizers for constructing high-quality code instruction data. We first observe that the data synthesis capability of small-scale LLMs can be enhanced by training on a few superior data synthesis samples from proprietary LLMs. Building on this, we propose a novel multi-round self-distillation approach to bootstrap small-scale LLMs, transforming them into powerful synthesizers that reduce reliance on proprietary LLMs and minimize costs. Concretely, we design multi-checkpoint sampling and multi-aspect scoring strategies to self-distill data synthesis samples and filter them in each round. From these filtered samples, we further select high-value ones by introducing an optimizer-aware influence estimation method, which estimates the influence of each self-distilled sample by calculating its gradient similarity to the superior samples from proprietary LLMs. Using the code instruction data produced by our small-scale synthesizers, we introduce SCoder, a family of code generation models fine-tuned from DeepSeek-Coder. SCoder achieves state-of-the-art code generation capabilities, demonstrating the effectiveness of our method.
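The gradient-similarity-based influence estimation described above can be illustrated with a minimal sketch. This is not the authors' released code; the "optimizer-aware" weighting details are not given in the abstract, so the sketch below falls back to plain cosine similarity between per-sample gradients, and all function and variable names (e.g., `per_sample_grad`, `influence_score`) are hypothetical.

```python
# Minimal sketch (assumptions noted above): score a self-distilled candidate by
# the similarity of its loss gradient to gradients of superior samples that were
# distilled from a proprietary LLM.
import torch
import torch.nn.functional as F

def per_sample_grad(model, loss_fn, sample):
    """Flattened gradient of the loss on a single sample w.r.t. model parameters."""
    model.zero_grad()
    loss = loss_fn(model, sample)
    loss.backward()
    grads = [p.grad.detach().flatten()
             for p in model.parameters() if p.grad is not None]
    return torch.cat(grads)

def influence_score(model, loss_fn, candidate, superior_samples):
    """Average cosine similarity between the candidate's gradient and the
    gradients of the superior (proprietary-LLM-distilled) samples."""
    g_c = per_sample_grad(model, loss_fn, candidate)
    sims = []
    for s in superior_samples:
        g_s = per_sample_grad(model, loss_fn, s)
        sims.append(F.cosine_similarity(g_c, g_s, dim=0))
    return torch.stack(sims).mean().item()

# Hypothetical usage: rank candidates and keep the top-k for fine-tuning
# the student model (e.g., DeepSeek-Coder).
# scores = [influence_score(model, loss_fn, c, superior) for c in candidates]
# selected = [c for _, c in sorted(zip(scores, candidates), reverse=True)[:k]]
```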
Paper Type: Long
Research Area: Generation
Research Area Keywords: code generation, large language model, bootstrap
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 6867