Abstract: Existing code large language models (LLMs) often rely on large-scale instruction data distilled from proprietary LLMs for fine-tuning, which typically incurs high costs. In this paper, we explore the potential of small-scale open-source LLMs (e.g., 7B) as synthesizers for constructing high-quality code instruction data. We first observe that the data synthesis capability of small-scale LLMs can be enhanced by training on a few superior data synthesis samples from proprietary LLMs. Building on this, we propose a novel multi-round self-distillation approach to bootstrap small-scale LLMs, transforming them into powerful synthesizers that reduce reliance on proprietary LLMs and minimize costs. Concretely, we design multi-checkpoint sampling and multi-aspect scoring strategies to self-distill data synthesis samples and filter them in each round. From these filtered samples, we further select high-value ones by introducing an optimizer-aware influence estimation method, which estimates the influence of each self-distilled sample by calculating its gradient similarity to the superior samples from proprietary LLMs. Using the code instruction data produced by our small-scale synthesizers, we introduce SCoder, a family of code generation models fine-tuned from DeepSeek-Coder. SCoder achieves state-of-the-art code generation capabilities, demonstrating the effectiveness of our method.
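The gradient-similarity-based influence estimation described above can be illustrated with a minimal sketch. This is not the authors' released code; the "optimizer-aware" weighting details are not given in the abstract, so the sketch below falls back to plain cosine similarity between per-sample gradients, and all function and variable names (e.g., `per_sample_grad`, `influence_score`) are hypothetical.

```python
# Minimal sketch (assumptions noted above): score a self-distilled candidate by
# the similarity of its loss gradient to gradients of superior samples that were
# distilled from a proprietary LLM.
import torch
import torch.nn.functional as F

def per_sample_grad(model, loss_fn, sample):
    """Flattened gradient of the loss on a single sample w.r.t. model parameters."""
    model.zero_grad()
    loss = loss_fn(model, sample)
    loss.backward()
    grads = [p.grad.detach().flatten()
             for p in model.parameters() if p.grad is not None]
    return torch.cat(grads)

def influence_score(model, loss_fn, candidate, superior_samples):
    """Average cosine similarity between the candidate's gradient and the
    gradients of the superior (proprietary-LLM-distilled) samples."""
    g_c = per_sample_grad(model, loss_fn, candidate)
    sims = []
    for s in superior_samples:
        g_s = per_sample_grad(model, loss_fn, s)
        sims.append(F.cosine_similarity(g_c, g_s, dim=0))
    return torch.stack(sims).mean().item()

# Hypothetical usage: rank candidates and keep the top-k for fine-tuning
# the student model (e.g., DeepSeek-Coder).
# scores = [influence_score(model, loss_fn, c, superior) for c in candidates]
# selected = [c for _, c in sorted(zip(scores, candidates), reverse=True)[:k]]
```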
Paper Type: Long
Research Area: Generation
Research Area Keywords: code generation, large language model, bootstrap
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 6867