Abstract: Code LLMs have been widely used in various domains, including code generation, logical reasoning, and agent systems.
However, most open-access code LLMs release only model weights, omitting key components such as reproducible data pipelines and transparent training protocols, which are crucial for deeper, more reliable investigations. To address this gap, we introduce \textbf{OpenCoder}, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an ``\textbf{open cookbook}'' for the research community. Unlike most prior efforts, we release not only model weights and inference code, but also the reproducible training data, the complete data-processing pipeline, rigorous experimental ablation results, and detailed training protocols for open scientific research. Our work identifies the key ingredients for building a top-tier code LLM: optimized heuristic rules for data cleaning and deduplication, effective recall of code-related text corpora, and high-quality synthetic data for both the annealing and supervised fine-tuning stages. By offering this level of openness, we aim to broaden access to all aspects of a top-tier code LLM, with OpenCoder serving as both a powerful model and an open foundation to accelerate research and enable reproducible advancements in code intelligence.
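To make the ``heuristic rules for data cleaning and deduplication'' mentioned above concrete, the sketch below shows a minimal Python example of the general technique: cheap quality filters followed by exact content-hash deduplication. The thresholds and function names here are illustrative assumptions, not the released OpenCoder pipeline.

```python
import hashlib

# Illustrative thresholds -- assumed for this sketch, not OpenCoder's actual values.
MAX_LINE_LENGTH = 1000      # drop files with extremely long lines (often minified/generated)
MIN_ALPHA_FRACTION = 0.25   # require a minimal fraction of alphabetic characters

def passes_heuristics(code: str) -> bool:
    """Cheap quality filters of the kind used to clean raw code corpora."""
    lines = code.splitlines()
    if not lines:
        return False
    if max(len(line) for line in lines) > MAX_LINE_LENGTH:
        return False
    alpha = sum(ch.isalpha() for ch in code)
    return alpha / max(len(code), 1) >= MIN_ALPHA_FRACTION

def exact_dedup(documents):
    """Keep the first occurrence of each document, keyed by a content hash."""
    seen = set()
    for doc in documents:
        key = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc

if __name__ == "__main__":
    raw = ["print('hello')\n", "print('hello')\n", "x" * 2000]
    cleaned = [d for d in exact_dedup(raw) if passes_heuristics(d)]
    print(len(cleaned))  # -> 1: the duplicate and the degenerate file are removed
```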
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: pre-training, fine-tuning
Contribution Types: NLP engineering experiment, Reproduction study, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English, Chinese
Submission Number: 5774