Kaiyuan-2B: Pushing the Limits of Fully-Open Language Models through Data Benchmarking and Curriculum

ACL ARR 2026 January Submission4608 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Open source, Language Model pretraining, Data benchmarking, Data curriculum
Abstract: The rapid advancement of Large Language Models (LLMs) has created a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology comprises three data-centric innovations: a Quantile Data Benchmarking method that systematically compares heterogeneous open-source datasets and informs data-mixing strategies; a Bi-Level Curriculum Training policy that progressively introduces domain-specialized and refined samples at both the phase and instance levels; and a Strategic Selective Repetition scheme within the multi-phase paradigm that effectively leverages sparse, high-quality data. Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under the Apache 2.0 license.
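To make the Quantile Data Benchmarking idea concrete, below is a minimal sketch of one plausible reading: heterogeneous corpora are summarized by quantiles of a shared per-sample quality score so their distributions can be compared on a common scale. The abstract does not specify the actual procedure; the function `quantile_profile`, the choice of quantiles, and the synthetic score distributions here are all illustrative assumptions, not the paper's method.

```python
import numpy as np

def quantile_profile(scores, quantiles=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Summarize a corpus by quantiles of a per-sample quality score.

    Comparing these profiles across corpora puts heterogeneous datasets
    on a common scale (a hypothetical reading of the paper's benchmark).
    """
    return {q: float(np.quantile(scores, q)) for q in quantiles}

# Synthetic per-sample quality scores for two corpora, e.g. from a shared
# quality classifier. Purely illustrative data, not from the paper.
rng = np.random.default_rng(0)
corpus_a = rng.beta(2, 5, size=10_000)  # skews toward low-quality samples
corpus_b = rng.beta(5, 2, size=10_000)  # skews toward high-quality samples

for name, scores in [("corpus_a", corpus_a), ("corpus_b", corpus_b)]:
    print(name, quantile_profile(scores))
```

Profiles like these could then guide mixing weights (e.g., up-weighting corpora whose upper quantiles are strong), which is one way such a benchmark might "provide insights on data mixing strategies" as the abstract describes.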
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Language Modeling
Contribution Types: Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English, Chinese
Submission Number: 4608