Keywords: Branch-Merge Distillation, Domain-specific Supervised Fine-Tuning, Model Merging
Abstract: It is beneficial but challenging to reduce the size of Large Language Models (LLMs) while maintaining their performance. Existing methods, such as naive model distillation, often fail to achieve high accuracy. To address this limitation, we introduce our Branch-Merge distillation approach: first, domain-specific knowledge from a large teacher model is selectively distilled into separate student expert models; then, these student experts are merged to build a generalized model with cross-domain knowledge. With our distillation approach, we create TinyR1-32B-Preview, which outperforms the original student across multiple benchmarks, including Mathematics (+5.5), Coding (+4.4), and Science (+2.9), and achieves performance comparable to DeepSeek-R1 on AIME 2024. Our Branch-Merge distillation provides a novel solution for creating smaller, high-performing LLMs with reduced computational cost and time.
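To illustrate the "merge" step described in the abstract, below is a minimal sketch of combining domain-expert student checkpoints by per-parameter weighted averaging. This is only one common merging strategy; the abstract does not specify the actual algorithm used for TinyR1-32B-Preview, and the function and variable names here are illustrative assumptions.

```python
# Hedged sketch: merging domain-expert student models by weighted parameter
# averaging. The weighting scheme and names below are assumptions, not the
# paper's stated method.
from typing import Dict, List
import torch

def merge_experts(expert_state_dicts: List[Dict[str, torch.Tensor]],
                  weights: List[float]) -> Dict[str, torch.Tensor]:
    """Combine several expert checkpoints into one generalized model
    by taking a per-parameter weighted average."""
    assert len(expert_state_dicts) == len(weights)
    total = sum(weights)
    merged: Dict[str, torch.Tensor] = {}
    for name in expert_state_dicts[0]:
        merged[name] = sum(
            (w / total) * sd[name].float()
            for sd, w in zip(expert_state_dicts, weights)
        )
    return merged

# Usage (illustrative): equal-weight merge of math, code, and science experts.
# merged = merge_experts([math_sd, code_sd, sci_sd], weights=[1.0, 1.0, 1.0])
```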
Paper Type: Short
Research Area: LLM Efficiency
Research Area Keywords: distillation, parameter-efficient-training, LLM Efficiency
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 9373