A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

Published: 18 Sept 2025, Last Modified: 29 Oct 2025
Venue: NeurIPS 2025 Spotlight
License: CC BY 4.0
Keywords: LLMs, Knowledge Distillation, Pruning, Efficient Pre-training, Model Compression
TL;DR: We "clone" large LLMs into small SLMs by training only low-rank projection matrices for the weights and aligning all student activations with the teacher's. This yields comparable SLM performance with over 1,000x fewer training tokens.
Abstract: Training high-performing Small Language Models (SLMs) remains computationally expensive, even with knowledge distillation and pruning from larger teacher models. Existing approaches often face three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce \textbf{Low-Rank Clone (LRC)}, an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning, by compressing teacher weights, and activation clone, by aligning student activations, including FFN signals, with those of the teacher. This unified design maximizes knowledge transfer while removing the need for explicit alignment modules. Extensive experiments with open-source teachers such as Llama-3.2-3B-Instruct and Qwen2.5-3B/7B-Instruct show that LRC matches or surpasses the performance of state-of-the-art models trained on trillions of tokens while using only 20B tokens, achieving over \textbf{1,000$\times$} greater training efficiency. Our code and model checkpoints are available at https://github.com/CURRENTF/LowRankClone and https://huggingface.co/JitaiHao/LRC-4B-Base.
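Illustrative sketch (not the released implementation): the abstract's two mechanisms can be pictured with a minimal PyTorch snippet. Names such as `LowRankClonedLinear`, `lrc_loss`, and `act_projs` are assumptions for illustration only; the actual architecture and losses are defined in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankClonedLinear(nn.Module):
    """Student linear layer whose weight is a low-rank projection of a frozen
    teacher weight ("soft pruning"): only the projection matrices are trained."""

    def __init__(self, teacher_weight: torch.Tensor, d_out_s: int, d_in_s: int):
        super().__init__()
        d_out_t, d_in_t = teacher_weight.shape
        # Frozen teacher weight; gradients flow only through the projections.
        self.register_buffer("w_teacher", teacher_weight)
        self.proj_out = nn.Parameter(torch.randn(d_out_s, d_out_t) / d_out_t ** 0.5)
        self.proj_in = nn.Parameter(torch.randn(d_in_s, d_in_t) / d_in_t ** 0.5)

    def effective_weight(self) -> torch.Tensor:
        # Student weight = P_out @ W_teacher @ P_in^T (compressed, not hard-pruned).
        return self.proj_out @ self.w_teacher @ self.proj_in.T

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.effective_weight())


def lrc_loss(student_logits, teacher_logits, student_acts, teacher_acts,
             act_projs, alpha: float = 1.0):
    """Logit distillation plus an activation-clone term that aligns student
    activations (hidden states and FFN signals) with the teacher's.

    act_projs: per-layer low-rank maps from student to teacher activation space
    (hypothetical names; the exact alignment scheme follows the paper/repo).
    """
    kd = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    clone = sum(
        F.mse_loss(p(a_s), a_t.detach())
        for p, a_s, a_t in zip(act_projs, student_acts, teacher_acts)
    )
    return kd + alpha * clone
```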
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 8862