Chain-of-Model Learning for Language Model

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Foundation Model, Next-Gen Language Models, Chain-of-Model, Dynamic Activation, Efficiency
Abstract: In this paper, we propose a novel learning paradigm, termed *Chain-of-Model* (CoM), which incorporates causal relationships into the hidden states of each layer in a chain style, thereby introducing great scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of *Chain-of-Representation* (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains). In each layer, each chain of the output representations can only view all of its preceding chains in the input representations. Consequently, a model built upon the CoM framework can progressively scale up its size by adding chains on top of the previous models (i.e., chains), and can offer multiple sub-models of varying sizes for elastic inference by using different numbers of chains. Based on this principle, we devise *Chain-of-Language-Model* (CoLM), which incorporates the idea of CoM into each layer of the Transformer architecture. Building on CoLM, we further introduce CoLM-Air, which adds a *KV sharing* mechanism that computes all keys and values within the first chain and then shares them across all chains. This design offers additional extensibility, such as seamless LM switching and prefilling acceleration. Experimental results demonstrate that our CoLM family achieves performance comparable to the standard Transformer while enabling greater flexibility, such as progressive scaling to improve training efficiency and multiple model sizes for elastic inference, paving a new way toward building language models.
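To make the chain constraint concrete, the following is a minimal PyTorch sketch of a chain-structured linear map in which output chain i only reads input chains 1..i, and using fewer chains yields a smaller sub-model. This is an illustration of the dependency pattern described in the abstract, not the authors' CoLM implementation; the names `ChainLinear` and `active_chains` and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn


class ChainLinear(nn.Module):
    """Sketch of a chain-structured linear layer (illustrative, not the authors' code):
    the hidden state is split into `n_chains` sub-representations, and output chain i
    is computed only from input chains 1..i, giving the block-causal dependency the
    abstract describes."""

    def __init__(self, dim_per_chain: int, n_chains: int):
        super().__init__()
        self.n_chains = n_chains
        self.dim_per_chain = dim_per_chain
        # One projection per output chain; chain i sees (i + 1) input chains.
        self.blocks = nn.ModuleList(
            nn.Linear((i + 1) * dim_per_chain, dim_per_chain)
            for i in range(n_chains)
        )

    def forward(self, x: torch.Tensor, active_chains: int | None = None) -> torch.Tensor:
        # x: (..., n_chains * dim_per_chain). Restricting to the first `active_chains`
        # chains yields a smaller sub-model, as in elastic inference.
        k = active_chains or self.n_chains
        outs = []
        for i in range(k):
            prefix = x[..., : (i + 1) * self.dim_per_chain]
            outs.append(self.blocks[i](prefix))
        return torch.cat(outs, dim=-1)


if __name__ == "__main__":
    layer = ChainLinear(dim_per_chain=64, n_chains=4)
    h = torch.randn(2, 10, 256)
    full = layer(h)                    # all 4 chains -> (2, 10, 256)
    small = layer(h, active_chains=2)  # 2-chain sub-model -> (2, 10, 128)
    print(full.shape, small.shape)
```

In this sketch, progressive scaling would correspond to appending new chains (and their projections) while reusing the already-trained prefix, and the KV-sharing idea of CoLM-Air would correspond to computing attention keys and values from the first chain only and reusing them for every chain.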
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 4761