Unlocking the Power of Layer-by-Layer Training for LLMs

ICLR 2026 Conference Submission 12239 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: layer-wise training, large language models, segmented propagation, SegProp, information bottleneck, HSIC, efficient training, activation checkpointing, early exit
TL;DR: SegProp restores global supervision in layer-wise LLM training by always training each segment together with the final layers, yielding faster convergence and early-exit capability; in at least one configuration it matches or exceeds end-to-end training.
Abstract: Layer-wise (LW) training of deep neural networks has long been associated with memory and parallelism advantages, yet it suffers from information degradation and poor convergence in deep architectures. Recent work attributes these issues to the loss of input information and the lack of layer-role differentiation, as measured by the Hilbert-Schmidt Independence Criterion (HSIC). In this paper, we present an algorithm that trains Large Language Models (LLMs) in an LW fashion while minimizing the performance gap to end-to-end training. Through a comprehensive set of new experimental results, we show that although prior work has found LW training effective in shallower architectures such as ResNet, its direct application to GPT-style LLMs leads to significant information loss and severely impaired convergence. Our central contribution is the finding that strategically reintroducing the final layers during LW training not only mitigates the convergence degradation typically observed in GPT-style LLMs but can in fact surpass the performance of conventional end-to-end training. This unlocks a new paradigm for scalable optimization of deep transformer architectures, offering a framework for training large models with improved efficiency, stability, and resource utilization. Concretely, we introduce Segmented Propagation (SegProp), a training paradigm that combines the computational efficiency of LW optimization with the representational power of global supervision. SegProp also creates early-exit opportunities during training, enabling model compression. Quantitative results demonstrate substantial improvements in convergence over standard LW training. Finally, we position SegProp within the broader literature on information bottleneck theory, LW training, and early-exit strategies, and discuss its implications for scalable, energy-efficient AI training and inference.
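
For a concrete picture of the segment-plus-tail idea described in the abstract, the following is a minimal conceptual sketch: each new segment is trained while the final blocks and LM head remain attached, so every stage is supervised by the full next-token loss rather than a local objective. All names (ToyGPT, train_segment, TAIL), the freezing schedule, and the toy data are illustrative assumptions, not the authors' SegProp implementation.

```python
# Minimal sketch (assumed, not the authors' code): layer-wise training in which
# the final TAIL blocks and the LM head are always attached, so each segment is
# optimized against the full next-token loss rather than a local objective.
import torch
import torch.nn as nn

VOCAB, D_MODEL, N_HEAD, N_LAYERS, TAIL = 256, 128, 4, 8, 2  # toy configuration


class ToyGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(D_MODEL, N_HEAD, 4 * D_MODEL,
                                       batch_first=True, norm_first=True)
            for _ in range(N_LAYERS)
        ])
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, x, upto):
        # Run the first `upto` blocks, then always the last TAIL blocks and the
        # LM head: the "global supervision" path that is reintroduced here.
        mask = torch.full((x.size(1), x.size(1)), float("-inf")).triu(1)
        h = self.embed(x)
        for blk in list(self.blocks[:upto]) + list(self.blocks[-TAIL:]):
            h = blk(h, src_mask=mask)
        return self.head(h)


def train_segment(model, batch_fn, upto, steps=10, lr=1e-3):
    """Train the newest block (index upto-1) jointly with the tail and head."""
    for p in model.parameters():            # freeze everything ...
        p.requires_grad_(False)
    trainable = (list(model.blocks[upto - 1].parameters())
                 + [p for blk in model.blocks[-TAIL:] for p in blk.parameters()]
                 + list(model.head.parameters()))
    for p in trainable:                     # ... except segment + tail + head
        p.requires_grad_(True)
    opt = torch.optim.AdamW(trainable, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        x, y = batch_fn()
        logits = model(x, upto)
        loss = loss_fn(logits.reshape(-1, VOCAB), y.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()


def fake_batch(batch=8, seq=32):
    x = torch.randint(0, VOCAB, (batch, seq))
    return x, torch.roll(x, -1, dims=1)     # shifted next-token targets


model = ToyGPT()
# Grow the trainable prefix one segment at a time; after each stage the model
# of depth upto + TAIL is a usable early-exit checkpoint.
for upto in range(1, N_LAYERS - TAIL + 1):
    print(f"segment {upto}: loss {train_segment(model, fake_batch, upto):.3f}")
```

In a real system the memory advantage would come from keeping the frozen prefix in lower precision or offloading it; whether SegProp freezes earlier segments, shares one tail across stages, or reinitializes it per segment is not specified by the abstract, so this loop should be read purely as an illustration of the segment-plus-tail training pattern.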
Primary Area: optimization
Submission Number: 12239