Keywords: layer-wise training, large language models, segmented propagation, SegProp, information bottleneck, HSIC, efficient training, activation checkpointing, early exit
TL;DR: SegProp restores global supervision in layer-wise LLM training by always training each segment jointly with the final layers, yielding faster convergence and early-exit capability; in at least one configuration it matches or exceeds end-to-end training.
Abstract: Layer-wise (LW) training of deep neural networks has long been associated with memory and parallelism advantages, yet it suffers from information degradation and poor convergence in deep architectures. Recent work attributes these issues to the loss of input information and the lack of layer-role differentiation, as measured by the Hilbert-Schmidt Independence Criterion (HSIC). In this paper, we present a novel algorithm that enables LW training of Large Language Models (LLMs) while minimizing the performance degradation relative to full end-to-end training. Through a comprehensive set of new experimental results, we demonstrate that although prior work has shown LW training to be effective in shallower architectures such as ResNet, its direct application to GPT-style LLMs leads to significant information loss and severely impaired convergence. Our central contribution is the finding that strategically reintroducing the final layers during LW training not only mitigates the convergence degradation typically observed in GPT-style LLMs but can in fact surpass the performance of conventional end-to-end training. This result opens a new paradigm for scalable optimization of deep transformer architectures, offering a framework for training large models with improved efficiency, stability, and resource utilization. We formalize this strategy as Segmented Propagation (SegProp), a training paradigm that combines the computational efficiency of LW optimization with the representational power of global supervision. SegProp also creates early-exit opportunities during training, enabling model compression. Quantitative results demonstrate substantial improvements in convergence compared to standard LW training. Finally, we position SegProp within the broader literature on information bottleneck theory, LW training, and early-exit strategies, and discuss its implications for scalable, energy-efficient AI training and inference.
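For intuition, the sketch below illustrates the core idea stated above: segments of the model are trained one at a time, but the final blocks and language-modeling head are always attached so the global loss supervises every segment. This is only a minimal PyTorch illustration under our own assumptions; the two-segment split, the module layout, and names such as `segments` and `final_layers` are illustrative and not taken from the paper.

```python
# Minimal SegProp-style sketch (illustrative only): train one segment at a time,
# but always attach the final layers so every segment receives the global loss.
import torch
import torch.nn as nn

vocab, d_model, n_layers = 100, 64, 8
embed = nn.Embedding(vocab, d_model)
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True) for _ in range(n_layers)
)
head = nn.Linear(d_model, vocab)

# Split the stack into segments; the last blocks and the head are reused with every segment.
segments = [list(blocks[0:3]), list(blocks[3:6])]   # trained one at a time (assumed split)
final_layers = list(blocks[6:8]) + [head]           # always trained jointly

def run(modules, x):
    for m in modules:
        x = m(x)
    return x

x = torch.randint(0, vocab, (2, 16))                # dummy token batch
targets = torch.randint(0, vocab, (2, 16))
loss_fn = nn.CrossEntropyLoss()

for k, seg in enumerate(segments):
    # Only the current segment and the final layers receive gradients.
    params = [p for m in seg + final_layers for p in m.parameters()]
    if k == 0:
        params += list(embed.parameters())          # embedding updated with the first segment only
    opt = torch.optim.AdamW(params, lr=1e-4)

    # One optimization step per segment is shown; in practice each segment
    # would be trained for many steps before moving on.
    h = embed(x)
    with torch.no_grad():                           # earlier, already-trained segments stay frozen
        for prev in segments[:k]:
            h = run(prev, h)
    h = run(seg, h)                                 # current segment (with gradients)
    logits = run(final_layers, h)                   # global supervision through the real LM head
    loss = loss_fn(logits.reshape(-1, vocab), targets.reshape(-1))

    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the final layers are always in the computation graph, each segment is optimized against the true language-modeling objective rather than a local proxy, and any prefix of segments plus the final layers forms a usable early-exit model.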
Primary Area: optimization
Submission Number: 12239