Unlocking the Power of Layer-by-Layer Training for LLMs

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: layer-wise training, large language models, segmented propagation, SegProp, information bottleneck, HSIC, efficient training, activation checkpointing, early exit
TL;DR: SegProp restores global supervision in layer-wise LLM training by always including the final layers in each training segment, yielding faster convergence and early-exit capability; in at least one configuration it matches or exceeds end-to-end training.
Abstract: Layer-wise (LW) training of deep neural networks has long been associated with memory and parallelism advantages, yet it suffers from information degradation and poor convergence in deep architectures. Recent work attributes these issues to the loss of input information and the lack of layer-role differentiation, as measured by the Hilbert-Schmidt Independence Criterion (HSIC). In this paper, we present a novel algorithm that enables full end-to-end training of ResNet-18/ResNet-50 and end-to-end fine-tuning of Large Language Models (LLMs) using a modified LW approach, while minimizing performance degradation. Our fundamental contribution lies in the discovery that strategically reintroducing the final layers during LW training mitigates the convergence degradation typically observed in LW training compared to conventional end-to-end fine-tuning. We introduce Segmented Propagation (SegProp), a training paradigm that seamlessly integrates the computational efficiency of LW optimization with the representational power of global supervision. Quantitative results demonstrate substantial improvements in convergence compared to standard LW fine-tuning of LLMs and LW training of ResNet-18/ResNet-50. SegProp improves ResNet-50 accuracy on CIFAR-10 from 90.0% (LW) to 94.3%, approaching E2E training at 95.5%. On ResNet-18, SegProp improves CIFAR-10 accuracy from 93.7% (LW) to 95.2%, closely matching E2E at 95.5%. On Mistral-Nemo-Instruct-2407, SegProp segmented fine-tuning matches E2E MMLU (5-shot) performance (69.3%), and on Llama3.1-8B-Instruct it achieves 78.9% on Winogrande (5-shot), closely matching E2E fine-tuning at 79.1%.
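The abstract does not include code, but the core idea (train one body layer at a time while always attaching the final layers so the loss still carries global supervision) can be sketched as a segment schedule. The function name `segprop_schedule` and the `tail` parameter are hypothetical illustrations, not the paper's actual interface, and the exact segmentation used by SegProp may differ:

```python
def segprop_schedule(num_layers: int, tail: int) -> list[list[int]]:
    """For each layer-wise training stage, return the trainable layer indices:
    the current body layer plus the final `tail` layers.

    This is a hypothetical formulation of SegProp's segmentation; the paper's
    exact schedule (e.g., segment sizes larger than one layer) may differ.
    """
    tail_layers = range(num_layers - tail, num_layers)
    # One stage per body layer; the tail is trainable in every stage.
    return [sorted({k, *tail_layers}) for k in range(num_layers - tail)]

# Example: a 6-layer model with the last 2 layers always attached.
for stage, trainable in enumerate(segprop_schedule(6, 2)):
    print(f"stage {stage}: train layers {trainable}")
```

In a real fine-tuning loop, the indices outside each stage's set would be frozen (e.g., by disabling their gradients), so per-stage memory stays close to plain layer-wise training while the loss is still computed through the final layers.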
Primary Area: optimization
Submission Number: 12239