Unlocking The Power Of Layer-By-Layer Training And Fine-Tuning

TMLR Paper 7218 Authors

28 Jan 2026 (modified: 17 Feb 2026) · Under review for TMLR · CC BY 4.0
Abstract: Layer-wise (LW) training of deep neural networks has long been associated with memory and parallelism advantages, yet it suffers from information degradation and poor convergence in deep architectures. Recent work attributes these issues to the loss of input information and the lack of layer-role differentiation, as measured by the Hilbert-Schmidt Independence Criterion (HSIC). In this paper, we present a novel algorithm that enables full end-to-end training of ResNet-18/ResNet-50 and end-to-end fine-tuning of Large Language Models (LLMs) using a modified LW approach, while minimizing performance degradation. Our fundamental contribution lies in the discovery that strategically reintroducing the final layers during LW training mitigates the convergence degradation that LW training typically exhibits relative to conventional end-to-end training. We introduce Segmented Propagation (SegProp), a training paradigm that integrates the computational efficiency of LW optimization with the representational power of global supervision. Quantitative results demonstrate substantial improvements in convergence compared to standard LW fine-tuning of LLMs and standard LW training of ResNet-18/ResNet-50. SegProp improves ResNet-50 accuracy on CIFAR-10 from 90.0% (LW) to 94.3%, approaching E2E training at 95.5%. On ResNet-18, SegProp improves CIFAR-10 accuracy from 93.7% (LW) to 95.2%, closely matching E2E at 95.5%. On Mistral-Nemo-Instruct-2407, SegProp segmented fine-tuning matches E2E MMLU (5-shot) performance (69.3%), and for Llama3.1-8B-Instruct it achieves 78.9% on Winogrande (5-shot), closely matching E2E fine-tuning at 79.1%.
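The core idea described in the abstract, training layers one at a time while reattaching the final layers to each stage, can be sketched as a stage schedule. This is an illustrative sketch only: the function name, the `num_tail` parameter, and the exact freezing policy are assumptions, not the paper's actual algorithm or API.

```python
def segprop_stages(num_layers, num_tail):
    """For each layer-wise stage, return (trainable, frozen) layer indices.

    Stage i trains layer i together with the final `num_tail` layers
    (the reintroduced tail providing global supervision); all other
    layers are frozen. This mimics layer-wise training with the final
    layers strategically reattached, per the SegProp description.
    Hypothetical scheduling sketch, not the authors' implementation.
    """
    tail = list(range(num_layers - num_tail, num_layers))
    stages = []
    for i in range(num_layers - num_tail):
        trainable = sorted(set([i] + tail))
        frozen = [j for j in range(num_layers) if j not in trainable]
        stages.append((trainable, frozen))
    return stages


if __name__ == "__main__":
    # A 6-layer network with the last 2 layers reattached at every stage.
    for trainable, frozen in segprop_stages(6, num_tail=2):
        print("train:", trainable, "freeze:", frozen)
```

In a framework such as PyTorch, each stage would correspond to setting `requires_grad=True` only on the listed trainable layers before running that stage's optimization steps.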
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Quanquan_Gu1
Submission Number: 7218