Abstract: Layer-wise (LW) training of deep neural networks has long been associated with memory and
parallelism advantages, yet it suffers from information degradation and poor convergence
in deep architectures. Recent work attributes these issues to the loss of input information
and the lack of layer-role differentiation, as measured by the Hilbert-Schmidt Independence
Criterion (HSIC).
In this paper, we present a novel algorithm that enables full end-to-end (E2E) training of ResNet-18/ResNet-50 and E2E fine-tuning of Large Language Models (LLMs) using a modified LW approach, while minimizing performance degradation. Our central contribution is the finding that strategically reintroducing the final layers during LW training mitigates the convergence degradation typically observed with LW relative to conventional E2E fine-tuning.
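The schedule implied by this idea can be sketched as follows. This is a minimal, hypothetical illustration of which layers are trainable at each stage, assuming one trained block per stage and a fixed number of reintroduced final layers; the function names and the `k_final` parameter are our own, not the paper's API.

```python
def lw_schedule(num_layers):
    """Plain layer-wise (LW) training: stage i updates only layer i."""
    return [{i} for i in range(num_layers)]

def segprop_schedule(num_layers, k_final=2):
    """SegProp-style variant (as we read the abstract): each stage also
    reintroduces the final k_final layers, restoring global supervision
    from the network head while keeping per-stage memory low."""
    tail = set(range(num_layers - k_final, num_layers))
    return [{i} | tail for i in range(num_layers)]
```

For a 4-layer network with `k_final=2`, every stage updates its own block plus layers 2 and 3, so the loss is always computed through the true output head rather than an auxiliary one.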
We introduce Segmented Propagation (SegProp), a training paradigm that seamlessly integrates
the computational efficiency of LW optimization with the representational power
of global supervision. Quantitative results demonstrate substantial convergence improvements over both standard LW fine-tuning of LLMs and LW training of
ResNet-18/ResNet-50. SegProp improves ResNet-50 accuracy on CIFAR-10 from 90.0%
(LW) to 94.3%, approaching E2E training at 95.5%. On ResNet-18, SegProp improves
CIFAR-10 accuracy from 93.7% (LW) to 95.2%, closely matching E2E at 95.5%. On Mistral-
Nemo-Instruct-2407, SegProp segmented fine-tuning matches E2E MMLU (5-shot) performance
(69.3%), and for Llama3.1-8B-Instruct it achieves 78.9% on Winogrande (5-shot),
closely matching E2E fine-tuning at 79.1%.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Quanquan_Gu1
Submission Number: 7218