Unlocking The Power Of Layer-By-Layer Training And Fine-Tuning

TMLR Paper7218 Authors

28 Jan 2026 (modified: 11 Jun 2026) | Decision pending for TMLR | CC BY 4.0

Abstract: Layer-wise (LW) and segmented training reduce memory by restricting gradient propagation, but often suffer convergence degradation. We propose \emph{Segmented Propagation (SegProp)}, which keeps a small, trainable \emph{global head} (final layers + task head) active on the loss path throughout training, while updating only the current segment plus this shared head at each stage. This induces depth-wise gradient sparsity and reduces peak activation/optimizer footprint. Empirically, SegProp substantially closes the LW vs. End-to-End (E2E) gap on ResNet-18/50 for CIFAR-10 and achieves competitive performance under harder ImageNet-scale training with ViT, quantifying a clear accuracy--time--memory frontier as global-head depth and segmentation granularity vary. We further provide a system-level feasibility study on LLaMA-70B with 8$\times$40\,GiB GPUs, showing that SegProp enables larger feasible batches than FSDP with CPU offload and characterizing the resulting compute--memory trade-off via a detailed FLOPs analysis. Finally, we show that, in the evaluated 7--12B fine-tuning setups, SegProp matches or nearly matches end-to-end fine-tuning across downstream evaluations.
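The staging rule stated in the abstract (update only the current segment plus the shared global head at each stage) can be illustrated with a small schedule sketch. This is not the authors' implementation: the function name `segprop_schedule`, the use of integer layer indices, the contiguous equal-sized segments, and the input-to-output stage order are all assumptions made for illustration.

```python
def segprop_schedule(num_layers, segment_size, head_depth):
    """Return, for each stage, the indices of the layers that receive updates.

    Assumptions (not taken from the paper): layers are indexed 0..num_layers-1,
    the last `head_depth` layers form the persistent global head, and the
    remaining layers are split into contiguous segments visited in order.
    """
    if segment_size < 1:
        raise ValueError("segment_size must be at least 1")
    if not 0 <= head_depth < num_layers:
        raise ValueError("head_depth must be in [0, num_layers)")

    body = num_layers - head_depth           # layers outside the global head
    head = list(range(body, num_layers))     # trainable at every stage

    stages = []
    for start in range(0, body, segment_size):
        segment = list(range(start, min(start + segment_size, body)))
        # Only the current segment and the shared head are updated;
        # every other layer is treated as frozen for this stage.
        stages.append(segment + head)
    return stages


if __name__ == "__main__":
    for stage, trainable in enumerate(segprop_schedule(8, 3, 2)):
        print(f"stage {stage}: trainable layers {trainable}")
```

In this sketch, a smaller `segment_size` or `head_depth` means fewer layers hold gradients and optimizer state at any one time, which corresponds to the accuracy, time, and memory trade-off the abstract refers to.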

Submission Type: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission: Updated camera-ready paper.

Summary of changes (v2 → v3)

1) Related Work (new baselines added)
- Added Sec. 2.5 “Layer-selective fine-tuning”: discusses LISA and OWS, clarifies they are complementary to SegProp, and notes potential composability (layer selection within the active segment/head).
- Added Sec. 2.6 “Memory-efficient full-parameter optimization”: discusses BAdam, clarifies how it differs from SegProp, and notes potential composability (block-wise optimizer within the trainable segment/head).

2) Clarification of the 70B section’s scope (avoid over-claiming)
- Updated the Limitations/discussion text to state that the LLaMA-70B results are a feasibility + compute–memory trade-off study (batch size, memory traces, FLOPs) and do not claim improved pre-training quality (e.g., perplexity).

3) References
- Updated the reference list to include the newly cited works (LISA, OWS, BAdam).

Summary of changes (Original → v2)

1) Reframed the abstract and positioning
- The abstract was rewritten to center SegProp as “persistent global head + segmented training” and to broaden the scope beyond CIFAR/LLM fine-tuning (original emphasized HSIC framing and “modified LW enables E2E”; v2 emphasizes staged training with a persistent head and a broader accuracy–time–memory trade-off story).

2) Expanded the experimental scope substantially
- Added ImageNet-scale ViT experiments with explicit accuracy–time–memory trade-offs vs segmentation granularity and global-head depth (new Sec. 4.3, Fig. 4, Table 2 in v2).
- Added ImageNet CNN memory analysis on ResNet-101, including comparisons with/without gradient checkpointing and discussion of when segmentation adds savings beyond checkpointing (new Sec. 4.4 and Table 1 in v2).
- Added a system-level feasibility + memory trace study for LLaMA-70B on 8×40GiB GPUs, comparing FSDP+offload vs SegProp and reporting max feasible batch sizes and measured throughput (new Sec. 4.5 and Tables 3–4 in v2).
- Added a detailed FLOPs accounting comparing FSDP vs SegProp for LLaMA-70B, quantifying the compute–memory trade-off (new Sec. 4.6 and Table 5 in v2).

3) Related Work expanded and reorganized
- Added a Related Work subsection on Progressive Growth Transformers (PGT) and an explicit comparison to SegProp (new Sec. 2.4 in v2).
- Revised the “information bottleneck / HSIC” discussion to be more cautious and explicitly framed as qualitative motivation rather than a formal/measured analysis.

4) Method description and appendices updated for the broadened scope
- Updated the narrative in the problem setting/method sections to align with the broader evaluation scope (CNNs + ViT/ImageNet + LLM system feasibility) and to clarify the persistent global head and staged segment updates as the core mechanism.
- Added/expanded appendices to cover the new ViT/ImageNet setup and the LLaMA-70B system setup and measurement protocol (Appendix F for ViT/ImageNet and Appendix D for LLaMA-70B in v2).

5) Shifted the memory story from a small CIFAR-10 table to ImageNet + 70B analyses
- The original included a dedicated “ResNet-50/CIFAR-10 peak memory” table and discussion; v2 broadens the memory evidence with ImageNet-scale ResNet-101 and ViT trade-offs, plus 70B feasibility and FLOPs analysis.

Assigned Action Editor: ~Quanquan_Gu1

Submission Number: 7218
