Memory Savings at What Cost? A Study of Alternatives to Backpropagation

ICLR 2026 Conference Submission 13303 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Backpropagation, Forward-mode Automatic Differentiation, Zero-order Optimization, Gradient Computation, Gradient Estimation
TL;DR: Forward-mode AD and zero-order methods have been proposed as memory-saving alternatives to backpropagation (BP), but prior work ignores checkpointed BP. We show that checkpointed BP matches their memory footprint while outperforming them in accuracy, speed, and compute.
Abstract: Forward-mode automatic differentiation (FmAD) and zero-order (ZO) optimization have been proposed as memory-efficient alternatives to backpropagation (BP) for gradient computation, especially in low-resource settings. However, their practical benefits remain unclear due to two key gaps: a lack of comparison against memory-efficient BP variants, such as activation checkpointing, and a lack of systematic characterization of the tradeoffs among accuracy, memory, and computational efficiency across these methods. This work presents a comprehensive empirical comparison of BP, FmAD, and ZO methods. We first present a theoretical analysis showing that, while FmAD and ZO can reduce memory usage, they incur significant costs in accuracy, convergence speed, and computation compared to BP with checkpointing. These drawbacks worsen with larger models or constrained perturbation budgets. Empirical experiments on large language and vision-language models show that BP with checkpointing outperforms FmAD and ZO variants, including those enhanced with variance reduction, achieving up to 31.1% higher accuracy, 34.8% faster convergence, and 3.8$\times$ less computation at comparable memory usage. We also investigate specific failure modes of FmAD and ZO, including instabilities in Jacobian-vector products that can destabilize training. Our results highlight fundamental limitations of FmAD and ZO and demonstrate the effectiveness of BP with checkpointing for model training under memory-constrained settings.
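For readers unfamiliar with the three gradient-computation strategies compared in the abstract, the following is a minimal PyTorch sketch of each on a toy model. It is illustrative only and does not reproduce the submission's experimental setup; the model, tangent sampling, and perturbation size below are hypothetical choices.

```python
# Illustrative sketch only (not the authors' code): the three gradient
# strategies compared in the paper, applied to a toy model. Model, tangent
# sampling, and eps below are hypothetical choices.
import torch
import torch.nn as nn
from torch.func import functional_call, jvp
from torch.utils.checkpoint import checkpoint

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
x, y = torch.randn(8, 16), torch.randn(8, 1)
loss_fn = nn.MSELoss()

names = [n for n, _ in model.named_parameters()]
params = tuple(p.detach() for p in model.parameters())

def loss_of(*flat_params):
    # Stateless forward pass so the same function serves FmAD and ZO.
    out = functional_call(model, dict(zip(names, flat_params)), (x,))
    return loss_fn(out, y)

# (1) BP with activation checkpointing: the first block's activations are
# recomputed during the backward pass instead of stored, trading compute for memory.
def bp_checkpoint_grads():
    h = checkpoint(model[0], x, use_reentrant=False)
    loss = loss_fn(model[2](torch.relu(h)), y)
    loss.backward()
    return [p.grad for p in model.parameters()]

# (2) FmAD: a single jvp gives the directional derivative of the loss along a
# random tangent v; (dL/dv) * v is a high-variance estimate of the gradient.
def fmad_grad_estimate():
    v = tuple(torch.randn_like(p) for p in params)
    _, dloss_dv = jvp(loss_of, params, v)
    return tuple(dloss_dv * vi for vi in v)

# (3) ZO (two-point, SPSA-style): the same directional derivative is
# approximated from two perturbed forward passes, with no AD state at all.
def zo_grad_estimate(eps=1e-3):
    v = tuple(torch.randn_like(p) for p in params)
    plus = loss_of(*(p + eps * vi for p, vi in zip(params, v)))
    minus = loss_of(*(p - eps * vi for p, vi in zip(params, v)))
    return tuple((plus - minus) / (2 * eps) * vi for vi in v)
```

Note that (2) and (3) only ever run forward computations, which is the source of their memory savings, while (1) achieves a similar memory reduction by recomputing activations during the backward pass.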
Primary Area: optimization
Submission Number: 13303