The Momentum Persistence Effect: A New Theory for Why Soft Constraints Outperform Hard Projections

ICLR 2026 Conference Submission 20639 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Constrained Optimization, Deep Learning Theory, Optimization Dynamics, Momentum Methods, Orthogonal Constraints, Regularization, Stiefel Manifold
TL;DR: We introduce 'momentum corruption' to explain theoretically why soft, penalty-based constraints often outperform exact, hard projections in deep learning.
Abstract: A persistent empirical puzzle in deep learning is why soft, penalty-based constraints often outperform their mathematically exact, hard-projected counterparts. Classical optimization theory provides elegant models of constrained dynamics, yet it fails to explain this phenomenon. This paper resolves the puzzle by identifying a fundamental, previously unaccounted-for mechanism: the momentum persistence effect. We show that classical analyses implicitly assume optimizer momentum resets after each projection, an assumption contradicted by standard implementations such as Adam and SGD with momentum. Through controlled experiments on a tractable quadratic problem, we first show that the "momentum reset" model fails catastrophically, under-predicting corruption magnitudes by orders of magnitude and mispredicting how corruption scales with learning rate, projection frequency, and problem conditioning. We then isolate the cause through a crucial experiment: when momentum persists across projections, as it does in practice, the inherited optimizer state compounds corruption, which saturates at levels orders of magnitude higher than in memory-less cycles. Our corrected model accurately predicts this saturation and explains the observed super-linear scaling relationships. We further validate these principles in large-scale Transformer models using Orthogonal Subspace Projection Attention (OSPA), confirming that momentum persistence significantly affects performance, particularly in high-noise, low-data regimes. Our discovery reveals a critical blind spot in constrained optimization theory and yields concrete design principles for practitioners: prefer soft constraints when possible, and when hard projections are necessary, co-design them with the optimizer to minimize momentum corruption.
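To make the mechanism concrete, here is a minimal sketch (not the authors' code) contrasting persisted momentum with the classical momentum-reset assumption on a toy quadratic constrained to the unit sphere. The specific quadratic, the heavy-ball optimizer, the hyperparameters, and the corruption proxy |v·w| (the component of the momentum buffer normal to the constraint surface at w) are all illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: heavy-ball momentum with periodic hard projection onto
# the unit sphere, comparing persisted vs. reset momentum. The problem
# setup and the corruption proxy |v . w| are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 50

# Ill-conditioned quadratic f(w) = 0.5 * w^T A w, condition number ~1e3.
eigs = np.logspace(0, 3, d)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(eigs) @ Q.T

w0 = rng.standard_normal(d)              # shared initial point

def project(w):
    """Hard projection onto the constraint set, here the unit sphere."""
    return w / np.linalg.norm(w)

def run(reset_momentum, lr=1e-3, beta=0.9, steps=2000, proj_every=10):
    w, v = project(w0), np.zeros(d)      # same init for both variants
    corruption = []
    for t in range(1, steps + 1):
        g = A @ w                        # gradient of the quadratic
        v = beta * v + g                 # SGD-style momentum buffer
        w = w - lr * v
        if t % proj_every == 0:
            w = project(w)
            # Corruption proxy: momentum component normal to the sphere
            # at w, i.e. inherited state that pushes the iterate off the
            # constraint set on subsequent steps.
            corruption.append(abs(v @ w))
            if reset_momentum:
                v = np.zeros(d)          # the classical "memory-less" model
    return np.array(corruption)

persist = run(reset_momentum=False)
reset = run(reset_momentum=True)
print("mean corruption over the last 50 projections:")
print(f"  momentum persists (practice):  {persist[-50:].mean():.3e}")
print(f"  momentum reset (classical):    {reset[-50:].mean():.3e}")
```

Under these assumptions, the persisted buffer carries pre-projection state into every cycle, so its normal component at each projection is systematically larger than in the memory-less variant; how large the gap grows depends on the momentum coefficient relative to the projection period, which is the kind of scaling relationship the paper's corrected model is said to predict.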
Primary Area: optimization
Submission Number: 20639