Keywords: Implicit bias, mirror flow, sign gradient descent, Adam, AdamW, steepest descent, reparameterization, diagonal linear networks
TL;DR: The connection between reparameterizations and steepest mirror flows shows how the geometry of steepest descent directly shapes feature learning: it enables saddle-point escape, promotes sparsity, and stabilizes invariances.
Abstract: How does the choice of optimization algorithm shape a model’s ability to learn features? To address this question for steepest descent methods—including sign descent, which is closely related to Adam—we introduce steepest mirror flows as a unifying theoretical framework. This framework reveals how optimization geometry governs learning dynamics, implicit bias, and sparsity, and it provides two explanations for why Adam and AdamW often outperform SGD in fine-tuning. Focusing on diagonal linear networks and deep diagonal linear reparameterizations (a simplified proxy for attention), we show that steeper descent facilitates both saddle-point escape and feature learning. In contrast, gradient descent requires unrealistically large learning rates to escape saddles, a regime uncommon in fine-tuning. Empirically, we confirm that saddle-point escape is a central challenge in fine-tuning. Furthermore, we demonstrate that decoupled weight decay, as in AdamW, stabilizes feature learning by enforcing novel balance equations. Together, these results highlight two mechanisms by which steepest descent can aid modern optimization.
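The following is a minimal toy sketch (not the paper's experimental setup) of the two optimizers the abstract contrasts: plain gradient descent versus sign descent on a diagonal linear network w = u ⊙ v fit to a sparse least-squares target. All names, initializations, and hyperparameters here are illustrative assumptions.

```python
# Illustrative toy: diagonal linear network w = u * v on overparameterized
# least squares, trained with gradient descent vs. sign descent.
# Hyperparameters and initialization are assumptions, not the paper's choices.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20, 50, 3                       # samples, features, true nonzeros
w_star = np.zeros(d); w_star[:k] = 1.0    # sparse ground-truth weights
X = rng.standard_normal((n, d))
y = X @ w_star

def train(update, steps=5000, lr=1e-2, init=0.1):
    """Train the reparameterized weights u, v with a generic update rule."""
    u = np.full(d, init)
    v = np.full(d, init)
    for _ in range(steps):
        r = X @ (u * v) - y               # residual of effective weights w = u * v
        g = X.T @ r / n                   # gradient w.r.t. w
        gu, gv = g * v, g * u             # chain rule through the reparameterization
        u, v = update(u, lr, gu), update(v, lr, gv)
    return u * v

gd_w   = train(lambda p, lr, g: p - lr * g)                      # gradient descent
sign_w = train(lambda p, lr, g: p - lr * np.sign(g), lr=1e-3)    # sign descent

for name, w in [("GD", gd_w), ("sign descent", sign_w)]:
    print(f"{name}: error={np.linalg.norm(w - w_star):.3f}, "
          f"near-zero coords={(np.abs(w) < 1e-2).sum()}/{d}")
```

The printed error and coordinate counts let one compare how the two geometries move the effective weights away from the small initialization; they are diagnostics for this toy only, not a reproduction of the paper's results.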
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 4878