\section{INTRODUCTION}
Mirror descent (MD) is an optimization method that extends gradient descent (GD) beyond Euclidean geometries \citep{DarzentasJohn1984PCaM}. Central to the MD framework is a mirror map that facilitates transformation between a primal space where iterates exist and a dual space where updates are performed. By defining an appropriate mirror map, MD can adapt to the geometry of the problem for efficient optimization. Since its introduction, MD has attracted considerable research interest in its regularization properties and has motivated development of efficient optimization algorithms \citep{beck2003mirror, radhakrishnan2020linear, azizan2021stochastic, gunasekar2021mirrorless, sun2022mirror, sun2023unified}.

Recent studies reveal the power of adopting an MD perspective to interpret the optimization dynamics of GD for overparameterized problems \citep{woodworth2020kernel, li2022implicit}. In an overparameterized setting, the number of parameters exceeds the number of examples, resulting in an underdetermined system and infinitely many solutions. This becomes an important setting for analyzing the behavior of optimization algorithms and characterizing the particular solutions they converge to among all solutions \citep{allen2019convergence, oymak2019overparameterized}. Given a parameterization of a problem, \citet{woodworth2020kernel, li2022implicit} formulate mirror maps that establish equivalence between GD dynamics and low-dimensional MD dynamics. The simplified dual dynamics lead to a characterization of the convergent solution among all solutions in terms of the Bregman divergence. Specifically, the convergent GD solution minimizes the Bregman divergence from the starting point. This method is further used to analyze the effects of the initialization shape \citep{azulay2021implicit} and stochasticity \citep{pesme2021implicit} on the convergent solution.

Such results have been shown on data where optimal solutions are easy to find, yet the underlying optimization dynamics are nontrivial. While the underlying dynamics for gradient descent have been examined, the dual dynamics for other popular gradient-based methods remain less understood. The MD framework provides a powerful and elegant tool for analyzing high-dimensional optimization dynamics; however, the existence of such mirror maps is highly dependent on both the problem parameterization and the optimization algorithm. Existing analyses do not apply to algorithms beyond (stochastic) GD. The challenges arise from both the formulation of a mirror map and the analysis of dual dynamics. For example, for adaptive gradient methods with coordinate-wise adaptive learning rates, the update directions deviate from the true gradients. The adaptivity alters the underlying dynamics, breaking the low-dimensional structure seen in GD and rendering existing approaches inapplicable. Our work addresses this limitation and proposes a method to apply the MD framework when update directions do not follow true gradients.

Among adaptive gradient descent methods, we examine a prototypical algorithm, smoothed sign descent, which can be viewed as a smoothed version of sign descent with a stability constant $\varepsilon$. Recent work shows a deep connection between smoothed sign descent and popular optimizers such as ADAM and RMSProp \citep{kunstner2023noise, ma2022qualitative, balles2018dissecting, bernstein2018signsgd}. While sign descent has been studied as a proxy to understand the dynamics of more complex adaptive gradient methods \citep{ma2023understanding, balles2020geometry}, studies \citep{wang2021implicit, wang2022does} show that the stability constant plays a key role in determining the convergence direction for classification problems. This highlights the importance of studying smoothed sign descent and investigating the effect of the stability constant $\varepsilon$, which has been underexplored in the literature. We study the dynamics of smoothed sign descent for a quadratically parameterized regression problem. Our results reveal dual dynamics that are distinct from those for GD, and explicitly characterize the relationship between the stability constant $\varepsilon$ and the convergent solution.

In this work, we present an analysis of MD to interpret the optimization dynamics of smoothed sign descent. We identify an initial stage unique to smoothed sign descent, which allows us to formulate a mirror map for the main stage of the dynamics. Using the mirror map, we project the complex primal dynamics onto the dual space with a simplified structure. We further decompose the dual dynamics into a sign descent stage and a convergence stage. The dual dynamics interpretation enables us to connect the convergent solution to the approximate minimizer of a Bregman divergence style function closely related to the $l_{3/2}$-norm. Further analysis reveals the effect of the stability constant $\varepsilon$ on reducing the deviation from the exact minimizer, corroborating the empirical findings on the sensitivity of the training and testing performance to the stability constant \citep{de2018convergence,liu2019variance,choi2019empirical}. 

Our analysis introduces a three-stage decomposition of the complex dynamics, where each phase exhibits distinct characteristics. By carefully studying the behavior within each phase, we establish unique regularization properties of the convergent solution. However, to make the analysis of the underlying coupled nonlinear ODE system tractable, we adopt simplifying assumptions that may limit the direct applicability of our results to real-world settings.

Our contributions are as follows.
\begin{itemize}
    \item We introduce the dual dynamics of smoothed sign descent for a quadratically parameterized regression problem using the MD framework.
    \item We show that after an initial stage, the dual dynamics begin a sign descent stage characterized by approximately linear growth with similar rates in all coordinates, and then transition into a convergence stage characterized by diminishing magnitude of gradients.
    \item We prove that the convergent solution approximately satisfies the KKT conditions for minimizing a Bregman divergence style function, in contrast to the already known exact Bregman divergence minimization property of GD dynamics. The convergent solution found by smoothed sign descent is the one that approximately minimizes the Bregman divergence style function from the starting point.
    \item We theoretically analyze the effect of the stability constant $\varepsilon$ on bounding the deviation from the exact minimizer, emphasizing the benefit of tuning the stability constant. 
\end{itemize}
In Section 2, we review previous research on the properties of MD and smoothed sign descent. In Section 3, we present our main results, including the formulation of dual dynamics and the characterization of convergent solutions. We conclude the paper in Section 4. 

\section{RELATED WORK}
Recent works apply the MD framework to interpret the dynamics of neural network training. \citet{woodworth2020kernel} discover the equivalent low-dimensional MD dynamics for the optimization dynamics of GD for overparameterized models, focusing on the effect of initialization scale. However, extending their methodology to more general cases remains a challenge. \citet{li2022implicit} identify a commutative property of neural network parameterization that enables the formulation of equivalent MD dynamics. \citet{pesme2021implicit} use a time-varying mirror map for stochastic GD and show the benefit of stochasticity for inducing sparsity of the convergent solution. \citet{azulay2021implicit} propose a warping technique to study the effect of the initialization shape on the equivalent MD dynamics of GD. We contribute to this line of research, which deals with strict gradient updates, by extending the framework beyond GD to a case where the adaptive learning rate breaks the gradient structure, and by showing distinct properties of the resulting dual dynamics.

Research on regularization properties of MD algorithms dates back to the work of \citet{beck2003mirror}, which reveals a local regularization effect in terms of Bregman divergence at each iteration. A recent study \citep{gunasekar2018characterizing} shows that MD converges to the solution that minimizes the associated Bregman divergence from the starting point among all solutions. Subsequent works \citep{azizan2018stochastic, azizan2021stochastic} extend this analysis to stochastic MD for nonlinear models and prove the Bregman divergence minimization property. Research so far primarily focuses on standard MD settings, where the dynamics follow the gradient directions in the dual space. In contrast, we study the case where the dual dynamics deviate from the gradients. We show that the convergent solution of smoothed sign descent satisfies the approximate KKT condition of minimizing a Bregman divergence style function by bounding the cumulative deviation.

The stability constant $\varepsilon$, designed to ensure numerical stability for algorithms such as ADAM and RMSProp, is typically set to a negligible value by default. Its impact on optimization dynamics is underexplored. \citet{de2018convergence} experiment with different values of $\varepsilon$ for ADAM and RMSProp and observe that training and testing performance is sensitive to $\varepsilon$. Studies \citep{nado2020evaluating,liu2019variance,choi2019empirical} also provide empirical evidence supporting the benefit of tuning the stability constant $\varepsilon$. \citet{yuan2020eadam} study the effect of modifying the location of $\varepsilon$ in ADAM and propose an alternative optimizer to improve performance. We provide a theoretical justification for tuning the stability constant $\varepsilon$ by explicitly showing its role in reducing the KKT error of the convergent solution. \citet{carmon2022making} introduce an algorithm for stochastic convex optimization, and
show the role of $\varepsilon$ in the regret bound, while our work reveals the role of $\varepsilon$ in shaping the solution found by smoothed sign descent.


Our work also contributes an MD perspective to the ongoing discussion on the implicit regularization phenomenon in neural network training \citep{neyshabur2014search, zhang2021understanding}. While many studies \citep{soudry2018implicit, arora2019implicit, lyugradient} focus on GD, fewer have investigated adaptive gradient methods despite the performance gap observed by \citet{wilson2017marginal}. Notably, studies \citep{wang2021implicit, wang2022does} find that ADAM achieves the same convergent direction as GD in classification problems, while we prove a distinct regularization property for smoothed sign descent compared to GD in regression problems. A recent study \citep{xie2024implicit} characterizes the convergent solution of AdamW as training time approaches infinity. In contrast, we characterize the entire dynamics of smoothed sign descent by formulating the equivalent dual dynamics, which reveal an intrinsically simplified structure. We propose a three-stage decomposition of the dual dynamics that enables an in-depth analysis of the optimization dynamics. A related but different two-stage transition is observed empirically by \citet{ma2022qualitative} when optimizing a squared loss with ADAM for fully connected neural networks, which exhibits an initial phase of fast convergence followed by oscillations, spikes, or a diverging pattern. While the initial phase is similar to our sign descent stage with sufficient updates across all coordinates, we reveal a different behavior in the later stages of convergence.

\section{DUAL DYNAMICS OF SMOOTHED SIGN DESCENT}
\subsection{Background}
Let us consider the update rule of GD for minimizing a loss function $L(\bm\beta)$ with step size $\eta > 0$:
\begin{equation}
    \bm\beta_{t+1} = \bm\beta_t - \eta \nabla L(\bm\beta_t).\label{eq:GDupdaterule}
\end{equation}
We suppose that the iterates $\bm\beta_t$ lie in the Euclidean space $\mathbb{R}^D$. Formally, the gradients $\nabla L(\bm\beta_t)$ lie in the dual space $\mathbb{R}^D$. In GD, we obtain the updated point by directly taking a linear combination of the iterate and the gradient as in \eqref{eq:GDupdaterule}. MD, however, formally distinguishes the primal and the dual spaces using a mirror map to transform between them. A mirror map $\nabla \Phi: \mathbb{R}^D \to \mathbb{R}^D$ is defined as the gradient of a potential function $\Phi: \mathbb{R}^D \to \mathbb{R}$, which is a differentiable and strictly convex function. The mirror map $\nabla \Phi$ maps the primal variable $\bm\beta$ to the dual variable denoted by $\bm\phi \in \mathbb{R}^D$. Each iteration of MD for minimizing $L(\bm\beta)$ with step size $\eta > 0$ consists of the following steps:
\begin{align}
    \bm\phi_t &= \nabla \Phi (\bm\beta_t) \label{eq:dualvariable} \\
    \bm\phi_{t+1} &= \bm\phi_t - \eta \nabla L(\bm\beta_t) \label{eq:discreteMDupdate}\\
    \bm\beta_{t+1} &= (\nabla \Phi)^{-1}(\bm\phi_{t+1}). \label{eq:backtoprimalvariable}
\end{align}
By plugging in \eqref{eq:dualvariable}, we can rewrite the MD update \eqref{eq:discreteMDupdate} in the dual space as: 
\begin{equation}
    \nabla \Phi(\bm\beta_{t+1}) = \nabla \Phi(\bm\beta_t) - \eta \nabla L(\bm\beta_t).
\end{equation}
In the continuous-time limit when $\eta \to 0$, we get the \textbf{dual dynamics} of $\bm \beta(t)$:
\begin{equation}
    \frac{d\nabla \Phi(\bm\beta(t))}{dt} = -\nabla L(\bm\beta(t)). \label{eq:MDdualdynamics}
\end{equation}

A key element of MD is the Bregman divergence that serves as the notion of measuring the distance between two points in the primal space.
\begin{definition}[Bregman divergence]
For $\bm\beta_1, \bm\beta_2 \in \mathbb{R}^D$, the Bregman divergence associated with a potential function $\Phi$ from $\bm\beta_1$ to $\bm\beta_2$ is defined as
\begin{equation}
    D_{\Phi}(\bm\beta_1, \bm\beta_2) = \Phi(\bm\beta_1) - \Phi(\bm\beta_2) - \langle \bm\beta_1 - \bm\beta_2, \nabla \Phi(\bm\beta_2)\rangle.
\end{equation}
\end{definition}
The Bregman divergence generalizes the squared Euclidean distance and captures different geometric structures of the space through the choice of $\Phi$. When $\Phi(\bm\beta) = \frac{1}{2}\Vert \bm\beta \Vert_2^2$, the associated Bregman divergence reduces to half the squared Euclidean distance, $\frac{1}{2}\Vert \bm\beta_1 - \bm\beta_2 \Vert_2^2$, the mirror map $\nabla \Phi$ becomes the identity map, and MD simplifies to GD.
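As a quick check of the Euclidean special case, substituting $\Phi(\bm\beta) = \frac{1}{2}\Vert \bm\beta \Vert_2^2$ and $\nabla\Phi(\bm\beta_2) = \bm\beta_2$ into the definition gives
\begin{align*}
    D_{\Phi}(\bm\beta_1, \bm\beta_2) &= \tfrac{1}{2}\Vert \bm\beta_1 \Vert_2^2 - \tfrac{1}{2}\Vert \bm\beta_2 \Vert_2^2 - \langle \bm\beta_1 - \bm\beta_2, \bm\beta_2 \rangle \\
    &= \tfrac{1}{2}\Vert \bm\beta_1 \Vert_2^2 - \langle \bm\beta_1, \bm\beta_2 \rangle + \tfrac{1}{2}\Vert \bm\beta_2 \Vert_2^2 \\
    &= \tfrac{1}{2}\Vert \bm\beta_1 - \bm\beta_2 \Vert_2^2,
\end{align*}
so the divergence coincides with the squared Euclidean distance up to the factor $\frac{1}{2}$.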

\subsection{Problem Setup}
We suppose that there are $N$ examples with $D > N$ features $\{(\bm{x}^{(i)}, y^{(i)})\}_{i=1, ..., N}$, where $\bm{x}^{(i)} \in \mathbb{R}^D, y^{(i)} \in \mathbb{R}$. Let us denote the data matrix by $X \in \mathbb{R}^{N \times D}$, where each row is $\bm{x}^{(i)}$, and denote the labels of the examples by $\bm{y} \in \mathbb{R}^N$. The Hadamard product is denoted by $\odot$. We consider a regression problem of minimizing the following loss function with respect to $\bm{w} := \begin{bmatrix}
    \bm{w}^+ \\ \bm{w}^-
\end{bmatrix} \in \mathbb{R}^{2D}$, where $\bm{w}^+, \bm{w}^- \in \mathbb{R}^D$:
\begin{align}\label{eq:regressionproblem}
    L(\bm{w}) = \frac{1}{4}&\left(X\left(\bm{w}^+ \odot \bm{w}^+ - \bm{w}^- \odot \bm{w}^-\right) - \bm{y}\right)^\top \nonumber \\
    &\left(X\left(\bm{w}^+ \odot \bm{w}^+ - \bm{w}^- \odot \bm{w}^-\right) - \bm{y}\right).
\end{align}
We let $\bm{\beta} := \bm{w}^+ \odot \bm{w}^+ - \bm{w}^- \odot \bm{w}^- \in \mathbb{R}^D$ denote the regression parameter, and $L(\bm\beta) = \frac{1}{4}\left(X\bm\beta - \bm y\right)^\top\left(X\bm\beta - \bm y\right)$ is the standard quadratic loss. This parameterization of $\bm\beta$ by $\bm w$ can also be viewed as a 2-layer diagonal linear neural network with weights $\bm{w} \in \mathbb{R}^{2D}$ (see Section 4 of \citet{woodworth2020kernel} for a detailed study of the model). Despite its simplicity, this setup has been used to prove numerous insightful results for neural network training \citep{woodworth2020kernel,pesme2021implicit,nacson2022implicit,vivien2022label}.

When GD is applied to minimize loss \eqref{eq:regressionproblem} with respect to $\bm w$, from the GD update rule with infinitesimal step size $\eta$ we get\begin{align}
    \frac{d\bm w^+(t)}{dt} &= -\nabla_{\bm w^+}L(\bm w(t)),\\
    \frac{d\bm w^-(t)}{dt} &= -\nabla_{\bm w^-}L(\bm w(t)).
\end{align} Using the chain rule, we get the optimization dynamics of $\bm\beta(t)$: \begin{align}
    \frac{d\bm\beta(t)}{dt} = &-2\bm w^+(t) \odot \nabla_{\bm w^+} L(\bm w(t))\nonumber \\
    &+ 2\bm w^-(t) \odot \nabla_{\bm w^-} L(\bm w(t)).
    \label{eq:GDprimaldynamics}
\end{align}

Previous work \citep{woodworth2020kernel} shows that by defining a potential function
\begin{equation}
    \Psi_{\alpha}(\bm\beta) := \frac{1}{4}\sum_{i=1}^D \left(\beta_i \operatorname{arcsinh}\left(\frac{\beta_i}{2\alpha^2}\right) - \sqrt{\beta_i^2 + 4\alpha^4}\right), \label{eq:GDpotential}
\end{equation}
where $\alpha > 0$ is the initialization scale, we can project the dynamics \eqref{eq:GDprimaldynamics} onto the dual space using the mirror map $\nabla\Psi_{\alpha}$. Here the gradient is taken with respect to $\bm\beta$. By derivation in Appendix~\ref{appendixC}, it follows that the dual dynamics are given by: 
\begin{equation}
    \frac{d\nabla\Psi_{\alpha}(\bm\beta(t))}{dt} = -\nabla_{\bm\beta} L(\bm\beta(t)).\label{eq:GDdualdynamics}
\end{equation}
Since \eqref{eq:GDprimaldynamics} and \eqref{eq:GDdualdynamics} are equivalent, in the continuous-time limit, the evolution of $\bm\beta(t)$ using GD can be interpreted as following the MD algorithm \eqref{eq:dualvariable}-\eqref{eq:backtoprimalvariable} with mirror map $\nabla\Psi_{\alpha}$.
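
The equivalence can be sketched in a few lines; the complete derivation is the one in Appendix~\ref{appendixC}. The shorthand $\bm r(t) := X\bm\beta(t) - \bm y$ is introduced here only for illustration, and the initialization $\bm w(0) = \alpha \bm 1$ is assumed. Since $\nabla_{\bm\beta}L(\bm\beta) = \frac{1}{2}X^\top \bm r$, the chain rule gives $\nabla_{\bm w^{\pm}}L(\bm w) = \pm\, \bm w^{\pm} \odot X^\top \bm r$, and therefore
\begin{align*}
    \frac{d}{dt}\left(\bm w^+ \odot \bm w^-\right) &= -\bm w^+ \odot \bm w^- \odot X^\top \bm r + \bm w^+ \odot \bm w^- \odot X^\top \bm r = \bm 0,
\end{align*}
so that $w_i^+(t) w_i^-(t) = \alpha^2$ for all $t$. Coordinate-wise, $\left((w_i^+)^2 + (w_i^-)^2\right)^2 = \beta_i^2 + 4\alpha^4$, and \eqref{eq:GDprimaldynamics} becomes
\begin{align*}
    \frac{d\beta_i(t)}{dt} &= -2\left((w_i^+)^2 + (w_i^-)^2\right)\left[X^\top \bm r\right]_i = -2\sqrt{\beta_i^2 + 4\alpha^4}\,\left[X^\top \bm r\right]_i.
\end{align*}
Using $\frac{\partial \Psi_\alpha}{\partial \beta_i} = \frac{1}{4}\operatorname{arcsinh}\left(\frac{\beta_i}{2\alpha^2}\right)$ and $\frac{d}{d\beta_i}\operatorname{arcsinh}\left(\frac{\beta_i}{2\alpha^2}\right) = \frac{1}{\sqrt{\beta_i^2 + 4\alpha^4}}$, we obtain
\begin{align*}
    \frac{d}{dt}\frac{\partial \Psi_\alpha}{\partial \beta_i}(\bm\beta(t)) &= \frac{1}{4\sqrt{\beta_i^2 + 4\alpha^4}}\,\frac{d\beta_i(t)}{dt} = -\frac{1}{2}\left[X^\top \bm r\right]_i = -\left[\nabla_{\bm\beta}L(\bm\beta(t))\right]_i,
\end{align*}
which is the $i$-th coordinate of \eqref{eq:GDdualdynamics}.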

The dual dynamics \eqref{eq:GDdualdynamics} reveal an intrinsically low-dimensional structure of the dynamics of $\bm\beta(t)$ in the overparameterized setting where $N < D$. Specifically, the gradients $\nabla_{\bm\beta}L(\bm\beta)$ in the right-hand side of \eqref{eq:GDdualdynamics} are confined in a subspace $\text{span}\{\bm{x}^{(1)}, ..., \bm{x}^{(N)}\}$, which has dimension of at most $N$. Furthermore, by analyzing the dual dynamics, previous work \citep{woodworth2020kernel} proves that the convergent solution $\bm{\beta}^{\infty} := \lim_{t\to\infty}\bm\beta(t)$ satisfies the KKT conditions of the constrained optimization problem:

\begin{equation}\label{eq:GDconvergentsolution}
    \bm\beta^{\infty} = \underset{\bm\beta \in \mathbb{R}^D \text{ s.t. } X\bm\beta = \bm{y}}{\operatorname{argmin}}\,D_{\Psi_\alpha}(\bm\beta, \bm\beta(0)).
\end{equation}

In this work, we study the dynamics of smoothed sign descent for minimizing \eqref{eq:regressionproblem}. For smoothed sign descent, the weights are updated according to 
\begin{equation}
    \bm{w}_{t+1} = \bm{w}_t - \eta \cdot \frac{\nabla_{\bm{w}} L(\bm{w}_t)}{|\nabla_{\bm{w}} L(\bm{w}_t)| + \varepsilon \bm{1}}\label{eq:smoothedsigndescent},
\end{equation}
where $\varepsilon > 0$ is the stability constant and the operations are taken element-wise. Smoothed sign descent can be viewed as an adaptive gradient method with coordinate-wise adaptive learning rate $\eta_{i, t} = \frac{\eta}{|\left[\nabla_{\bm{w}}L(\bm{w}_t) \right]_i| + \varepsilon}$ for each $i$. The magnitude of the gradient can differ vastly across all coordinates, and thus, the update in each coordinate is scaled differently. As a result, the update direction no longer follows the opposite of the true gradient, unlike normalized gradient descent, which preserves the direction by applying a uniform normalization scale to all coordinates.
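
For instance, consider an illustrative two-dimensional gradient $\bm g = (1, 0.01)^\top$ with $\varepsilon = 0.01$; these values are chosen here purely for illustration and are not taken from the experiments. The element-wise update direction is
\begin{equation*}
    \frac{\bm g}{|\bm g| + \varepsilon \bm 1} = \left(\frac{1}{1.01},\ \frac{0.01}{0.02}\right)^\top \approx (0.99,\ 0.5)^\top,
\end{equation*}
so the ratio between the two coordinates shrinks from $100:1$ in the gradient to roughly $2:1$ in the update, whereas the normalized direction $\bm g / \Vert \bm g \Vert_2$ would keep the ratio at $100:1$.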

We suppose that the weights are initialized by $\bm{w}(0) = \alpha \mathbf{1}$, $\alpha > 0$. In the continuous-time limit, the dynamics of the weights become
\begin{equation}
    \frac{d\bm{w}(t)}{dt} = -\frac{\nabla_{\bm{w}}L(\bm{w}(t))}{|\nabla_{\bm{w}}L(\bm{w}(t))| + \varepsilon\mathbf{1}} \label{eq:weightdynmics}.
\end{equation}
This yields the dynamics of the regression parameter $\bm\beta(t)$ as follows, with $\bm\beta(0) = \bm{0}$:
\begin{align}
    \frac{d\bm{\beta}(t)}{dt} = &-2\bm w^+(t) \odot \frac{\nabla_{\bm{w}^+}L(\bm{w}(t))}{|\nabla_{\bm{w}^+}L(\bm{w}(t))| + \varepsilon\mathbf{1}} \nonumber \\
    &+ 2\bm w^-(t) \odot \frac{\nabla_{\bm{w}^-}L(\bm{w}(t))}{|\nabla_{\bm{w}^-}L(\bm{w}(t))| + \varepsilon\mathbf{1}}. \label{eq:betadynmics}
\end{align}

With a coordinate-wise adaptive learning rate, the update direction deviates from the true gradients and the mirror map $\nabla \Psi_{\alpha}$ for GD no longer applies. This leads to two interesting questions: \begin{enumerate}
    \item Can we formulate a mirror map to show equivalent dual dynamics for \eqref{eq:betadynmics}? 
    \item Can we use the dual dynamics to characterize the convergent solution among all solutions?
\end{enumerate}

\subsection{Main Results}
In this section, we present our answers to the two questions. Our results consist of three parts. In Propositions~\ref{proposition:dualdynamics} and~\ref{proposition:mainstage}, we construct a mirror map and formulate the dual dynamics for smoothed sign descent. In Theorem~\ref{maintheorem} and Corollary~\ref{corollary:NDcorollary}, we prove a characterization of the convergent solution. In Corollaries~\ref{corollary:epsilon} and~\ref{corollary:NDvarepsilon}, we further reveal the role of the stability constant in the convergent solution.


The weight dynamics \eqref{eq:weightdynmics} form a coupled system of nonlinear ODEs, with the stability constant $\varepsilon$ adding another layer of complexity. By the Picard--Lindel\"of theorem, there exists a unique solution to \eqref{eq:weightdynmics}. Solving this ODE system analytically is intractable. We make the following assumption to decouple the nonlinear ODE system into $N$ autonomous systems. This decomposition allows us to analyze the interactions among an arbitrary number of dimensions within each system.

\begin{assumption}\label{assumption:data}
    We assume that $y^{(n)}$ are non-zero, and that there exists a permutation of the columns of $X$ such that $X^\top X$ is block-diagonal with $N$ rank-1 blocks denoted by $B^{(n)} \in \mathbb{R}^{D_n \times D_n}$ for $n = 1, \dots, N$.
\end{assumption}

It is easy to see that this condition is equivalent to requiring that each row of $X$ has $D_n \geq 1$ non-zero elements denoted by $x^{(n)}_1, \dots, x^{(n)}_{D_n}$, where $\sum_{n=1}^N D_n = D$. While this assumption yields an easy optimization problem in the primal space, the dynamics of smoothed sign descent are very complex and intriguing. 
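
As an illustration, take $N = 2$, $D = 5$, $D_1 = 2$, and $D_2 = 3$, with hypothetical non-zero entries $a_1, a_2, b_1, b_2, b_3$ introduced here only for this example. Writing $\bm a = (a_1, a_2)^\top$ and $\bm b = (b_1, b_2, b_3)^\top$, we have
\begin{equation*}
    X = \begin{bmatrix}
        a_1 & a_2 & 0 & 0 & 0 \\
        0 & 0 & b_1 & b_2 & b_3
    \end{bmatrix},
    \qquad
    X^\top X = \begin{bmatrix}
        \bm a \bm a^\top & 0 \\
        0 & \bm b \bm b^\top
    \end{bmatrix},
\end{equation*}
where both diagonal blocks have rank one. In this case the loss separates across the two examples,
\begin{equation*}
    L(\bm\beta) = \frac{1}{4}\left(a_1\beta_1 + a_2\beta_2 - y^{(1)}\right)^2 + \frac{1}{4}\left(b_1\beta_3 + b_2\beta_4 + b_3\beta_5 - y^{(2)}\right)^2,
\end{equation*}
which is why the coordinates belonging to different blocks evolve independently of each other.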

We require the stability constant $\varepsilon$ to be small relative to the components of the initial gradient so that it does not overshadow the essential behavior of the dynamics as a smoothed version of sign descent. We notice that $\bm{w} = \bm{0}$ is a stationary point of the weight dynamics \eqref{eq:weightdynmics}. Since the weights are initialized as $\bm{w}(0) = \alpha \bm{1}$ where $\alpha > 0$, we assume that $\alpha$ is not too small, so that the dynamics do not get stuck near this stationary point. Moreover, we also avoid choosing a large initial value $\alpha$ that would dominate the value of the weights and overshadow the convergent behavior.

\begin{assumption}\label{assumption:epsilonalpha}
We assume that for each $n \in \{1, \dots, N\}$ and $i \in \{1, \dots, D_n\}$, the stability constant $\varepsilon$ and the initialization scale $\alpha$ satisfy: 
\begin{align}
    &0 \leq \varepsilon \leq \frac{1}{9}\frac{|x^{(n)}_i| |y^{(n)}|^{\frac{3}{2}}}{\sqrt{2\sum_{k=1}^{D_n} |x_k^{(n)}|}}, \\
    &\frac{9\varepsilon}{4\left|x^{(n)}_i y^{(n)}\right|} \leq \alpha \leq \frac{1}{3}\sqrt{\frac{|y^{(n)}|}{2\sum_{k=1}^{D_n} |x_k^{(n)}|}}.
\end{align}
\end{assumption}

\subsubsection{Three Stages}\label{section:threestages}
We begin by studying the sign and monotonicity of $\bm w^+(t)$ and $\bm w^-(t)$ in the following proposition, assuming they satisfy \eqref{eq:weightdynmics}. Proofs of the results in this section can be found in Appendix~\ref{appendixA}.
\begin{proposition}\label{proposition:signandmonotone}
For each coordinate $i \in \{1, \dots, D\}$, 
\begin{itemize}
    \item $w_i^+(t)$ and $w_i^-(t)$ are always non-negative,
    \item if $w_i^+(0)' > 0$, then $w^+_i(t)' \geq 0$ and $w^-_i(t)' \leq 0$ for all $t$,
    \item if $w_i^+(0)' \leq 0$, then $w^+_i(t)' \leq 0$ and $w^-_i(t)' \geq 0$ for all $t$.
\end{itemize}
\end{proposition}

For each $i$, based on this proposition, either $w^+_i(t)$ or $w^-_i(t)$ is monotonically non-decreasing. We denote the dominating weight that is monotonically non-decreasing by $u_i$, and we denote the one that is non-increasing by $v_i$, i.e.,
\begin{align*}
    u_i(t) &:= \begin{cases}
        w_i^+(t) & \text{ if } w_i^+(0)' > 0, \\
        w_i^-(t) & \text{else},
    \end{cases}\\
    v_i(t) &:= \begin{cases}
        w_i^-(t) & \text{ if } w_i^+(0)' > 0, \\
        w_i^+(t) & \text{else}.
    \end{cases}
\end{align*}

A key identity in the derivation of the mirror map for GD is that $w^+_i(t)w^-_i(t) = \alpha^2$ holds throughout the dynamics. However, this quantity is not conserved when coordinate-wise adaptivity is applied. In fact, we can show that $w^+_i(t) w^-_i(t) < \alpha^2$ for $t > 0$. The adaptive learning rate ensures a similar rate of change across all coordinates, and enables sufficient updates even when the gradient magnitude is relatively small. In particular, this allows the non-dominating weight $v_i(t)$ to diminish to negligible values early on. Based on this observation, we identify an initial stage of the dynamics after which $v_i(t)$ remains below a value on the order of $\varepsilon$ across all coordinates. The following proposition also shows that this initial stage ends no later than $t=2\alpha$.
\begin{proposition}\label{proposition:initialstage}
There exists $T_0 \in (0, 2\alpha]$ such that for all $t \geq T_0$, $v_i(t) \leq \frac{2\varepsilon}{|x^{(n)}_i y^{(n)}|}$ for all $i$. 
\end{proposition}

The proof hinges on bounding the value of $v_i(t)$ from above at the time $t=t_i$ when the gradient component $[\nabla_{\bm v}L(\bm w(t))]_i$ first reaches $\varepsilon$. Before $t_i$, the absolute value of the derivative $|v'_i(t)|$ is always greater than $\frac{1}{2}$, ensuring a rapid decrease of $v_i(t)$. Meanwhile, the non-negativity of $v_i(t)$ by Proposition~\ref{proposition:signandmonotone} guarantees that this stage of rapid decrease lasts no longer than $2\alpha$. Based on the expression $[\nabla_{\bm v}L(\bm w(t))]_i = v_i(t)|x^{(n)}_i r^{(n)}(t)|$, where $r^{(n)}(t)$ is the residual of the $n$-th example, we continue to bound $|r^{(n)}(t)|$ from below using the maximal growth of $u_i(t)$ during this short time period. Finally, the lower bound of $|r^{(n)}(t_i)|$ leads to the upper bound of $v_i(t_i)$ at $t_i$. We complete the proof by letting $T_0$ be the largest $t_i$ across all coordinates $i$.
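
The time bound in this argument can be made explicit. With the shorthand $g_i(t) := [\nabla_{\bm v}L(\bm w(t))]_i \geq 0$, which is introduced here only for illustration, the dynamics \eqref{eq:weightdynmics} give $v_i'(t) = -\frac{g_i(t)}{g_i(t) + \varepsilon}$. For $t < t_i$ we have $g_i(t) > \varepsilon$, so that
\begin{equation*}
    v_i'(t) < -\frac{1}{2}
    \quad\Longrightarrow\quad
    0 \leq v_i(t_i) \leq v_i(0) - \frac{t_i}{2} = \alpha - \frac{t_i}{2}
    \quad\Longrightarrow\quad
    t_i \leq 2\alpha.
\end{equation*}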

During the initial stage, both $\bm{u}(t)$ and $\bm{v}(t)$ follow sign descent approximately, which allows us to approximate the primal dynamics of $\bm\beta(t)$ by sign descent. After $T_0$, the dynamics of $\bm\beta(t)$ transition into the main stage, where $\bm v(t)$ remains small and the magnitude of $\bm \beta(t)$ is dominated by $\bm u(t)$. While the primal dynamics become complex, we formulate a mirror map so that the dual dynamics have a simplified structure that closely aligns with the sign of $\nabla_{\bm u} L(\bm w(t))$.
\begin{proposition}[Dual dynamics of smoothed sign descent]\label{proposition:dualdynamics}
    For $t > 0$, we define a potential function $\Phi_t(\bm\beta) = \frac{2}{3}\sum_{i=1}^{D} \left(|\beta_i| + v_{i, t}^2\right)^{\frac{3}{2}}$. The induced mirror map $\nabla \Phi_t: \mathbb{R}^D \to \mathbb{R}^D$ maps $\bm\beta(t)$ to the dual space. The dynamics in the dual space follow \begin{equation}\label{eq:dualdynamics}
        \frac{d\nabla \Phi_t (\bm{\beta}(t))}{dt} = -\operatorname{sgn}(\bm\beta(t)) \odot \frac{\nabla_{\bm{u}}L(\bm w(t))}{|\nabla_{\bm{u}}L(\bm w(t))| + \varepsilon\bm{1}}.
    \end{equation}
\end{proposition}

The potential function is time-varying with a time-dependent parameter $v_{i, t} := v_i(t)$. \citet{pesme2021implicit} also employ a time-varying potential function to construct a mirror map for the dynamics of stochastic GD. \citet{radhakrishnan2020linear} conduct a thorough analysis of the convergence of MD with time-dependent mirrors. For $t \geq T_0$, since the non-dominating weights $v_i(t)$ diminish to small values by Proposition~\ref{proposition:initialstage}, the potential function has a close connection with the $l_{3/2}$-norm of $\bm\beta(t)$, in contrast with the potential function \eqref{eq:GDpotential} for GD.
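
The mirror map can be checked coordinate-wise. The following is a sketch that treats $v_{i,t}$ as a fixed parameter when differentiating with respect to $\bm\beta$ and assumes $\beta_i(t) \neq 0$. Since $u_i(t) \geq v_i(t) \geq 0$, we have $|\beta_i(t)| = u_i(t)^2 - v_i(t)^2$, and hence
\begin{equation*}
    \left[\nabla \Phi_t(\bm\beta(t))\right]_i = \operatorname{sgn}(\beta_i(t))\left(|\beta_i(t)| + v_{i,t}^2\right)^{\frac{1}{2}} = \operatorname{sgn}(\beta_i(t))\, u_i(t).
\end{equation*}
Because $\operatorname{sgn}(\beta_i(t))$ does not change over time, differentiating in $t$ and using \eqref{eq:weightdynmics} gives
\begin{equation*}
    \frac{d}{dt}\left[\nabla \Phi_t(\bm\beta(t))\right]_i = \operatorname{sgn}(\beta_i(t))\, u_i'(t) = -\operatorname{sgn}(\beta_i(t))\,\frac{\left[\nabla_{\bm u}L(\bm w(t))\right]_i}{\left|\left[\nabla_{\bm u}L(\bm w(t))\right]_i\right| + \varepsilon},
\end{equation*}
which is the $i$-th coordinate of \eqref{eq:dualdynamics}. Moreover, setting $v_{i,t} = 0$ in the potential yields $\Phi_t(\bm\beta) = \frac{2}{3}\sum_{i=1}^D |\beta_i|^{\frac{3}{2}} = \frac{2}{3}\Vert \bm\beta \Vert_{3/2}^{3/2}$, which makes the connection with the $l_{3/2}$-norm explicit.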

The dual dynamics \eqref{eq:dualdynamics} indeed reveal a greatly simplified structure compared to the primal dynamics \eqref{eq:betadynmics}. The right-hand side of the original dynamics \eqref{eq:betadynmics} evolves in a complex way in the $D$-dimensional space as the weights are updated, while the right-hand side of the dual dynamics \eqref{eq:dualdynamics} reduces to two components, a sign vector and a vector approximating the sign of the gradient. The simplified structure enables us to understand the complex dynamics \eqref{eq:betadynmics} by studying the evolution of the two sign vectors. However, the formulation of the dual dynamics \eqref{eq:dualdynamics} differs from standard MD dynamics \eqref{eq:MDdualdynamics}, where the updates in the dual space align with the gradients exactly. This alignment has allowed previous work to show that the convergent solution satisfies the KKT conditions for Bregman divergence minimization as in \eqref{eq:GDconvergentsolution}. Therefore, further analysis of the dual dynamics \eqref{eq:dualdynamics} is required to understand the deviation from following the true gradients.
\begin{proposition}\label{proposition:mainstage}
    There exists $T > T_0$ such that we can divide the dynamics into two stages: 
    \begin{itemize}
        \item Sign descent stage: for $t \in [T_0, T)$, $\left|\nabla_{\bm u} L(\bm w(t))\right|_i > \varepsilon$ for all $i$,
        \item Convergence stage: for $t \in [T, \infty)$, $\min_i \left|\nabla_{\bm u} L(\bm w(t))\right|_i \leq \varepsilon$.
    \end{itemize}
\end{proposition}
At the beginning, the dual dynamics resemble sign descent when gradient components are relatively large compared to $\varepsilon$. The stability constant comes into effect when $\left|\nabla_{\bm u} L(\bm w )\right|_i$ becomes small. In Proposition~\ref{proposition:mainstage}, we prove the transition between the two stages by studying the evolution of the magnitude of each gradient component. Importantly, Proposition~\ref{proposition:mainstage} shows that once a gradient value reaches $\varepsilon$, it remains small for the duration of the dynamics. The dynamics then enter a convergence stage with diminishing magnitude of gradients. Eventually, the dynamics approximate the direction of $\nabla_{\bm u} L(\bm w)$ as all gradient components approach zero (see Lemma~\ref{lemma:convergence} in Appendix~\ref{appendixA}). 
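
The two regimes can be read off from the scalar map $g \mapsto \frac{g}{|g| + \varepsilon}$ that appears on the right-hand side of \eqref{eq:dualdynamics}. The following elementary bounds are stated here only as an illustration, with $g$ standing for a single gradient component:
\begin{align*}
    |g| > \varepsilon \quad &\Longrightarrow \quad \frac{1}{2} < \frac{|g|}{|g| + \varepsilon} < 1, \\
    |g| \leq \varepsilon \quad &\Longrightarrow \quad \frac{|g|}{2\varepsilon} \leq \frac{|g|}{|g| + \varepsilon} \leq \frac{|g|}{\varepsilon}.
\end{align*}
In the sign descent stage, every coordinate of the dual variable therefore moves at a speed between $\frac{1}{2}$ and $1$. In the convergence stage, any coordinate whose gradient magnitude is at most $\varepsilon$ moves at a speed proportional to that magnitude, up to a factor of $2$.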

We illustrate the transitions between the three stages in Figure~\ref{fig:threestages}. We randomly generate a dataset with $N=2$ and $D=5$ that satisfies Assumption~\ref{assumption:data} and set $\alpha=0.1$. We simulate the dynamics \eqref{eq:weightdynmics} using the ODE solver in \texttt{SciPy} and visualize the evolution of primal and dual variables. In the experiments, $T_0$ is calculated as the time at which $\max_i |\nabla_{\bm v}L(\bm w(t))|_i$ first reaches $\varepsilon$, while $T$ is calculated as the time at which $\min_i |\nabla_{\bm u}L(\bm w(t))|_i$ first reaches $\varepsilon$. Based on the dual dynamics \eqref{eq:dualdynamics} and Proposition~\ref{proposition:mainstage}, we expect the change to be linear in $[T_0, T]$ and the behavior to be incoherent in $[T, \infty)$.
In the initial stage when $t < T_0$, we observe that the primal variable changes linearly across all coordinates. During the sign descent stage when $T_0 \leq t < T$, the dual variable continues growing linearly with an approximately uniform rate in all coordinates, while $\bm\beta(t)$ no longer changes linearly. After $T$, the dynamics enter the convergence stage, where the primal and dual variables gradually approach the convergent point. We also observe that the value of $\varepsilon$ plays a key role in shaping the dynamics. For smaller $\varepsilon$, the dual variable follows sign descent more closely and converges to values concentrated around two distinct points across all coordinates, while for larger $\varepsilon$, the dual variable shows greater dispersion across all coordinates. We quantify the relationship between the value of $\varepsilon$ and the convergent solution in the following analysis.


\begin{figure*}[ht]
    \centering
    \begin{subfigure}[b]{0.45\textwidth}
    \centering
    \includegraphics[width=\textwidth]{primal_dual_epsilon_0.002.png}
    \caption{$\varepsilon = 0.002$}
    \end{subfigure}
    \begin{subfigure}[b]{0.45\textwidth}
    \centering
    \includegraphics[width=\textwidth]{primal_dual_epsilon_0.01.png}
    \caption{$\varepsilon = 0.01$}
    \end{subfigure}
    \caption{Evolution of the primal variable $\bm\beta(t)$ and the dual variable $\nabla \Phi_t (\bm\beta(t))$ in $\mathbb{R}^5$ under smoothed sign descent for different values of the stability constant $\varepsilon$. The vertical line $t=T_0$ marks the transition from the initial stage to the sign descent stage, and the line $t=T$ marks the transition to the convergence stage.}
    \label{fig:threestages}
\end{figure*}

\subsubsection{Characterization of Convergent Solution by Bregman Divergence}\label{section:characterization}
The convergent solution of the smoothed sign descent dynamics deviates from the exact KKT point of Bregman divergence minimization. However, we show that it satisfies the $\delta$-KKT conditions for a Bregman divergence style function. In this section, we build on the results about stage transitions and conduct an in-depth analysis to quantify and bound the error $\delta$. To emphasize the role of $\varepsilon$ in bounding the error, we impose an additional assumption on the block-diagonal structure from Assumption~\ref{assumption:data}.
\begin{assumption}\label{assumption:data2}
    We assume that each block $B^{(n)}$ of the block-diagonal matrix $X^\top X$ has size $D_n=2$.
\end{assumption}
The 2D block structure enables us to derive an explicit dependence of the bounds for $\delta$ on the stability constant $\varepsilon$ while remaining in the overparameterized setting for smoothed sign descent. By the spectral theorem, we can write $X^\top X = Q \Lambda Q^\top$ for an orthogonal matrix $Q$ and a diagonal matrix $\Lambda$. The matrix $Q$ is block-diagonal, and each block is a 2D rotation matrix parameterized by $\theta_n$. We have
\begin{align}
    B^{(n)} = \begin{bmatrix}
        \cos\theta_n & -\sin\theta_n \\
        \sin\theta_n & \cos\theta_n 
    \end{bmatrix} \begin{bmatrix}
        \lambda_n & 0 \\
        0 & 0 
    \end{bmatrix}
    \begin{bmatrix}
        \cos\theta_n & \sin\theta_n \\
        -\sin\theta_n & \cos\theta_n 
    \end{bmatrix},
\end{align}
where $\lambda_n > 0$ and $\cos\theta_n, \sin\theta_n$ are non-zero by Assumption~\ref{assumption:data}. Without loss of generality, we assume that $|\cos\theta_n| \geq |\sin\theta_n|$, which can be achieved by ordering the columns of $X$. We first present the result for $N=1$ to illustrate the key findings and then generalize the results to $N > 1$. We show in Appendix~\ref{appendixA} that the limit $\bm v^{\infty} := \lim_{t\to \infty}\bm v(t)$ exists, and we let $\Phi_{\infty}(\bm\beta) := \frac{2}{3}\sum_{i=1}^D\left(|\beta_i| + (v^\infty_{i})^2\right)^{\frac{3}{2}}$. By Proposition~\ref{proposition:initialstage}, we have $v_i^{\infty} = \mathcal{O}(\varepsilon)$ and $\Phi_{\infty}(\bm\beta) = \frac{2}{3}\sum_{i=1}^D |\beta_i|^{\frac{3}{2}} + \mathcal{O}(\varepsilon^2)$, i.e., up to the constant factor $\frac{2}{3}$, approximately the $\frac{3}{2}$-th power of the $l_{3/2}$-norm.
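
As a quick check of the two facts used above, the following elementary computations may help. They rely only on the factorization of $B^{(n)}$ and the definition of $\Phi_\infty$ given in this section, and they assume that $\bm\beta$ ranges over a bounded set. Multiplying out the factorization gives the rank-one form
\begin{equation*}
    B^{(n)} = \lambda_n \begin{bmatrix}
        \cos^2\theta_n & \cos\theta_n\sin\theta_n \\
        \cos\theta_n\sin\theta_n & \sin^2\theta_n
    \end{bmatrix},
\end{equation*}
so every entry of $B^{(n)}$ is non-zero exactly when $\cos\theta_n$ and $\sin\theta_n$ are both non-zero. For the potential, writing $a = |\beta_i|$ and $h = (v_i^\infty)^2 = \mathcal{O}(\varepsilon^2)$, the mean value theorem yields some $\xi \in [a, a+h]$ with
\begin{equation*}
    \frac{2}{3}\left((a+h)^{\frac{3}{2}} - a^{\frac{3}{2}}\right) = \xi^{\frac{1}{2}}\, h \leq (a+h)^{\frac{1}{2}}\, h = \mathcal{O}(\varepsilon^2),
\end{equation*}
which is the per-coordinate version of the stated approximation of $\Phi_\infty$.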

We define a Bregman divergence style function $E$ associated with the potential function $\Phi$ for smoothed sign descent by
\begin{equation}
    E(\bm\beta, \bar{\bm\beta}) := \Phi_{\infty}(\bm\beta) - \Phi_0(\bar{\bm\beta}) + \langle \nabla\Phi_{0}(\bar{\bm\beta}), \bar{\bm\beta} - \bm\beta \rangle.
\end{equation}
We let $\bm\beta_0 := \bm\beta(0) = \bm{0}$ denote the starting point. Let us consider the constrained optimization problem: \begin{equation}
    \underset{{\bm\beta \in \mathbb{R}^D \text{ s.t. } X\bm\beta = \bm y}}{\min} E(\bm\beta, \bm\beta_0).\label{eq:Bregmanminimization}
\end{equation}
For $\delta \geq 0$, a solution $\bm\beta^{*}$ satisfies the $\delta$-KKT conditions for \eqref{eq:Bregmanminimization} if $X\bm\beta^* = \bm y$ and there exists a scalar $\nu$ such that $\Vert \nabla_{\bm\beta} E(\bm\beta^*, \bm\beta_0) - \nu \nabla_{\bm\beta} (X\bm\beta^*)\Vert \leq \delta$.
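
To unpack this condition, note that the gradient of $E$ in its first argument follows directly from the definition of $E$. The component formula below additionally assumes $\beta_i \neq 0$, so that $|\beta_i|$ is differentiable:
\begin{align*}
    \nabla_{\bm\beta} E(\bm\beta, \bm\beta_0) &= \nabla\Phi_{\infty}(\bm\beta) - \nabla\Phi_{0}(\bm\beta_0),\\
    \left[\nabla\Phi_{\infty}(\bm\beta)\right]_i &= \mathrm{sign}(\beta_i)\left(|\beta_i| + (v_i^\infty)^2\right)^{\frac{1}{2}}.
\end{align*}
Since $\nabla_{\bm\beta}(X\bm\beta) = X^\top$, for $N=1$ the $\delta$-KKT condition can be read as
\begin{equation*}
    \left\Vert \nabla\Phi_{\infty}(\bm\beta^*) - \nabla\Phi_{0}(\bm\beta_0) - \nu X^\top \right\Vert \leq \delta,
\end{equation*}
and the case $\delta = 0$ recovers exact stationarity of the Lagrangian of \eqref{eq:Bregmanminimization}.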


\begin{theorem}\label{maintheorem}
        Let $N=1$ and suppose Assumption~\ref{assumption:data2} holds. As $t \to \infty$, the regression parameter converges to an interpolating solution. Let $\bm\beta^{\infty} := \lim_{t \to \infty}{\bm\beta(t)}$, which exists by Lemma~\ref{lemma:convergence} in Appendix~\ref{appendixA}. Then $\bm\beta^{\infty}$ satisfies the $\delta$-KKT conditions for \eqref{eq:Bregmanminimization} 
        with the error $\delta(\varepsilon)$ bounded by $\max\left\{|M_+|, |M_-|\right\}$, where
        \begin{align*}
            M_+ =~&\left(|\cos\theta_1|-|\sin\theta_1|\right)\lambda_1^{-\frac{1}{4}} |y^{(1)}|^{\frac{1}{2}} + \mathcal{O}(\varepsilon),\\
            M_- =~&\left(|\cos\theta_1|-|\sin\theta_1|\right)\left((2\lambda_1)^{-\frac{1}{4}}|y^{(1)}|^{\frac{1}{2}}-\alpha\right) \\
            &+ \mathcal{O}(\sqrt{\varepsilon}).
        \end{align*}
\end{theorem}

We present the main idea of the proof here and provide the full proof in Appendix~\ref{appendixB}. The exact expressions for $M_+$ and $M_-$ can be found in the full proof. First, we observe the connection between the gradient of $E$ and the integral of the dual dynamics \eqref{eq:dualdynamics} with respect to $t$. The structure of the dual dynamics enables us to calculate the deviation $\delta$ from the stationarity condition using the dominating weights $\bm u^\infty$. Next, using an orthogonal projection, we reduce the problem to bounding the absolute value of $\Delta := |\cos\theta_1|\left(u_2^{\infty}-u_2(0)\right) - |\sin\theta_1|\left(u_1^{\infty}-u_1(0)\right)$. To bound $\Delta$, we leverage the ratios between $u'_1(t)$ and $u'_2(t)$ in different stages of the dual dynamics, and focus on bounding the key quantity $u_2(T)$ at the transition between the sign descent stage and the convergence stage. During the sign descent stage, the leading terms of $u'_1(t)$ and $u'_2(t)$ are both $1$ in the Taylor expansion at $\varepsilon=0$, which guarantees a lower bound for $u_2(T)$. In the convergence stage, $u_1(t)$ dominates the growth, which allows us to derive an upper bound for $u_2^\infty$. Finally, a lower bound for $u_2^{\infty}$ leads to $\Delta \geq M_-$, while an upper bound leads to $\Delta \leq M_+$.
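
The last step of the sketch can be spelled out as follows. Assuming the two-sided bound $M_- \leq \Delta \leq M_+$ obtained above, we have
\begin{equation*}
    -\max\left\{|M_+|, |M_-|\right\} \leq -|M_-| \leq M_- \leq \Delta \leq M_+ \leq |M_+| \leq \max\left\{|M_+|, |M_-|\right\},
\end{equation*}
so that $|\Delta| \leq \max\left\{|M_+|, |M_-|\right\}$, which is the form of the bound stated in Theorem~\ref{maintheorem}.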

The derivation relies on the key quantity $u_2(T)$ at the stage transition, when the smallest gradient component reaches $\varepsilon$. The value of $\varepsilon$ is crucial in determining the stage transition, and it ultimately affects the convergent solution. We further reveal the relationship between $\varepsilon$ and the upper bound on $\delta$ in the following corollary, whose proof is given in Appendix~\ref{appendixB}.
\begin{corollary}\label{corollary:epsilon}
    We let $\mathcal{I}_\varepsilon$ be the range of $\varepsilon$ implied by Assumption~\ref{assumption:epsilonalpha}. There exists a non-degenerate interval $\mathcal{I}' \subseteq \mathcal{I}_\varepsilon$ such that for all $\varepsilon \in \mathcal{I}'$, \begin{equation}
        \delta(\varepsilon) \leq \bar{M} - \left( |\cos\theta_1| - |\sin\theta_1| \right)\frac{\sqrt{2}\varepsilon}{4\lambda_1^\frac{1}{2}|y^{(1)}|},
    \end{equation}
    where $\bar{M} := \left( |\cos\theta_1| - |\sin\theta_1| \right)\lambda_1^{-\frac{1}{4}}|y^{(1)}|^\frac{1}{2}$ is a quantity independent of $\varepsilon$.
\end{corollary}
The result highlights the role of $\varepsilon$ in bounding the KKT error. Given a fixed dataset, setting $\varepsilon=0$ ensures a larger rate of change when the gradient magnitude becomes very small, but the upper bound on the error is larger than that for smoothed sign descent with non-zero $\varepsilon$. Moreover, choosing a larger $\varepsilon$ within a certain interval shrinks the upper bound on the KKT error $\delta$. This suggests that, with a proper value of $\varepsilon$, the dynamics can converge to a solution closer to the minimizer of $E(\cdot, \bm\beta_0)$ over interpolating solutions. Therefore, our result provides theoretical grounds for the benefit of tuning $\varepsilon$ rather than using a small default value in adaptive gradient methods.
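
To make this dependence explicit, denote the right-hand side of the bound in Corollary~\ref{corollary:epsilon} by $b(\varepsilon)$, a name introduced here only for illustration. Then
\begin{equation*}
    b(\varepsilon) = \bar{M} - c\,\varepsilon, \qquad c := \left( |\cos\theta_1| - |\sin\theta_1| \right)\frac{\sqrt{2}}{4\lambda_1^\frac{1}{2}|y^{(1)}|} \geq 0,
\end{equation*}
where the sign of $c$ uses $|\cos\theta_1| \geq |\sin\theta_1|$. The bound is therefore affine and non-increasing in $\varepsilon$ on $\mathcal{I}'$, and strictly decreasing whenever $|\cos\theta_1| > |\sin\theta_1|$.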

Approximate KKT points are formally studied by \citet{andreani2011sequential}, who show that when the KKT conditions are satisfied approximately, the point is close to solving the optimization problem. This concept is commonly used in practice, for example when solving an optimization problem numerically: the iterative process is terminated once it finds a solution satisfying the approximate KKT conditions under a given error tolerance $\delta$, which is justified by the approximate optimality of such points. Therefore, by bounding the error $\delta$, we establish the approximate optimality of the convergent solution.

To visualize the convergent solutions for different values of $\varepsilon$, we plot the trajectory of $\bm\beta(t)$ using randomly generated data with $N=1$ and $D=2$ in Figure~\ref{fig:comparison-1}. We note that as $\varepsilon$ becomes larger, the convergent solution moves closer to the solution that minimizes $E(\cdot, \bm\beta_0)$ among all solutions. We also compute $E(\bm\beta^\infty, \bm\beta_0)$ for the convergent solutions obtained with different values of $\varepsilon$ and plot the trend in Figure~\ref{fig:comparison-2}. The plot confirms that for larger $\varepsilon$, the convergent solutions have smaller values of $E(\bm\beta^\infty, \bm\beta_0)$.

\begin{figure}[ht]
    \centering
    \includegraphics[width=0.8\linewidth]{comparison_of_solutions-1006.png}
    \caption{Trajectories of $\bm\beta(t)$ in $\mathbb{R}^2$ for different values of stability constant $\varepsilon$.}
    \label{fig:comparison-1}
\end{figure}

\begin{figure}[ht]
    \centering
    \includegraphics[width=0.79\linewidth]{Bregman_div_vs_epsilon-1006.png}
    \caption{Bregman divergence style function value $E(\bm\beta^\infty, \bm\beta_0)$ of convergent solutions with different values of stability constant $\varepsilon$.}
    \label{fig:comparison-2}
\end{figure}

\paragraph{Extension to $N > 1$.} We generalize the results to the case $N > 1$ in the following corollaries. The proofs can be found in Appendix~\ref{appendixB}. We show that the convergent solution satisfies the approximate KKT conditions for minimizing $E(\bm\beta, \bm\beta_0)$ among all solutions. Within a certain interval, a larger value of $\varepsilon$ leads to a greater reduction of the bound on the KKT error. The implications for tuning the stability constant $\varepsilon$ still hold.
\begin{corollary}\label{corollary:NDcorollary}
    For $N > 1$, let us suppose Assumption~\ref{assumption:data2} is satisfied. As $t \to \infty$, the regression parameter converges to an interpolating solution $\bm\beta^\infty$ that satisfies the $\bar{\delta}$-KKT conditions for \eqref{eq:Bregmanminimization}
    with the error $\bar{\delta}(\varepsilon)$ bounded by $\sum_{n=1}^N \max\left\{\Big|M^{(n)}_+\Big|,~\Big|M_-^{(n)}\Big|\right\}$, where
    \begin{align*}
            M^{(n)}_+ =~&\left(|\cos\theta_n|-|\sin\theta_n|\right)\lambda_n^{-\frac{1}{4}} |y^{(n)}|^{\frac{1}{2}} + \mathcal{O}(\varepsilon),\\
            M^{(n)}_- =~&\left(|\cos\theta_n|-|\sin\theta_n|\right)\left((2\lambda_n)^{-\frac{1}{4}}|y^{(n)}|^{\frac{1}{2}}-\alpha\right) \\
            &+ \mathcal{O}(\sqrt{\varepsilon}).
        \end{align*}
\end{corollary}

\begin{corollary}\label{corollary:NDvarepsilon}
There exists a non-degenerate interval $\mathcal{J} \subseteq \mathcal{I}_\varepsilon$ such that for all $\varepsilon \in \mathcal{J}$, \begin{align}
        \bar{\delta}(\varepsilon) \leq{} &\sum_{n=1}^N \left( |\cos\theta_n| - |\sin\theta_n| \right) \left(\lambda_n^{-\frac{1}{4}}|y^{(n)}|^\frac{1}{2}\right) \nonumber\\
        &{}- \left(\sum_{n=1}^N\left( |\cos\theta_n| - |\sin\theta_n| \right)\frac{\sqrt{2}}{4\lambda_n^\frac{1}{2}|y^{(n)}|}\right)\varepsilon.
    \end{align}
\end{corollary}
In overparameterized regression problems, the dimension $D$ is larger than the number of examples $N$, leading to infinitely many solutions. Our results establish that the solution found by smoothed sign descent approximately minimizes a measure of distance to the initial point related to the $l_{3/2}$-norm for quadratically parameterized models, and that the error bound scales only with $N$.
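
One way to read the dependence on $N$ is through the crude estimate
\begin{equation*}
    \sum_{n=1}^N \max\left\{\Big|M^{(n)}_+\Big|,~\Big|M_-^{(n)}\Big|\right\} \leq N \max_{1 \leq n \leq N} \max\left\{\Big|M^{(n)}_+\Big|,~\Big|M_-^{(n)}\Big|\right\},
\end{equation*}
so the bound in Corollary~\ref{corollary:NDcorollary} grows at most linearly in $N$ provided the per-block quantities are uniformly bounded. This uniform boundedness is an assumption made here for illustration only.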

\paragraph{Extension to Higher Order Models.} Our analysis generalizes to parameterizations of higher order $H \geq 2$ in the weights, given by $\bm{\beta} = \bm{u}^H - \bm{v}^H$. Here $s^H$ denotes the $H$-th Hadamard (element-wise) power of a vector $s$. This parameterization can be interpreted as a diagonal linear neural network of depth $H$, as explained by \citet{woodworth2020kernel}. The mirror map is induced by a potential function closely related to the $l_{2 - \frac{1}{H}}$-norm of $\bm\beta$, given by $\Phi^H_{t}(\bm\beta) := \sum_{i=1}^D \left(|\beta_i| + v^H_{i, t}\right)^{2-\frac{1}{H}}$, where $v_{i, t} = \mathcal{O}(\varepsilon)$. As the depth $H \to \infty$, the potential function approximates the squared $l_2$-norm.
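
As a small consistency check on the exponent, write $p(H) := 2 - \frac{1}{H}$, a notation introduced here only for illustration. Then
\begin{equation*}
    p(2) = \frac{3}{2}, \qquad p(3) = \frac{5}{3}, \qquad \lim_{H \to \infty} p(H) = 2,
\end{equation*}
so $H = 2$ recovers the exponent of the quadratic parameterization analyzed above, and the exponent increases monotonically towards $2$ as the depth grows.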

\section{CONCLUSION}
In this work, we propose an MD perspective on the dynamics of smoothed sign descent for overparameterized regression problems. We extend existing results beyond GD to a case where update directions deviate from true gradients due to adaptivity, and formulate the equivalent dual dynamics with a simplified structure. We also study the role of the stability constant $\varepsilon$ in bounding the deviation of the convergent solution from minimizing a Bregman divergence style function. This finding supports the benefit of tuning the stability constant $\varepsilon$. Future work may extend our analysis to widely used methods such as Adam, RMSProp, and AdaGrad. Because these methods use additional approximations to adapt the learning rates, further investigation is needed to understand the transitions among the three stages and to analyze the impact of the stability constant on the convergent solution.


