\section{Introduction}
This paper considers the following non-convex optimization problem under the stochastic approximation framework~\citep{robbins1951stochastic},
\begin{equation}
    \label{eq:expectation}
    \min_{\bx \in \bbR^n} F(\bx) \coloneqq \EE_{\xi \sim \mathcal{D}} \left[ f(\bx, \xi) \right],
\end{equation}
where $\xi$ is the random variable sampled from a distribution $\mathcal{D}$, and $n$ is the dimension of the decision variable $\bx$. The above framework encompasses a wide range of problems, ranging from the offline setting, wherein the objective function is minimized over a pre-determined number of samples, to the online setting, where the samples are drawn sequentially from the same distribution.

Since $F(\cdot)$ is non-convex,
finding its global optimum is NP-hard in general \citep{pardalos_quadratic_1991, hillar2013most}. Therefore, a practical goal is to find a \textit{stationary point}\footnote{A stationary point is a point $\bx$ satisfying $\|\nabla F(\bx)\| \leq \mathcal{O}(\epsilon)$.}. From the perspective of sample complexity, such a goal can be achieved using $\mathcal{O}(\epsilon^{-3})$ stochastic gradient and Hessian-vector product queries. Surprisingly, this result cannot be improved by any stochastic $p$-th order method with $p \geq 2$ under standard Lipschitz continuity assumptions \citep{arjevani_second-order_2020}. From this point of view, the benefits of second-order information seem limited.

In practice, however, problem \cref{eq:expectation} often has additional structure that enables global convergence and fast convergence rates, such as weakly strong convexity~\citep{necoara2019linear}, error bounds~\citep{luo1993error}, and the restricted secant inequality~\citep{zhang2013gradient}.
Among these conditions, \cite{karimi2016linear} show that the Polyak-\L{}ojasiewicz (P\L{}) condition~\citep{polyak1963gradient} is generally weaker than the others, and it is covered by the \emph{gradient dominance property} studied in this paper. Informally, we say a function $F(\cdot)$ is gradient-dominated with parameter $\alpha$ if there exists a constant $C > 0$ such that $F(\bx) - F(\bx^*) \leq C \|\nabla F(\bx)\|^{\alpha}$ for all $\bx$, where $\bx^* \in \arg\min_{\bx} F(\bx)$. When $\alpha = 2$, it reduces to the P\L{} condition. This property also has many important applications in different fields, including generalized linear models~\citep{foster2018uniform}, ResNet with linear activation~\citep{hardt2016identity}, and operations management~\citep{fatkhullin2022sharp}.
Moreover, \cite{agarwal2021theory} show that a weaker version of gradient dominance property with $\alpha  = 1$ is also satisfied in policy-based reinforcement learning (RL).

In this paper, we study the sample complexity required to ensure \textit{global optimality} in non-convex stochastic optimization with the gradient dominance property. Specifically, we are interested in the number of stochastic gradient and Hessian queries made along the iterations until reaching a point $\bx$ that satisfies $\mathbb{E} [ F(\bx) ] - F(\bx^*) \leq \epsilon$.

In contrast to the setting with only Lipschitz continuity assumptions, under the gradient dominance property both stochastic and deterministic second-order methods \citep{nesterov2006cubic,chayti2023unified,masiha2022stochastic} enjoy an improved sample complexity (or iteration complexity) compared with first-order methods \citep{nguyen2019tight,fontaine2021convergence}.
Nevertheless, a common drawback of these second-order methods is the expensive $\mathcal{O}(n^3)$ computational cost at each iteration to obtain an approximate solution to an inevitable \textit{cubic-regularized subproblem}\footnote{In our discussion, we solve linear systems by a direct method, i.e., by computing the matrix inverse directly. Although there are indirect methods for solving linear systems, they either have a worse dependence on the condition number or have the same computational cost as our proposed method.}. Recently, \cite{zhang2022homogenous} developed a novel homogeneous second-order method (HSODM), which only needs to find the leftmost eigenvector of an augmented matrix at each iteration, at a computational cost of $\tilde{\mathcal{O}}(n^2)$\footnote{In this paper, we use $\tilde{\mathcal{O}}(\cdot)$ to hide logarithmic factors.} \citep{kuczynski_estimating_1992}. They prove that HSODM not only avoids the $\mathcal{O}(n^3)$ computational burden required by cubic regularization, but also achieves the optimal iteration complexity \citep{carmon2021lower}.

Hence, a natural question arises:
\begin{quote}
    \textit{Can the homogenization approach be extended to gradient-dominated stochastic optimization, and maintain the state-of-the-art sample complexity?}
\end{quote}
In this paper, we give an affirmative answer to the question above. Our contributions are summarized as follows:
\begin{enumerate}

    \item First, we propose two non-trivial customized strategies to extend HSODM from non-convex optimization to gradient-dominated optimization. The success of HSODM is attributed to its fixed choice of the last diagonal element of the augmented matrix, which cannot be directly applied to gradient-dominated optimization (we explain this in \cref{section:two-strategies}). To tackle this challenge, we develop a perturbation strategy and design a novel parameter-searching strategy to construct the augmented matrix. Both differ significantly from HSODM and may be of independent interest for extending HSODM to convex optimization.
    
    \item Second, we propose a variant of HSODM for gradient-dominated stochastic optimization, SHSODM, and analyze its sample complexity for stochastic functions enjoying the gradient dominance property with $\alpha \in [1,2]$. When $\alpha \in [1,3/2)$, we further provide an enhanced result by employing variance reduction techniques. Specifically, the sample complexity of SHSODM can be improved to $ \mathcal{O}\left( \epsilon ^{- 2/\alpha}\right)$. Our results match the best-known sample complexity in the literature, obtained by the stochastic cubic-regularized Newton method (SCRN, \citealt{masiha2022stochastic}). For clarity, we give the detailed results\footnote{\cite{masiha2022stochastic} do not explicitly give the sample complexity of SCRN with variance reduction when $\alpha \in [1, 3/2)$. However, using a technique similar to the one they use for $\alpha=1$, we derive the corresponding result and present it here.} in \cref{table:comparsion}.
    
    \item Finally, SHSODM only requires solving an eigenvalue problem at each iteration, hence overcoming the heavy computational burden of second-order methods. Empirically, we test SHSODM in the context of RL, whose objective function enjoys gradient dominance property with $\alpha = 1$, and compare its performance with SCRN and other standard RL algorithms. The numerical experiments demonstrate that SHSODM is superior to these methods and immune to ill-conditioning.
\end{enumerate}
\begin{table*}[htbp]
    \caption{Sample complexity of different algorithms for gradient-dominated stochastic optimization. The third to sixth columns present the sample complexity, per iteration cost, whether the algorithm needs to solve linear systems at each iteration, and whether it matches the best-known result, respectively. Note that SGD represents stochastic gradient descent method, the prefix ``VR'' stands for the variance reduction version, and ``w.h.p.'' stands for ``with high probability''.}
    \label{table:comparsion}
    \centering
    \small
    \begin{tabular}{lclccc}
        \hline
        Algorithm         & $\alpha $                      & Sample Complexity                  & Per Iteration Cost & Free of Linear System & Best-known   \\
        \hline
        SGD   &   $ \multirow{5}{*}{$[1, 3/2)$} $   & $\mathcal{O}(\epsilon^{ -4/\alpha + 1 })$    & $\mathcal{O}(n)$             & \Checkmark              & \XSolidBrush \\
        SCRN  &    & $\mathcal{O}(\epsilon^{-7/(2\alpha) + 1 })$  & $\mathcal{O}(n^3)$           & \XSolidBrush            & \XSolidBrush   \\
        VR-SCRN  &                & $\mathcal{O}(\epsilon^{-2/\alpha})$  & $\mathcal{O}(n^3)$           & \XSolidBrush            & \Checkmark   \\
        \textbf{SHSODM} &    & $\mathcal{O}(\epsilon^{-7/(2\alpha) + 1 })$  & $\tilde{\mathcal{O}}(n^2  )$ & \Checkmark              & \XSolidBrush   \\
        \textbf{VR-SHSODM} &     & $\mathcal{O}(\epsilon^{-2/\alpha})$  & $\tilde{\mathcal{O}}(n^2  )$ & \Checkmark              & \Checkmark   \\
        \hline
        SGD   & $ \multirow{3}{*}{$3/2$}$      & $\mathcal{O}(\epsilon^{ - 5 /3  })$          & $\mathcal{O}(n)$             & \Checkmark              & \XSolidBrush \\
        SCRN  &                                & $ \mathcal{O}( \epsilon^{ -4/3 }\log(1/\epsilon) ) $ & $\mathcal{O}(n^3)$           & \XSolidBrush            & \Checkmark   \\
        \textbf{SHSODM} &                                & $ \mathcal{O}( \epsilon^{ -4/3 }\log(1/\epsilon) )$   & $\tilde{\mathcal{O}}( n^2 )$ & \Checkmark              & \Checkmark   \\
        \hline
        SGD   & $ \multirow{3}{*}{$(3/2, 2]$}$ & $\mathcal{O}(\epsilon^{ -4/\alpha + 1 })$    & $\mathcal{O}(n)$             & \Checkmark              & \XSolidBrush \\
        SCRN  &                                & $\mathcal{O}(\epsilon^{-2/\alpha}\log\log(1/\epsilon))~\text{w.h.p.}$ & $\mathcal{O}(n^3)$           & \XSolidBrush            & \Checkmark   \\
        \textbf{SHSODM} &                                & $\mathcal{O}(\epsilon^{-2/\alpha}\log\log(1/\epsilon))$  & $ \tilde{\mathcal{O}}(n^2) $ & \Checkmark              & \Checkmark   \\
        \hline
    \end{tabular}
\end{table*}

The rest of the paper is organized in the following manner. In the remainder of the section, we review some related literature. \cref{section:prelimi} gives a formal definition of gradient dominance property considered in this paper, and a brief introduction to HSODM. In \cref{section:two-strategies}, we propose two nontrivial customized strategies to extend HSODM to gradient-dominated optimization. In \cref{subsec:shsodm}, we formally describe our SHSODM and prove its sample complexity, which is improved later in \cref{subsec:sto-hsodm-vr} by adapting the variance reduction technique. \cref{section:numerical} provides the numerical experiments in RL, and demonstrates the superior performance of SHSODM to SCRN and other widely-used RL algorithms. Finally, \cref{section:conclusion} concludes the paper and presents several future research directions.

\subsection{Related Work}
We review some papers studying gradient dominance property and second-order methods for gradient-dominated optimization in both deterministic and stochastic settings. The more comprehensive literature review including first-order methods is provided in the Appendix.

\textbf{Gradient-dominated optimization and its applications.} The gradient dominance property with $\alpha = 2$ (or the P\L{} condition) was first introduced in \citet{polyak1963gradient}. It is strictly weaker than strong convexity, yet it is sufficient to guarantee a global linear convergence rate for first-order methods. \citet{karimi2016linear} further show that the P\L{} condition is weaker than most of the global optimality conditions that have appeared in the machine learning community. The gradient dominance property has also been established locally or globally under some mild assumptions for problems such as phase retrieval~\citep{zhou2017characterization}, blind deconvolution~\citep{li2019rapid}, neural networks with one hidden layer~\citep{li2017convergence,zhou2017characterization}, linear residual neural networks~\citep{hardt2016identity}, and generalized linear models and robust regression~\citep{foster2018uniform}. Meanwhile, it is worth noting that in policy-based RL, a weak version of the gradient dominance property with $\alpha = 1$ holds for certain classes of policies, such as the Gaussian policy~\citep{yuan2022general}.

\textbf{Second-order methods.} The original analysis for second-order methods under gradient dominance property appears in \citet{nesterov2006cubic}. They focus on the cubic-regularized Newton method (CRN) for $\alpha \in \{1, 2\}$. When $\alpha = 2$, they show that the algorithm has a superlinear convergence rate. When $\alpha = 1$, they prove that CRN has a two-phase pattern of convergence. The initial phase terminates superlinearly, while the second phase achieves an iteration complexity of $\mathcal{O}({\epsilon}^{-1/2})$. Afterward, the result in \citet{zhou2018convergence} gives a more fine-grained analysis of CRN for some functions satisfying K\L{} property (it covers the gradient dominance property except for the case $\alpha = 1$) by partitioning the interval $(1,2]$ into $(1, 3/2) $, $\{3/2\} $, and $ (3/2, 2] $, which enjoy sublinear, linear, and superlinear convergence rate, respectively. When it comes to the stochastic setting, \cite{masiha2022stochastic} obtain the sample complexity of SCRN for gradient-dominated stochastic optimization, which is the best-known result under this setting. Recently, \cite{chayti2023unified} consider the finite-sum setting. Nevertheless, all these papers rely on approximate solutions of cubic-regularized sub-problems.



\section{Preliminaries}
\label{section:prelimi}
In this section, we provide the preliminaries of our paper and introduce the novel second-order method, HSODM. First, we formally define the gradient dominance property.
\begin{assumption}[Gradient Dominance]\label{asp:gd_wdom}
    We say function $F(\bx)$ has the weak gradient dominance property with $ \alpha \in [1, 2]$, if there exist $C_{\text{weak}}>0$ and $\epsilon_{\text{weak}} > 0$ such that for all $ \bx \in \mathbb{R}^n$, it holds that
    \begin{equation}\label{eq:wgd}
        F(\bx) - F(\bx^*) \leq C_{\text{weak}} \|\nabla F(\bx)\|^\alpha + \epsilon_{\text{weak}}
    \end{equation}
    where $\bx^* \in \arg\min F(\bx)$.
    
    Furthermore, if $\epsilon_{\text{weak}} = 0$, we say function $F(\bx)$ has the strong gradient dominance property, that is, there exists $C_{\text{gd}} > 0$ such that for all $\bx \in \mathbb{R}^n$,
    \begin{equation}\label{eq:gd}
        F(\bx) - F(\bx^*) \leq C_{\text{gd}} \|\nabla F(\bx)\|^\alpha
    \end{equation}
\end{assumption}
In this paper, we consider the gradient-dominated function with $\alpha \in [1,2]$. Note again when $\alpha = 2$, \cref{eq:gd} is the P\L{} condition. If $\alpha = 1$, \cref{eq:wgd} relates to the gradient dominance property discussed in policy-based RL. We also refer readers to \citet{fatkhullin2022sharp} for more concrete examples. 
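
To make the definition concrete, we include a small illustrative example of our own (it is not taken from the references above). Consider $F(\bx) = \frac{1}{p}\|\bx\|^{p}$ with $p \in \{2, 3\}$, so that $\bx^* = 0$ and $F(\bx^*) = 0$. Then $\nabla F(\bx) = \|\bx\|^{p-2}\bx$ and $\|\nabla F(\bx)\| = \|\bx\|^{p-1}$, hence
\begin{equation*}
    F(\bx) - F(\bx^*) = \frac{1}{p}\|\bx\|^{p} = \frac{1}{p}\|\nabla F(\bx)\|^{p/(p-1)}.
\end{equation*}
For $p = 2$, this is \cref{eq:gd} with $\alpha = 2$ and $C_{\text{gd}} = 1/2$ (the P\L{} case), while for $p = 3$ it is \cref{eq:gd} with $\alpha = 3/2$ and $C_{\text{gd}} = 1/3$.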



\textbf{A Brief Overview of HSODM.} We now introduce the framework of HSODM. The key ingredient of HSODM is the homogenized quadratic model (HQM) constructed at each iterate $\bx_k$, $k=1, 2, \ldots, K$. At the $k$-th iteration, it builds a gradient-Hessian augmented matrix $A_k$ and solves a homogeneous quadratic optimization problem. Specifically, the following optimization problem is considered:
\begin{equation}
    \label{eq:homo-model}
    \begin{split}
        \min_{\|[\bv; t]\| \leq 1}
        \begin{bmatrix}
            \bv \\ t
        \end{bmatrix}^T
        \begin{bmatrix}
            \bH_k   & \bg_k   \\
            \bg_k^T & -\delta \\
        \end{bmatrix}
        \begin{bmatrix}
            \bv \\ t
        \end{bmatrix},
    \end{split}
\end{equation}
where $\bg_k$ is the gradient and $ \bH_k  $ is the Hessian. For ease of exposition, we define $A_k = [\bH_k,\bg_k; \bg_k^T,-\delta]$ and let $[ \bv_k; t_k]$ be the optimal solution to the above problem \cref{eq:homo-model}. In the next lemma, we characterize the optimal solution $[\bv_k; t_k]$.
\begin{lemma}[\citeauthor{zhang2022homogenous}, \citeyear{zhang2022homogenous}]
    \label{lemma:opt-cond-homo}
    Denote by $[\bv_k;t_k]$ the optimal solution to problem \cref{eq:homo-model}. We have:
    \begin{enumerate}
        \item[(1)]  There exists a dual variable $\theta_k \geq 0$ such that
            \begin{align}
                \label{eq.homoeig.soc}
            & \begin{bmatrix}
                \bH_k + \theta_k \cdot \bI & \bg_k            \\
                \bg_k^T                    & -\delta+\theta_k
            \end{bmatrix} \succeq 0, \\
            \label{eq.homoeig.foc}
            & \begin{bmatrix}
                \bH_k + \theta_k \cdot \bI & \bg_k            \\
                \bg_k^T                    & -\delta+\theta_k
            \end{bmatrix}
                \begin{bmatrix}
                    \bv_k \\ t_k
                \end{bmatrix} = 0,   \\
                \label{eq.homoeig.norm one} & \theta_k \cdot ( \|[\bv_k; t_k]\| - 1 ) = 0.
            \end{align}
            Moreover, $ -\theta_k = \lambda_{\min}(A_k)$.
        \item[(2)] If $t_k \neq 0$, then it holds that    \begin{equation}\label{eq.homoeig.foc t neq 0}
                    \bg_k^T \bd_k = \delta -\theta_k,~~ (\bH_k+\theta_k \cdot \bI)\bd_k =-\bg_k,
                \end{equation}
                where $\bd_k =\bv_k / t_k$.
        \item[(3)] If $t_k = 0$, then $-\theta_k$ is the smallest eigenvalue of $\bH_k$ and $ \bg_k^T \bv_k = 0 $.
    \end{enumerate}
\end{lemma}
Note that, if one sets $\delta \ge 0$, the augmented matrix $A_k$ cannot be positive definite, since its last diagonal entry $-\delta$ gives $\lambda_{\min}(A_k) \le -\delta \le 0$. By \cref{lemma:opt-cond-homo}, the negative optimal dual solution, i.e., $-\theta_k$, is the smallest eigenvalue of $A_k$, and the optimal primal solution $[\bv_k; t_k]$ is its associated eigenvector. Moreover, when $t_k \neq 0$, the constructed direction $\bd_k$ is a descent direction by \cref{eq.homoeig.foc t neq 0}. Hence, one can compute the next iterate by $\bx_{k+1} = \bx_k + \eta_k \bd_k$, where $\eta_k > 0$ is the stepsize. The convergence analysis of HSODM under the deterministic non-convex setting heavily depends on the fixed choice of $\delta = \Theta(\sqrt{\epsilon})$. However, this strategy for choosing $\delta$ cannot be directly applied to gradient-dominated stochastic optimization, which will be discussed in detail in \cref{section:two-strategies}.
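
For completeness, the descent claim can be checked in a few lines; the following is a sketch that uses only \cref{lemma:opt-cond-homo}. When $t_k \neq 0$, dividing \cref{eq.homoeig.foc} by $t_k$ and reading off the two block rows gives $(\bH_k + \theta_k \bI)\bd_k = -\bg_k$ and $\bg_k^T \bd_k = \delta - \theta_k$, which is \cref{eq.homoeig.foc t neq 0}. Since $\bH_k + \theta_k \bI$ is a principal submatrix of the positive semidefinite matrix in \cref{eq.homoeig.soc}, it is itself positive semidefinite, and therefore
\begin{equation*}
    \bg_k^T \bd_k = -\bd_k^T (\bH_k + \theta_k \bI) \bd_k \leq 0.
\end{equation*}
The inequality is strict whenever $\bg_k \neq 0$: if the quadratic form vanished, positive semidefiniteness would force $(\bH_k + \theta_k \bI)\bd_k = 0$, that is, $\bg_k = 0$.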

From the perspective of computational complexity, solving \cref{eq:homo-model} is essentially solving an extreme eigenvalue problem with respect to the augmented matrix $A_k$, since it has been shown by \cref{eq.homoeig.norm one} that $[ \bv_k; t_k]$ always attains the boundary of the unit ball. In view of this, the subproblems can be solved within a time complexity of $\tilde{\mathcal{O}}(n^2)$ \citep{kuczynski_estimating_1992}, which is cheaper than classical CRN~\citep{nesterov2006cubic} and its stochastic counterpart, where $\mathcal{O}(n^3)$ arithmetic operations are unavoidable. Like other second-order methods, HSODM also benefits from Hessian-vector products (HVPs), so the Hessian matrix itself need not be stored.
\begin{remark}
    To solve \cref{eq:homo-model}, one can apply a Lanczos-type algorithm \citep{kuczynski_estimating_1992}, which only needs access to the matrix-vector product $A_k [\bv;t]$ for any given $[\bv;t]$. This product can be computed as
    \begin{equation*}
       A_k [\bv;t] = \begin{bmatrix}
            \bH_k   & \bg_k   \\
            \bg_k^T & -\delta_k \\
        \end{bmatrix}
        \begin{bmatrix}
            \bv \\ t
        \end{bmatrix} =  \begin{bmatrix}
            \bH_k \bv + t \bg_k    \\
            \bg_k^T \bv - t \delta_k \\
        \end{bmatrix}.
    \end{equation*}
    Hence, only the HVP $\bH_k \bv$ is needed.
\end{remark}

\section{Customized Strategies}
\label{section:two-strategies}
In this section, we introduce two customized strategies to extend HSODM to the gradient-dominated setting. As HSODM was originally designed for non-convex optimization, it cannot be directly applied to gradient-dominated functions, which also include convex functions on a bounded set. When the fixed choice rule of $\delta$ is used, we cannot connect the decrease in function value to the gradient. Consequently, HSODM only attains an unsatisfactory convergence rate, and several refinements must be adopted to fit our purpose.

Inspired by the analysis of CRN \citep{nesterov2006cubic}, we need to ensure that $|\lambda_k|$ and $\|\dk\|$ have the same order and diminish simultaneously as the algorithm proceeds, where $\lambda_k = \lambda_{\min}(A_k)$. In other words, we need to approximately solve
\begin{equation}\label{eq:ls}
    h (\delta_k) = |\lambda_k| - C_e \| \dk \| = 0,
\end{equation}
where $C_e>0$ is a pre-specified constant. However, this desired relationship cannot be guaranteed by the fixed strategy to choose $\delta$ employed by HSODM. To address this challenge, we propose to adaptively choose $\delta_k$ at each iteration $k$ by a nontrivial line-search procedure.

Before delving into the analysis, we first parameterize the augmented matrix $A_k$ via $\delta_k$. Specifically, we let 
\begin{equation*}
    A_k( \delta_k ) = \begin{bmatrix}
        \Hk              & \gk       \\
        \gk^T & -\delta_k
    \end{bmatrix}. 
\end{equation*}
Thus, $\lambda_k$ and $\dk$ can also be seen as functions of $\delta_k$ (recall that $[v_k; t_k]$ is the leftmost eigenvector of $A_k(\delta_k)$):
\begin{equation*}
    \dk(\delta_k) := v_k / t_k, \qquad \lambda_k (\delta_k ) := \lambda_{\min}(A_k(\delta_k)).
\end{equation*}
Therefore, we are now able to adjust $\delta_k$ of $A_k$ to find a better descent direction such that $\lambda_k$ and $\|\dk\|$ have the same order.
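
As a sanity check on this parameterization, consider the scalar case $n = 1$ with $\Hk = h$ and $\gk = g \neq 0$; this is an illustration of ours and not part of the analysis. The eigenvalues of $A_k(\delta_k)$ are the roots of $\lambda^2 - (h - \delta_k)\lambda - (h\delta_k + g^2) = 0$. Combining the smaller root with the first row of the eigen-equation, $h v_k + g t_k = \lambda_k v_k$, gives
\begin{equation*}
    \lambda_k(\delta_k) = \frac{(h - \delta_k) - \sqrt{(h + \delta_k)^2 + 4g^2}}{2},
    \qquad
    \dk(\delta_k) = \frac{-g}{h - \lambda_k(\delta_k)} = \frac{-2g}{(h + \delta_k) + \sqrt{(h + \delta_k)^2 + 4g^2}}.
\end{equation*}
Both quantities are continuous in $\delta_k$. Moreover, $\lambda_k(\delta_k)$ is strictly decreasing in $\delta_k$ and negative for all $\delta_k \geq 0$, while $|\dk(\delta_k)|$ is strictly decreasing and tends to zero as $\delta_k \to \infty$. In this scalar case, the magnitude of the eigenvalue and the length of the direction can therefore be balanced by tuning $\delta_k$ alone.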
\begin{algorithm}[!htb]
% \small
\SetAlgoLined
\caption{Linesearch}
\label{algo:ls}
\KwIn{Current iterate $\xk$, $\Hk,\gk$, tolerance $\epsilon_{\text{ls}},\epsilon_{\text{eig}}$, initial search interval $ [\delta_l,\delta_r]$, ratio $ C_e $.}
Call \cref{algo:per} with $g_k, H_k, \epsilon_{\text{eig}}$ and collect $\gk'$\;
\For{$j=1,...,J_k$}{
Let $ \delta_m = (\delta_l + \delta_r) / 2 $ \;
Construct $A_k(\delta_m) := [\Hk, \gk'; (\gk')^T, \delta_m]$\;
Calculate the leftmost eigenpair $ \left( \lambda_k, [v_k;t_k] \right) $ of $A_k(\delta_m)$\;
Calculate $d_k := v_k / t_k$\;
\eIf{ $ C_e \left\| d_k \right\| \leq  |\lambda _k| $ }{
$ \delta_l \leftarrow \delta_m $\; 
}{
    $ \delta_r \leftarrow \delta_m $\; 
    }
\eIf{$|\delta_r-\delta_l| < \epsilon_{\text{ls}}$}{
\Return{$ \delta_m, d_k $}\;
    }
{
    $j \leftarrow j+1$;    
}
}
\end{algorithm}

In \cref{algo:ls}, we provide the linesearch procedure to adaptively adjust $\delta_k$ at each iteration $k$. In particular, we use binary search to find an appropriate $\delta_k$. We terminate the procedure once the search interval is sufficiently small, and conclude that $|\lambda_k|$ is approximately of the same order as $\|d_k\|$.
However, a caveat of the above procedure is that a degenerate solution may exist if $g_k$ is orthogonal to the leftmost eigenspace $\mathcal S_{\min}$ of $\Hk$. In this case, the eigenvector provides no information about the gradient and \cref{eq:ls} may not have a solution. Such a case is often referred to as the ``hard case'' in the literature on trust-region methods \citep{conn2000trust}. To overcome this obstacle, we use a random perturbation over $\gk$, through which the perturbed gradient $g_k'$ is no longer orthogonal to the minimal eigenvalue space $\mathcal{S}_{\min}$, i.e., $\|\mathcal{P}_{\mathcal S_{\min}}(g_k')\| \geq \epsilon_\text{eig}$, where $\epsilon_\text{eig}$ is the pre-determined tolerance and $\mathcal{P}_{\mathcal S_{\min}}(\cdot)$ represents the projection of a given vector onto $\mathcal{S}_{\min}$.
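
A concrete instance of the hard case may help; it is an illustration of ours, written with the parameterized matrix $A_k(\delta_k)$ defined above. Take $n = 2$, $\Hk = \mathrm{diag}(-1, 1)$ and $\gk = (0, 1)^T$, so that $\mathcal{S}_{\min} = \mathrm{span}\{(1,0)^T\}$ and $\gk \perp \mathcal{S}_{\min}$. Then
\begin{equation*}
    A_k(\delta_k) = \begin{bmatrix}
        -1 & 0 & 0 \\
        0 & 1 & 1 \\
        0 & 1 & -\delta_k
    \end{bmatrix},
\end{equation*}
whose eigenvalues are $-1$, with eigenvector $[v; t] = [(1,0)^T; 0]$, and $\frac{1}{2}\big((1-\delta_k) \pm \sqrt{(1+\delta_k)^2+4}\big)$. A short calculation shows that the smaller of the latter two exceeds $-1$ exactly when $\delta_k < 1/2$. For such $\delta_k$, the leftmost eigenvector has $t_k = 0$, so the direction $v_k / t_k$ is undefined. If the gradient is replaced by a perturbed vector such as $(\epsilon_{\text{eig}}, 1)^T$, any eigenvector with $t = 0$ would have to be proportional to $(1,0)^T$ and orthogonal to the perturbed gradient, which is impossible; hence $t_k \neq 0$ for every $\delta_k$ in this example.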

We prove in the Appendix that, with the linesearch strategy after perturbation, an approximate solution to \cref{eq:ls} can always be found because $\lambda_k$ is continuous in $\delta_k$. Interestingly, the perturbation strategy, whose details are presented in \cref{algo:per}, is ``one-shot'': it is invoked at most once when needed. For cases where $\mathcal S_{\min}$ consists of multiple eigenvectors, one can simply compute the inner product of $\gk$ and any $v\in \mathcal S_{\min}$, and adopt the perturbation if it is not sufficiently bounded away from zero. Besides, it is known that the power method also applies in such scenarios, since only the projection is needed \citep{golub_matrix_2013}. The finite-step termination property of \cref{algo:ls} is guaranteed by the following theorem.
\begin{algorithm}[!htb]
\SetAlgoLined
\caption{A Perturbation Strategy}
\label{algo:per}
% \small
\KwIn{Current iterate $\xk$, $\Hk,\gk$, tolerance $\epsilon_{\text{eig}} $.}
% Calculate the leftmost eigenpair $\lambda_k, [v_k;t_k]$ of $A_k(0)$\;
Compute the projection of $\gk$ onto the minimal eigenvalue space, $\mathcal{P}_{\mathcal S_{\min}}(\gk)$\;
\eIf{$ \Vert \mathcal{P}_{\mathcal S_{\min}}(\gk)\Vert \geqslant \epsilon_{\text{eig}}$}{
\Return $\gk$\;
}{
    $ \gk' \leftarrow \gk+\epsilon _{\text{eig}} \cdotp \mathcal{P}_{\mathcal S_{\min}}(\gk) / \Vert \mathcal{P}_{\mathcal S_{\min}}(\gk)\Vert $\;
    \Return{$\gk'$}\;
    }
\end{algorithm}

% By at most one inquiry of this method, we eliminate the undesired case. Moreover, both the construction of direction $\dk$ and the line search over $\delta_k$ are well-defined, since $\lambda_k$ is continuous over $\delta_k$ as proved in Appendix.
 
\begin{theorem}[Finite-step Termination of \cref{algo:ls}] \label{lemma:ls_err}
\cref{algo:ls} terminates in $\mathcal{O}(\log( 1 /\epsilon _{\text{ls}} \epsilon _{\text{eig}}))$ steps. Furthermore, it produces an estimate $\hat{\delta }_{C_e}$ of $\delta_k$ and the direction $\dk$ such that $|h(\hat{\delta }_{C_e}) |\leq \epsilon _{\text{ls}}$, where $\epsilon _{\text{ls}} > 0$ is the tolerance.
\end{theorem}

The above theorem shows that the number of iterations $J_k$ of \cref{algo:ls} is at most $\mathcal{O}(\log( 1/\epsilon _{\text{ls}} \epsilon _{\text{eig}}))$. The extra call to \cref{algo:per} is only necessary if we find $\gk \perp \mathcal S_{\min}$. Since \cref{algo:per} requires $\tilde{\mathcal{O}}(n^2 \epsilon^{-1/4})$ arithmetic operations, which is consistent with the computational cost of \cref{algo:ls}, we retain the advantage of a cheaper computational cost compared with solving cubic-regularized subproblems.


\section{SHSODM for Gradient-Dominated Stochastic Optimization}
\label{section:shsodm-and-vr-shsodm}

In \cref{subsec:shsodm}, we give the details of our SHSODM for the gradient-dominated stochastic optimization when $\alpha \in [1, 2]$ and provide its sample complexity analysis. For the case $\alpha \in [1, 3/2)$, we further incorporate the variance reduction techniques into SHSODM and present an improved sample complexity result in \cref{subsec:sto-hsodm-vr}.

\subsection{SHSODM}
\label{subsec:shsodm}
We begin by outlining the key steps of SHSODM. At each iteration $k$, we first randomly draw a sample set $S$ and use it to construct stochastic approximations of the gradient and the Hessian, namely $ \hat{g}_k = \nabla_{S} F( \bx_{k}) = \sum _{i=1}^{|S|} \nabla f ( \bx_k ,\xi _{i}) / |S|$ and $ \hat{H}_k = \nabla_{S}^2 F( \bx_{k})  = \sum _{i=1}^{|S|} \nabla^2 f ( \bx_k ,\xi _{i}) / |S |$. Then, we employ \cref{algo:ls} and \cref{algo:per} (if necessary) to obtain the update direction $d_k$; this step terminates in $\mathcal{O}(\log( 1/\epsilon _{\text{ls}} \epsilon _{\text{eig}}))$ iterations. Finally, we choose the stepsize $\eta_k = 1$ for all $k$ and update $\bx_{k+1} = \bx_k + d_k$. The details are provided in \cref{algo:hsodm}.

Before analyzing the sample complexity of SHSODM, we make some assumptions used throughout the paper.
\begin{assumption}
    \label{assum:smooth}
    Assume that function $F(\cdot)$ is twice differentiable. The gradient of $f(\bx, \xi)$ and the Hessian of $F(\bx)$ are Lipschitz continuous, respectively. That is, there exist $L_g > 0$ and $L_H > 0$ such that $\Vert \nabla f(\bx,\xi) -\nabla f(\by,\xi) \Vert \leq L_g\Vert \bx-\by\Vert$ and $\Vert \nabla^2 F(\bx) -\nabla^2 F(\by) \Vert \leq L_H \Vert \bx-\by\Vert$ for all $\bx, \by \in \bbR^n $ and $\xi$ almost surely.
\end{assumption}
\begin{assumption}[\citealt{masiha2022stochastic}]\label{asp:var}
    Assume that for each query point $\bx \in \mathbb{R}^n$, the stochastic gradient and Hessian are unbiased, and their variance satisfies $\EE[\left\| \nabla F(\bx) - \nabla f(\bx, \xi) \right\|^2 ]\leq \sigma^2_g$ and $\EE[\left\| \nabla^2 F(\bx) - \nabla^2 f(\bx, \xi) \right\|^{2\alpha} ]\leq \sigma^2_{h,\alpha}$, where $\sigma_g > 0$ and $\sigma_{h, \alpha} > 0$ are two constants.
\end{assumption}
\cref{assum:smooth} ensures the Hessian of function $F(\cdot)$ and the gradient of stochastic function $f(x, \xi)$ are both Lipschitz continuous, which is widely used in the optimization literature \citep{nesterovLecturesConvexOptimization2018, masiha2022stochastic, chayti2023unified}. \cref{asp:var} guarantees the stochastic gradient and Hessian are unbiased and have bounded variances, which is standard in stochastic optimization and also used by \cite{masiha2022stochastic}. 
\begin{remark}
When analyzing SGD under the non-convex setting, \cite{khaled2022better} propose the expected smoothness assumption, which is weaker than \cref{asp:var}. However, it is difficult to analyze the behavior of second-order algorithms under this assumption, since it cannot control the term $\mathcal{O}(\mathbb{E}[\|g_k-\hat{g}_k\|^2]^{\alpha/2}+\mathbb{E}[\|H_k-\hat{H}_k\|^{2\alpha}])$ emerging in the analysis, which critically influences the sample complexity. Hence, we leave the relaxation of \cref{asp:var} for future work.
\end{remark}
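
To indicate how \cref{asp:var} controls the gradient part of this error term, we record a standard calculation; it is a sketch that assumes the samples in $S$ are drawn i.i.d. from $\mathcal{D}$, independently of $\bx_k$. Conditioned on $\bx_k$, the summands $\nabla f(\bx_k, \xi_i) - \nabla F(\bx_k)$ are independent with zero mean, so the cross terms vanish and
\begin{equation*}
    \EE\left[ \| \hat{g}_k - \nabla F(\bx_k) \|^2 \right]
    = \frac{1}{|S|^2} \sum_{i=1}^{|S|} \EE\left[ \| \nabla f(\bx_k, \xi_i) - \nabla F(\bx_k) \|^2 \right]
    \leq \frac{\sigma_g^2}{|S|}.
\end{equation*}
For instance, a batch size of order $\epsilon^{-2/\alpha}$ makes $\left(\EE[ \| \hat{g}_k - \nabla F(\bx_k) \|^2 ]\right)^{\alpha/2}$ of order $\sigma_g^{\alpha} \epsilon$.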
\begin{algorithm}[!htb]
% \small
    \caption{SHSODM}
    \label{algo:hsodm}
    \KwIn{Total number of iterations $K$, sample size $n_g,n_H$, tolerance $\epsilon_{\text{ls}},\epsilon_{\text{eig}}$, lower bound $\delta_l$ and upper bound $\delta_r$ of the linesearch procedure}
    \For{$k\leftarrow 1$ \KwTo $K$}{
    Draw samples with $ |S^g_k| = n_g$ and $|S^H_k| = n_H  $ \\
    Construct the empirical estimators $ \hat{g}_k $ and $\hat{H}_k  $ \\
    Call \cref{algo:ls} with $(\hat{g}_k, \hat{H}_k, \epsilon_{\text{ls}}, \epsilon_{\text{eig}}, \delta_l, \delta_r)$ to obtain $ \delta_k $ and $d_k $\;
    % }
    Update current point $ \bx_{k+1} = \bx_k + d_k$\;
    }
    \Return{$\bx_K$}\;
\end{algorithm}

In the next theorem, we give the sample complexity of SHSODM. It partitions the interval $[1, 2]$ into three non-overlapping subsets and provides the corresponding sample complexity for each. When $\alpha = 1$, SHSODM has its worst sample complexity, $\mathcal{O}(\epsilon^{-2.5})$, whereas when $\alpha = 2$, it attains its best sample complexity, $\tilde{\mathcal{O}}(\epsilon^{-1})$. This result matches that of SCRN \citep{masiha2022stochastic}. However, SHSODM does not need to solve linear systems, hence its per-iteration cost is lower than SCRN's, which is also validated in \cref{section:numerical}. We also remark that, theoretically, we only need access to stochastic gradients and HVPs, rather than the stochastic Hessian. We state the remaining theorems in terms of sampled Hessians only to ease the comparison with SCRN.
\begin{theorem}[Sample Complexity of SHSODM]\label{thm:hsodm}
    Suppose that function $F(\cdot)$ satisfies \cref{assum:smooth}, \cref{asp:var} and \cref{asp:gd_wdom} with equation \cref{eq:gd} for some $\alpha \in [1, 2]$. Given a tolerance $\epsilon > 0$, let $n_g = \mathcal{O}(\epsilon^{-2/\alpha})$, $n_H = \mathcal{O}(\epsilon^{-1/\alpha})$, $\epsilon _{\text{eig}} =\mathcal{O}( \epsilon ^{1/\alpha })$, and $\epsilon _{\text{ls}} =\mathcal{O}( \epsilon ^{1/\alpha })$.
    Then \cref{algo:hsodm} outputs a solution $ \bx_K $ such that $ \mathbb{E}[F(\bx_K)-F(\bx^*)]\le\epsilon $ after $K$ iterations, where
    \begin{enumerate}
        \item[(1)] If $\alpha \in [1, 3/2)$, then $K = \mathcal{O}(\epsilon^{-3/(2\alpha)+1})$ with a sample complexity of $\mathcal{O}(\epsilon^{-7/(2\alpha)+1})$.
        \item[(2)] If $\alpha = 3/2$, then $K = \mathcal{O}(\log(1/\epsilon))$ with a sample complexity of $\mathcal{O}\left(\log(1/\epsilon) \epsilon^{-2/\alpha}\right)$.
        \item[(3)] If $\alpha \in (3/2, 2]$, then $K = \mathcal{O}(\log\log(1/\epsilon))$ with a sample complexity of $\mathcal{O}\left(\log\log(1/\epsilon) \epsilon^{-2/\alpha}\right)$.
    \end{enumerate}
\end{theorem}
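
\noindent The rates in \cref{thm:hsodm} can be checked with a short calculation. As a sketch only, assume that the sample complexity is counted as the number of iterations times the per-iteration batch sizes, $K (n_g + n_H)$; this accounting is our reading of the statement and not part of it. For $\epsilon \in (0, 1)$, the Hessian batch $n_H = \mathcal{O}(\epsilon^{-1/\alpha})$ is of lower order than the gradient batch $n_g = \mathcal{O}(\epsilon^{-2/\alpha})$, so for $\alpha \in [1, 3/2)$,
\begin{align*}
    K (n_g + n_H) = \mathcal{O}\left(\epsilon^{-3/(2\alpha)+1}\right) \cdot \mathcal{O}\left(\epsilon^{-2/\alpha}\right) = \mathcal{O}\left(\epsilon^{-7/(2\alpha)+1}\right).
\end{align*}
Setting $\alpha = 1$ gives $\epsilon^{-1/2} \cdot \epsilon^{-2} = \epsilon^{-2.5}$, while in case (3) with $\alpha = 2$ the same product becomes $\log\log(1/\epsilon) \cdot \epsilon^{-1}$.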

In the context of policy-based RL, \cref{asp:gd_wdom} holds with equation \cref{eq:wgd} for $\alpha = 1$. The following corollary emphasizes the sample complexity of SHSODM under this specific setting, which is used in our numerical experiments.
\begin{corollary}[Informal, Sample Complexity for Policy-Based RL]\label{cor:rl}
Under policy-based RL, \cref{assum:smooth} and \cref{asp:var} hold. Moreover, \cref{asp:gd_wdom} holds with equation \cref{eq:wgd} for $ \alpha = 1 $. Then, \cref{algo:hsodm} outputs a solution $ \bx_K $ such that $ \mathbb{E}[F(\bx_K)-F(\bx^*)]<\epsilon + \epsilon_{\text{weak}}$ with a sample complexity of $ \mathcal{O}(\epsilon ^{-2.5}) $.
\end{corollary}
As a byproduct, we also study the performance of the deterministic counterpart of SHSODM for gradient-dominated functions. The next corollary shows that it matches the convergence rate proved for CRN \citep{nesterov2006cubic}, while still avoiding the solution of linear systems.
\begin{corollary}[Deterministic Setting]
    Under the gradient-dominated deterministic setting, the iteration complexity of the deterministic counterpart of SHSODM is $\mathcal{O}(\epsilon^{-3/(2\alpha)+1})$ when $\alpha \in [1, 3/2)$, $\mathcal{O}\left(\log(1/\epsilon)\right)$ when $\alpha = 3/2$, and $\mathcal{O}\left(\log\log(1/\epsilon)\right)$ when $\alpha \in (3/2, 2]$.
\end{corollary}

\subsection{SHSODM with Variance Reduction}
\label{subsec:sto-hsodm-vr}
In the previous analysis, we employ the same batch size at each iteration. 
In this subsection, we improve the sample complexity of SHSODM when $\alpha \in [1,3/2)$ via time-varying batch sizes and the variance reduction technique proposed in \cite{fang2018spider}.
In particular, we let $K_C$ be the period length and $S_k$ be the sample set used to estimate the gradient at iteration $k$.
Then, we construct different empirical estimators of the gradient and the Hessian depending on whether $k$ is a multiple of $K_C$.
In \cref{algo:hsodm_vr}, we present the explicit forms of the newly introduced estimators and give the details of SHSODM with the variance reduction technique, referred to as VR-SHSODM in \cref{table:comparsion}.

The next theorem specifies the choice of the time-varying batch size and shows that the sample complexity of VR-SHSODM can be further improved to $\mathcal{O}\left(\epsilon^{-2/\alpha}\right)$ when $\alpha \in [1,3/2)$. This also applies to the weak gradient dominance property with $\alpha \in [1,3/2)$. When $\alpha = 1$, the sample complexity improves from $\mathcal{O}(\epsilon^{-2.5})$ in the last subsection to $\mathcal{O}(\epsilon^{-2})$, which improves upon the best-known sample complexity of SGD (or the stochastic policy gradient method for RL) and matches the result of SCRN.
\begin{theorem}[Sample Complexity of VR-SHSODM]
\label{thm:vr}
Under the same assumptions as in \cref{thm:hsodm}, if $\alpha \in [1, 3/2)$, $K_C = \mathcal{O}(K)$, and
{\small
\begin{align*}
    n_{k,g} &=\begin{cases}
    \mathcal{O}( k^{4/( 3-2\alpha )})  & k \bmod K_C=0 \\
    \mathcal{O}\left(\Vert d_{k}\Vert ^{2} K_C( \lfloor  k / K_C  \rfloor K_C)^{4/( 3-2\alpha )}\right) & k \bmod K_C\neq 0
    \end{cases}
\end{align*}
}
then \cref{algo:hsodm_vr} outputs a solution $\bx_K$ such that $\mathbb{E}[F(\bx_K)-F(\bx^*)]\le \epsilon$ in $K = \mathcal{O}(\epsilon^{-3/(2\alpha)+1})$ iterations with a sample complexity of $\mathcal{O}\left(\epsilon^{-2/\alpha}\right)$.
\end{theorem}
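
\noindent To see the size of the improvement in \cref{thm:vr}, one can compare the exponents of the two bounds. The following is only an illustrative calculation based on the rates stated in \cref{thm:hsodm} and \cref{thm:vr}, not an additional result. For $\alpha \in [1, 3/2)$,
\begin{align*}
    \frac{\epsilon^{-7/(2\alpha)+1}}{\epsilon^{-2/\alpha}} = \epsilon^{-3/(2\alpha)+1},
\end{align*}
which is exactly the order of the iteration count $K$ and diverges as $\epsilon \to 0$. In other words, variance reduction removes the multiplicative factor $K$, so that the total sample cost is of the same order as a single gradient batch $n_g = \mathcal{O}(\epsilon^{-2/\alpha})$ of \cref{thm:hsodm}. The checkpoint batch size is consistent with this: since $K = \mathcal{O}(\epsilon^{-(3-2\alpha)/(2\alpha)})$, evaluating $k^{4/(3-2\alpha)}$ at $k = K$ gives
\begin{align*}
    K^{4/(3-2\alpha)} = \mathcal{O}\left(\epsilon^{-\frac{3-2\alpha}{2\alpha} \cdot \frac{4}{3-2\alpha}}\right) = \mathcal{O}\left(\epsilon^{-2/\alpha}\right).
\end{align*}
For $\alpha = 1$, the ratio above is $\epsilon^{-1/2}$, which is the gap between $\epsilon^{-2.5}$ and $\epsilon^{-2}$.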
\begin{algorithm}[!htb]
    \SetAlgoLined
    \caption{SHSODM with Variance Reduction}
    \label{algo:hsodm_vr}
    \KwIn{Maximum number of iterations $K$, period length $K_{C}$, tolerances $\epsilon_{\text{ls}}$, $\epsilon_{\text{eig}}$, linesearch bounds $\delta_l$, $\delta_r$}
    \For{$k\leftarrow 1$ \KwTo $K$}{
        Draw samples with $|S_k| = n_{k,g}$\;
        \eIf{$k~\mathrm{mod}~K_C = 0$}
        {
        $ v_{k} \leftarrow  \nabla_{S_{k}} f( x_{k}) $\;
        $ H_{k} \leftarrow  \nabla_{S_{k}}^2 f( x_{k}) $\;
        }
        {
        $ v_{k} \leftarrow  \nabla_{S_{k}} f( x_{k}) -\nabla_{S_{k}} f( x_{k-1}) +v_{k-1} $ \;
        $ H_{k} \leftarrow  \nabla_{S_{k}}^2 f(x_{k}) -\nabla_{S_{k}}^2 f(x_{k-1}) +H_{k-1} $ \;}
        Call \cref{algo:ls} with $(v_k, H_k, \epsilon_{\text{ls}}, \epsilon_{\text{eig}}, \delta_l, \delta_r)$ to obtain $\delta_k$ and $d_k$\;
      Update current point $ x_{k+1} = x_k + d_k$\;
    }
    \Return{$x_K$}\;
\end{algorithm}

\section{Numerical Experiments}
\label{section:numerical}

\begin{figure*}[htb]
\begin{minipage}{0.49\linewidth}
    \centering
    \includegraphics[width=0.9\linewidth]{figs/halfcheeta.png}
    \subcaption{HalfCheetah-v2}
\end{minipage}
\begin{minipage}{0.49\linewidth}
    \centering
    \includegraphics[width=0.9\linewidth]{figs/walker.png}
    \subcaption{Walker2d-v2}
\end{minipage}

\begin{minipage}{0.49\linewidth}
    \centering
    \includegraphics[width=0.9\linewidth]{figs/humanoid.png}
    \subcaption{Humanoid-v2}
\end{minipage}
\begin{minipage}{0.49\linewidth}
    \centering
    \includegraphics[width=0.9\linewidth]{figs/hopper.png}
      \subcaption{Hopper-v2}
\end{minipage}
\caption{\textbf{The $x$-axis and $y$-axis represent the system probes and the average return, respectively.} The solid curves depict the mean values of five independent simulations, while the shaded areas correspond to the standard deviation.}\label{fig:rl-results}
\end{figure*}

\begin{table*}
    \centering
    \begin{tabular}{lcccc}
    \hline
Environment   & SHSODM & SCRN & TRPO & VPG \\
\hline
HalfCheetah-v2 &  $ \boldsymbol{ 2259 \pm 217 } $     &  $1115 \pm 230$     &  $1822 \pm 214 $  & $1078 \pm 319$   \\
Walker2d-v2 &  $ \boldsymbol{1630 \pm 224} $  &  $ 543 \pm 164 $  &  $ 1389 \pm 265 $ &  $ 1024 \pm 448 $ \\
Humanoid-v2 &  $  \boldsymbol{498 \pm 20} $  &  $ 484 \pm 43 $   &  $ 458 \pm 20 $ & $ 495 \pm 28 $  \\
Hopper-v2 & $\boldsymbol{ 1856 \pm 74 }$ &   $ 1473 \pm 318 $  &  $ 1752 \pm 86 $    &  $ 1779 \pm 67 $  \\
\hline
    \end{tabular}
    \caption{Max average return $\pm$ standard deviation over $5$ trials of $10^7$ system probes. Maximum value for each task is bolded.}
    \label{tab:rl-ma-return}
\end{table*}





In this section, we evaluate the empirical performance of SHSODM in the context of RL, which serves as a standard scenario for gradient-dominated stochastic optimization \citep{masiha2022stochastic}. In RL, the interaction between the agent and the environment is often described by an infinite-horizon, discounted Markov decision process (MDP), and the agent aims to maximize the discounted cumulative reward. Due to the page limit, we defer a brief introduction to MDPs to the Appendix. It can be shown under some standard assumptions that the objective function of the MDP has the gradient dominance property with $\alpha = 1$.

We compare SHSODM with several algorithms, including SCRN~\citep{masiha2022stochastic}, TRPO~\citep{schulman2015trust} and VPG~\citep{williams1992simple}. In the Appendix, we further provide experiments that compare SHSODM with PPO~\citep{schulman2017proximal}. As mentioned before, SCRN is a second-order method that attains the best-known sample complexity for gradient-dominated stochastic optimization. TRPO and its variant PPO are among the most important workhorses behind deep RL and enjoy empirical success. Practically speaking, TRPO can be seen as a ``second-order'' method, since it uses the second-order information of the constraints to update the policy. The first-order method VPG is included to serve as a benchmark and to validate our theoretical analysis, which is a common practice in the literature \citep{tripuraneni2018stochastic, kohler2017sub, zhou2018convergence}. We implement SHSODM using the garage library \citep{garage}, written with PyTorch \citep{paszke2019pytorch}, which also provides implementations of TRPO and VPG. For SCRN, we use the open-source code offered by \cite{masiha2022stochastic}.

In our experiments, we test several representative robotic locomotion tasks using the MuJoCo\footnote{\url{https://www.mujoco.org}.} simulator in Gym\footnote{Gym is an open-source Python library for developing and comparing RL algorithms; see \url{https://github.com/openai/gym}.}. The goal of each task is to control a simulated robot so that it achieves the highest return through smooth and safe movements.
Specifically, we consider $4$ control tasks with continuous action spaces, namely \texttt{HalfCheetah-v2}, \texttt{Walker2d-v2}, \texttt{Humanoid-v2}, and \texttt{Hopper-v2}. For each task, we employ a Gaussian multi-layer perceptron (MLP) policy whose mean and variance are parameterized by an MLP with $2$ hidden layers of $64$ neurons and the $\tanh$ activation function. We set the batch size to $10^4$ and the number of epochs to $10^3$, leading to a total of $10^7$ time steps. To ensure a fair comparison, we adopt the same network architecture for all algorithms.

When computing the discounted cumulative reward, a baseline term is subtracted in order to reduce the variance. It is noteworthy that the resulting gradient estimator is still unbiased. For all methods, we train a linear feature baseline, which has been implemented in garage. For the hyperparameters of each algorithm, we note that garage's implementations of TRPO and VPG are robust, and tuning their parameters does not lead to a significant change in performance. Hence, we use their default parameters. For SCRN, we use grid search to find the best hyperparameter combination for each environment. For SHSODM, we choose the trust-region radius $r \in \{ 0.05, 0.08 \}$. Although our theory always sets it to $1$ (see the constraint in \cref{eq:homo-model}), we treat it as a hyperparameter in the implementation. To conduct the experiments, we use a Linux server with an Intel(R) Xeon(R) E5-2680 v4 CPU operating at 2.40GHz, 128 GB of memory, and an NVIDIA Tesla V100 GPU.

Following the existing literature \citep{masiha2022stochastic, huang2020momentum}, we use the number of system probes, i.e., the number of sampled state-action pairs, as the measure of sample complexity instead of the number of trajectories, because different trajectories may have varying lengths. For each task, we run each algorithm for a total of $10^7$ system probes, employing $5$ different random seeds for both the network initialization and the Gym simulator.

\textbf{Overall Comparisons.} \cref{fig:rl-results} presents the training curves for the aforementioned MuJoCo environments. SHSODM outperforms SCRN and VPG in all tested environments. Compared with TRPO, although SHSODM improves slowly at the beginning, it achieves the best final performance by a clear margin. It can also be observed that SCRN is less robust and even inferior to the first-order algorithm VPG. This, in turn, indicates that SHSODM enjoys more stable performance. To further illustrate the robustness of SHSODM, following \cite{zhao2022stochastic}, we provide the maximal average return and the standard deviation over five trials in \cref{tab:rl-ma-return}. A higher maximal average return indicates that the algorithm is able to obtain a better agent. \cref{tab:rl-ma-return} shows that SHSODM has the highest maximal average return in all tested environments.


\textbf{Cost in Computing Directions.} We now compare the computational efficiency of SHSODM with that of SCRN, since both are second-order algorithms. We profile the total time required to compute the update direction for the two algorithms on some representative environments. For this comparison, we run each algorithm for $10^3$ epochs with a batch size of $10^3$ per epoch. \cref{fig:time} shows that SHSODM needs much less time than SCRN and is thus more efficient. We remark that this result also has a theoretical explanation. By analyzing the Hessian of the objective function, we observe that its condition number $\kappa$ is huge in each iteration, typically on the order of $10^7$. When one uses the conjugate gradient method or the gradient descent method to solve the cubic-regularized subproblem required by SCRN, the time complexity depends heavily on the Hessian's condition number \citep{golub_matrix_2013}. This incurs an unacceptable computational burden given how ill-conditioned the Hessian is. Fortunately, for SHSODM, the Lanczos-type algorithm we use to solve the eigenvalue problem at each iteration is proved to be independent of the condition number \citep{saadNumericalMethodsLarge2011}. Therefore, SHSODM is better suited to the ill-conditioned problems that arise in RL.
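
To give a rough sense of scale for this condition-number dependence, consider the classical worst-case bound for the conjugate gradient method applied to a symmetric positive definite linear system with condition number $\kappa$. This is an idealized model used here only for illustration, and the subproblem solver actually used inside SCRN may behave differently. The symbols $e_j$ (the error after $j$ iterations), $\|\cdot\|_A$ (the energy norm), and $\tau$ (a target relative accuracy) are introduced only for this sketch:
\begin{align*}
    \|e_j\|_A \le 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{j} \|e_0\|_A \le 2 \exp\left( -\frac{2j}{\sqrt{\kappa}} \right) \|e_0\|_A,
\end{align*}
so that $j \ge \frac{\sqrt{\kappa}}{2} \log(2/\tau)$ iterations suffice to reach $\|e_j\|_A \le \tau \|e_0\|_A$. For $\kappa = 10^7$ we have $\sqrt{\kappa} \approx 3.2 \times 10^{3}$, so this bound allows for thousands of iterations per subproblem, each requiring one Hessian-vector product. The analogous worst-case bound for gradient descent scales linearly in $\kappa$ rather than in $\sqrt{\kappa}$.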

\begin{figure}[H]
    \centering
    \includegraphics[scale=0.5]{figs/time.png}
    \caption{The $x$-axis represents three different tested environments, including \texttt{HalfCheetah-v2}, \texttt{Hopper-v2}, and \texttt{Walker2d-v2}. The $y$-axis presents the total time required to obtain the update direction in $10^3$ epochs.}
    \label{fig:time}
\end{figure}

\section{Conclusion}
\label{section:conclusion}
The gradient dominance property enjoys wide applications in practice. In this paper, we extend HSODM from non-convex optimization to the gradient-dominated stochastic setting, which requires two extra customized strategies and differs from HSODM in nature. Consequently, we propose a novel stochastic second-order algorithm, SHSODM. It inherits the advantage of HSODM, requiring only the solution of an eigenvalue problem at each iteration, which is computationally cheaper than the cubic-regularized subproblem required by other second-order methods such as SCRN. Theoretically, we prove that SHSODM has a sample complexity matching the best-known result.
Meanwhile, we demonstrate on several reinforcement learning tasks that SHSODM not only has better and more stable performance than SCRN and other widely used algorithms in deep RL, but is also more efficient and robust in handling some ill-conditioned optimization problems in practice.
For future research, one interesting direction is to apply the homogenization method to stochastic nonsmooth optimization.



\section*{Acknowledgement}
The authors are grateful to the Area Chairs and the anonymous reviewers for their constructive comments. This research is partially supported by the Major Program of the National Natural Science Foundation of China (Grants 72394360 and 72394364).