
\section{Introduction}
In this work, we study the problem of recovering an unknown matrix $\target \in \mathbb{R}^{d_1 \times d_2}$ from its random linear measurements $\RHS := \mathcal{A}(\target) \in \mathbb{R}^m$, where the linear operator $\mathcal{A}: \domain \rightarrow \mathbb{R}^m$ is defined as
\begin{equation}
[\mathcal{A}(\mathbf{X})]_i:=\frac{1}{\sqrt{m}}\left\langle\bm{A}_i, \mathbf{X}\right\rangle, \qquad i=1,2, \ldots, m.
\label{eq: operator A}
\end{equation}
Here, $\bm{A}_i\in\domain$ are measurement matrices, $\langle \cdot,\cdot\rangle$ is the standard inner product in $\domain$, and $m \ll d_1 d_2$, making the problem inherently underdetermined. To overcome this challenge, we assume that $\target$ has rank $r$, effectively reducing the degrees of freedom in the matrix to $r(d_1 + d_2-r)$. 
Under this assumption, exact recovery of $\target$ becomes theoretically feasible when the number of measurements $m$ scales on the order of this degree of freedom.
This problem, known as the low-rank matrix recovery problem, lies at the intersection of theoretical and applied mathematics, with profound implications across machine learning, signal processing, and statistics. It encompasses several classical problems, such as matrix completion \citep{candes_power_2010_completion, gross_recovering_2011_completion,sun_Luo_guaranteed_2016}, phase retrieval \citep{candes_phaselift_2013}, and quantum state tomography \citep{hsu_quantum_2024}, among others \citep{chi_nonconvex_2019}.
The core challenge lies in recovering $\target$ using as few measurements $m$ as possible, ideally matching the information-theoretic lower bound of $\Omega(r(d_1 + d_2-r))$, while ensuring that the recovery method remains computationally efficient, operating in polynomial time as problem dimensions grow.
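As a quick sanity check on the count $r(d_1+d_2-r)$, one can tally the free parameters of a compact SVD $\bm{U}\bm{\Sigma}\bm{V}^T$ of a rank-$r$ matrix. The numerical values in the second line are illustrative choices only and are not taken from any experiment.

```latex
\begin{align*}
\underbrace{d_1 r - \tfrac{r(r+1)}{2}}_{\bm{U}\text{ (orthonormal columns)}}
+ \underbrace{r}_{\bm{\Sigma}\text{ (diagonal)}}
+ \underbrace{d_2 r - \tfrac{r(r+1)}{2}}_{\bm{V}\text{ (orthonormal columns)}}
&= r(d_1+d_2-r), \\
d_1=d_2=1000,\ r=5: \qquad r(d_1+d_2-r) = 5\cdot 1995 = 9975
&\ll d_1 d_2 = 10^{6}.
\end{align*}
```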
%In theoretical analysis, the problem where the measurement operators are drawn from a Gaussian distribution is often referred as matrix sensing. 
%While this assumption on the measurement operators is strong, the results are highly insightful and serve as the golden standard for studying linear measurement operators.

A prominent line of research focuses on convex relaxation methods, where the low-rank matrix is represented in $\mathbb{R}^{d_1 \times d_2}$, and the nuclear norm $\|\cdot\|_{*}$ is used as a convex surrogate for the rank function. 
For applications such as matrix sensing \citep{recht_guaranteed_2010}, matrix completion \citep{candes_power_2010_completion, gross_recovering_2011_completion}, and blind deconvolution and demixing \citep{jung_blind_2017}, it has been shown that this approach can achieve exact recovery with $m$ scaling as $\Omega(r(d_1 + d_2))$, up to logarithmic factors, matching the information-theoretically optimal sample complexity. 
However, these convex methods are computationally demanding, as they require optimization in the entire space $\domain$, and the low-rank structure of the solution is not easily exploited.

To address these computational challenges, non-convex approaches have gained prominence. 
Factorization-based methods represent the low-rank matrix as $\bm{L}\bm{R}^T$, where $\bm{L} \in \mathbb{R}^{d_1 \times r}$ and $\bm{R} \in \mathbb{R}^{d_2 \times r}$. This reduces the number of optimization variables to $r(d_1 + d_2)$, significantly fewer than the $d_1 d_2$ variables in convex approaches.
Simple algorithms such as gradient descent and alternating minimization, when initialized appropriately, have been shown to converge linearly to the global minimum under suitable assumptions on $\mathcal{A}$ and $\target$ \citep{jain2013low,tu_low-rank_nodate,chen2020nonconvex2,sun_Luo_guaranteed_2016,tong_accelerating_nodate,charisopoulos_low-rank_2019,zilber2022gnmr}.
Another class of non-convex methods leverages manifold optimization, eliminating redundancy in the factorization parametrization either by representing factors on quotient Riemannian manifolds \citep{keshavan2009matrix, huang2017solving, zheng2022riemannian} or by optimizing directly on the Riemannian manifold of rank-$r$ matrices embedded in $\domain$ \citep{RGD, cai_solving_2018_RGD,hsu_quantum_2024}.
These methods are often more efficient and have also been proven to converge linearly to $\target$ from a spectral initialization under appropriate conditions.
However, a critical limitation of fast non-convex approaches is their suboptimal sample complexity, typically requiring $m=\Omega(r^2(d_1 + d_2))$ or higher, which scales quadratically with $r$.
Iterative Hard Thresholding (IHT) \citep{tanner2013normalized, tu_low-rank_nodate} achieves $m = \Omega(r(d_1 + d_2))$, but its computational cost is higher than the aforementioned fast non-convex methods due to repeated $r$-truncated singular value decompositions (SVD) on full matrices, which incur larger constant factors compared to matrix multiplication (MM) of the same computational order.
% \citep{tanner2013normalized, jain2010guaranteed_SVP, blanchard2015cgiht}  

The feasibility of simultaneously achieving optimal sample complexity and low computational cost remains an open research question. Recently, \citet{stoger_non-convex_2024} made progress in this direction for the special case of low-rank positive semidefinite (PSD) matrix sensing. By assuming Gaussian measurement matrices and representing the PSD matrix as $\bm{L}\bm{L}^T$, the authors demonstrated that factorized gradient descent can recover $\target$ with sample complexity $m = \Omega(rd_1)$. However, their approach suffers from slow convergence for ill-conditioned matrices due to the dependence of the step size on the condition number of $\target$. Moreover, extending these results to the more general case of non-PSD matrix recovery introduces additional challenges, particularly in balancing the factors $\bm{L}$ and $\bm{R}$ without explicit regularization \citep{chen2020nonconvex2}.




In this paper, we present a theoretical result showing that Riemannian gradient descent (RGD) \citep{RGD} achieves both optimal sample complexity and low computational cost for recovering rectangular low-rank matrices. Specifically, we prove that, when $\mathcal{A}$ is a Gaussian measurement operator, RGD recovers a rank-$r$ matrix with optimal sample complexity $m = \Omega(r(d_1 + d_2))$ and converges linearly at a rate $\rho$ that can be made arbitrarily small. Unlike factorized gradient descent, our approach eliminates the need for additional regularization terms, simplifying both the theoretical analysis and the practical implementation. Furthermore, RGD is computationally efficient, as it parameterizes matrices on the Riemannian manifold with only $\Theta (r(d_1 + d_2))$ variables. By reducing the sample complexity from quadratic to linear dependence on $r$, our work bridges the gap between optimal sample complexity and computational efficiency, establishing RGD as a state-of-the-art method for low-rank matrix recovery.
Table~\ref{table: comparison} provides a summary of the sample complexity $m$ and computational efficiency for representative non-convex methods in low-rank matrix sensing (all quantities are stated up to order $O(\cdot)$). 
The per-iteration computational cost consists of two parts: (1) the common cost of applying $\mathcal{A}^*\mathcal{A}$ (dominated by matrix multiplication, MM), and (2) a method-specific cost, listed in the last column of Table~\ref{table: comparison}.
The latter may include additional MM as well as more expensive operations such as QR decomposition, matrix inversion, and SVD.
% While the $r$-truncated SVD for $d_1 \times d_1$ matrices has the same asymptotic complexity as multiplying a $d_1 \times d_1$ matrix by a $d_1 \times r$ matrix ($O(d_1^2 r)$), in practice, SVD typically incurs much larger constant factors and is significantly slower. 

\begin{table*}[htbp]
    \caption{Comparison of Non-Convex Methods for Low-Rank Matrix Sensing ($d_1 = d_2$).}
        \label{table: comparison}
        \begin{tabular}{c|c|c|c}
            Method & Sample complexity $m$ & Iterations & Extra cost per iteration \\
            \hline\hline
            SVP~\citep{jain2010guaranteed_SVP}, NIHT~\citep{tanner2013normalized} & $d_1r$ & $\log(1/\varepsilon)$ & $d_1^2 r$ (SVD) \\
            \hline
            RGD~\citep{RGD} & $d_1r^2\kappa^2$ & $\log(1/\varepsilon)$ &  $d_1^2 r$ (MM) + $d_1r^2$ (QR) + $r^3$ (SVD)  \\
            \hline
            Scaled GD~\citep{tong_accelerating_nodate} & $d_1r^2\kappa^2$ & $\log(1/\varepsilon)$ &  $d_1^2 r$ (MM) + $r^3$ (Inversion) \\
            \hline
            Factorized GD (PSD only)~\citep{stoger_non-convex_2024} & $d_1r\kappa^2$ & $\kappa^2\log(1/\varepsilon)$ & $d_1^2 r$ (MM) \\
            \hline
            RGD (this paper) & $d_1r\kappa^2$ & $\log(1/\varepsilon)$ &  $d_1^2 r$ (MM) + $d_1r^2$ (QR) + $r^3$ (SVD) \\
            \hline
        \end{tabular}
\end{table*}


The rest of the paper is organized as follows. In \cref{sec: results}, we formulate the non-convex optimization problem for low-rank matrix recovery, describe the Riemannian gradient descent algorithm, and present our main theoretical result, \cref{thm: RGDmain}. \Cref{sec: Proof main} provides the proof of the main theorem, with the Restricted Isometry Property (RIP) and the decoupling technique as key tools. Most technical details are deferred to the Appendix. Finally, we conclude with a discussion of potential directions for future research in \cref{sec: conclude}.

\section{Algorithms and Results}\label{sec: results}
In this section, we first formulate low-rank matrix recovery as a non-convex optimization problem on the Riemannian manifold of all rank-$r$ matrices embedded in $\domain$. We then describe the Riemannian gradient descent algorithm for solving this optimization problem. Finally, we present our main theoretical result.

\subsection{Algorithms}
To recover the rank-$r$ matrix $\target \in \domain$ from its measurement $\RHS = \mathcal{A}(\target)$, we solve the constrained least-squares problem:
\begin{equation}
\begin{aligned}
\min_{\bm{X}\in\domain} & \quad\mathcal{L}(\bm{X}):=\frac{1}{2}\left\|\RHS-\mathcal{A}\left(\bm{X}\right)\right\|_2^2\\
\text { s.t. } & \quad\operatorname{rank}(\bm{X})=r. 
\end{aligned}
\label{eq: PhaseLift trace minimization SDP}
\end{equation}

Solving \eqref{eq: PhaseLift trace minimization SDP} is challenging due to the non-convexity introduced by the low-rank constraint. A common approach to overcome this is to use matrix factorization, parametrizing the low-rank matrix as $\boldsymbol{X}=\boldsymbol{L} \boldsymbol{R}^T$ with $\boldsymbol{L} \in \mathbb{R}^{d_1 \times r}, \boldsymbol{R} \in \mathbb{R}^{d_2 \times r}$. This leads to the following optimization problem:
\begin{equation}\label{factorization formulation}
 \min _{\boldsymbol{L} \in \mathbb{R}^{d_1 \times r}, \boldsymbol{R} \in \mathbb{R}^{d_2 \times r}} \mathcal{L}(\bm{L}\bm{R}^T).   
\end{equation}

However, the factorization $\bm{X}=\bm{L}\bm{R}^T$ is redundant and non-unique. Specifically, $\bm{X}=(\bm{LQ})(\bm{RQ}^{-T})^{T}$ for any invertible $r\times r$ matrix $\bm{Q}$. Because of this invariance, the critical points of $\mathcal{L}$ in parameter space form unbounded sets and are never isolated, which can hamper optimization. To address this issue, some works restrict attention to PSD matrices by setting $\bm{L}=\bm{R}$ \citep{stoger_non-convex_2024}, while others add an imbalance regularization term $\|\bm{L}^T\bm{L}-\bm{R}^T\bm{R}\|_{F}$ to the loss function in \cref{factorization formulation} \citep{tu_low-rank_nodate,ge2017no}. Even with these remedies, the factorization $\bm{L}\bm{R}^T$ can still lead to an ill-conditioned Hessian. To analyze this, assume $\mathcal{A}$ is random and $\mathbb{E}[\mathcal{A}^*\mathcal{A}]=\mathcal{I}$. This assumption holds in many common low-rank matrix recovery problems, such as Gaussian matrix sensing, matrix completion, and quantum state tomography. We then consider the expected loss function in \cref{factorization formulation}, which is $\mathbb{E}[\mathcal{L}(\bm{L}\bm{R}^T)]=\frac{1}{2}\|\bm{L}\bm{R}^T-\target\|_F^2$. The Hessian of $\mathbb{E}[\mathcal{L}]$ with respect to $\bm{L}$ and $\bm{R}$ is given by:
\begin{equation*}
\nabla^2_{(\bm{L},\bm{R})}(\mathbb{E}[\mathcal{L}(\bm{LR}^T)])=
\begin{bmatrix}
(\bm{R}^{T}\bm{R})\otimes\bm{I}_{d_1} 
& 
\bullet
\\
\bullet^{T}
& (\bm{L}^{T}\bm{L})\otimes\bm{I}_{d_2}
\end{bmatrix},
\end{equation*}
where $\bullet=\bm{I}_r\otimes(\bm{LR}^T-\target)+(\bm{R}^T\otimes \bm{L})\bm{K}^{(d_2,r)}$ and
$\bm{K}^{(d_2,r)}$ is the commutation matrix \citep{von1988moments}.
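The blocks above can be checked directly from first-order differentials. The following is a short sketch, writing $\bm{E}:=\bm{L}\bm{R}^T-\target$ (a shorthand introduced only for this derivation) and assuming column-major vectorization, so that $\operatorname{vec}(\bm{A}\bm{B}\bm{C})=(\bm{C}^T\otimes\bm{A})\operatorname{vec}(\bm{B})$ and $\operatorname{vec}(\mathrm{d}\bm{R}^T)=\bm{K}^{(d_2,r)}\operatorname{vec}(\mathrm{d}\bm{R})$.

```latex
\begin{align*}
\nabla_{\bm{L}}\,\mathbb{E}[\mathcal{L}] &= \bm{E}\bm{R},
\qquad
\nabla_{\bm{R}}\,\mathbb{E}[\mathcal{L}] = \bm{E}^T\bm{L}, \\
\mathrm{d}(\bm{E}\bm{R}) &= \mathrm{d}\bm{L}\,\bm{R}^T\bm{R} + \bm{L}\,\mathrm{d}\bm{R}^T\bm{R} + \bm{E}\,\mathrm{d}\bm{R}, \\
\operatorname{vec}\bigl(\mathrm{d}\bm{L}\,\bm{R}^T\bm{R}\bigr)
&= \bigl((\bm{R}^T\bm{R})\otimes\bm{I}_{d_1}\bigr)\operatorname{vec}(\mathrm{d}\bm{L}), \\
\operatorname{vec}\bigl(\bm{L}\,\mathrm{d}\bm{R}^T\bm{R} + \bm{E}\,\mathrm{d}\bm{R}\bigr)
&= \bigl((\bm{R}^T\otimes\bm{L})\bm{K}^{(d_2,r)} + \bm{I}_r\otimes\bm{E}\bigr)\operatorname{vec}(\mathrm{d}\bm{R}).
\end{align*}
```

The lower-right block follows in the same way from $\mathrm{d}(\bm{E}^T\bm{L})$.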
%\bm{I}_r\otimes(\bm{LR}^T-\target)^T+(\bm{L}^T\otimes \bm{R})\bm{K}^{(d_1,r)}
%\bm{I}_r\otimes(\bm{LR}^T-\target)+(\bm{R}^T\otimes \bm{L})\bm{K}^{(d_2,r)}

%\begin{equation}
%    \nabla^2_{\bm{L},\bm{L}}(\mathbb{E}[\mathcal{L}]) = (\bm{R}^{T}\bm{R})\otimes\bm{I}_{d_1},\quad\nabla^2_{\bm{R},\bm{R}}(\mathbb{E}[\mathcal{L}]) = (\bm{L}^{T}\bm{L})\otimes\bm{I}_{d_2}.
%\end{equation}
The condition number of the Hessian depends on those of $\bm{L}$ and $\bm{R}$, which slows convergence and ties the convergence rate to the condition number of $\target$. 
To mitigate this, various approaches have been proposed, including preconditioning in parameter space with the inverse of the block-diagonal part of $\nabla^2_{(\bm{L},\bm{R})}(\mathbb{E}[\mathcal{L}(\bm{LR}^T)])$ \citep{tong_accelerating_nodate},
optimization on quotient Riemannian manifolds \citep{keshavan2009matrix, huang2017solving, zheng2022riemannian},
and optimization directly on the Riemannian manifold of rank-$r$ matrices embedded in $\domain$ \citep{RGD, cai_solving_2018_RGD,hsu_quantum_2024}.
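
To make the dependence on the conditioning of $\target$ concrete, consider, as an illustrative assumption, the balanced factorization of the ground truth with compact SVD $\target=\bm{U}_\star\bm{\Sigma}_\star\bm{V}_\star^T$. The symbols $\bm{U}_\star$, $\bm{\Sigma}_\star$, and $\bm{V}_\star$ are introduced here only for this example.

```latex
\begin{align*}
\bm{L}=\bm{U}_\star\bm{\Sigma}_\star^{1/2},\quad \bm{R}=\bm{V}_\star\bm{\Sigma}_\star^{1/2}
\quad&\Longrightarrow\quad
\bm{L}^T\bm{L}=\bm{R}^T\bm{R}=\bm{\Sigma}_\star, \\
\frac{\lambda_{\max}\bigl((\bm{R}^T\bm{R})\otimes\bm{I}_{d_1}\bigr)}
     {\lambda_{\min}\bigl((\bm{R}^T\bm{R})\otimes\bm{I}_{d_1}\bigr)}
&= \frac{\sigma_1(\target)}{\sigma_r(\target)}.
\end{align*}
```

At this point the diagonal blocks of the Hessian alone already have a condition number equal to that of $\target$, which is what the block-diagonal preconditioner is designed to remove.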

We consider the optimization over the embedded Riemannian manifold of rank-$r$ matrices, which offers several advantages. First, the manifold representation is intrinsic, eliminating redundancy and the need for regularization in factorization-based methods. Second, the embedded manifold lies in $\domain$, where the expected loss function simplifies to $\mathbb{E}[\mathcal{L}(\bm{X})]=\frac{1}{2}\|\bm{X}-\target\|_{F}^2$ and the expected Hessian becomes $\mathcal{I}$, with a perfect condition number. This ensures fast convergence. 
Third, the operator $\mathcal{A}$ acting on matrices in $\domain$ is well-studied, with benign properties such as RIP that can simplify analysis. 
In contrast, its behavior in the parameter space is less understood, requiring additional work to generalize these properties \citep{tong_accelerating_nodate,stoger_non-convex_2024}.
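
For the Gaussian operator in \eqref{eq: operator A}, the form of the expected loss in $\domain$ can be verified in two lines. The sketch below assumes that the entries of each $\bm{A}_i$ are i.i.d. $\mathcal{N}(0,1)$ and writes $\bm{Z}:=\bm{X}-\target$ as a local shorthand, with entries $Z_{jk}$.

```latex
\begin{align*}
\mathbb{E}\bigl[\langle\bm{A}_i,\bm{Z}\rangle^2\bigr]
&= \sum_{j,k}\sum_{j',k'} Z_{jk}Z_{j'k'}\,
   \mathbb{E}\bigl[(\bm{A}_i)_{jk}(\bm{A}_i)_{j'k'}\bigr]
 = \sum_{j,k} Z_{jk}^2 = \|\bm{Z}\|_F^2, \\
\mathbb{E}[\mathcal{L}(\bm{X})]
&= \frac{1}{2}\,\mathbb{E}\,\|\mathcal{A}(\bm{Z})\|_2^2
 = \frac{1}{2}\cdot\frac{1}{m}\sum_{i=1}^m
   \mathbb{E}\bigl[\langle\bm{A}_i,\bm{Z}\rangle^2\bigr]
 = \frac{1}{2}\|\bm{Z}\|_F^2 .
\end{align*}
```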

%Thus, our approach is simpler and more natural.
%RIP matrix 
%replace more analysis to pass the property from matrix to factors space.
 
%The measurement operator $\mathcal{A}$, defined in $\domain$ is well studied. While we need an intermediate variable to connect $\mathcal{A}$ and the factors. The relationship is not explicit and needs more careful analysis. Thus, our analysis is more natural and simple. 


%Third, the measurement operator $\mathcal{A}$ acting on the matrices in the $\domain$, is well studied, and we can leverage existing theoretical results to simplify our analysis.     



% Ai in domain,  make use of easy proof
%This ensures that $\mathcal{L}$ has a nearly isometry Hessian on every tangent space.
%The well-conditioned Hessian leads to a convergence rate independent of $\kappa$,making the algorithm more computationally efficient.


% $$\mathcal{L}(\mathbf{X}):=\frac{1}{4}\left\|\RHS-\mathcal{A}\left(\bm{X}\right)\right\|_2^2=\frac{1}{4}\left\|\mathcal{A}\left(\mathbf{X}_{\star}-\bm{X}\right)\right\|_2^2$$
% $$
% \nabla \mathcal{L}(\mathbf{X})
% $$

% Introduction about Riemanian gradient descent
% Introduce tangent plane
% Introduce Projection operator
% Introduce retraction (here we use hard thresholding)

%Riemannian manifold of all rank-$r$ matrices embedded in $\domain$, which offers several advantages.
%First, no additional balanced regularization term is required in the objective function.
%Second, Riemannian optimization methods are computationally efficient, as matrices on the Riemannian manifold can be parameterized with as few as $O(r(d_1 + d_2))$ parameters.
%Furthermore, it has been shown that these methods are guaranteed to converge to the underlying low-rank matrix with an arbitrarily small convergence rate, comparable to preconditioned factorized gradient descent \citep{RGD, cai_solving_2018_RGD}.
Let 
$
\mathbb{M}_r = \{\bm{X}\in\domain:\operatorname{rank}(\bm{X})=r\}    
$
be the embedded manifold of all rank-$r$ matrices in $\domain$.
For $\bm{X}\in\mathbb{M}_r$ with compact singular value decomposition (SVD) $\bm{X}=\bm{U}\bm{\Sigma}\bm{V}^T$, where $\bm{U}\in\mathbb{R}^{d_1\times r}$ and $\bm{V}\in\mathbb{R}^{d_2\times r}$ have orthonormal columns,
the tangent space at $\bm{X}$ is
$$
\mathbb{T}_{\bm{X}}:=\left\{\bm{U} \bm{R}^T + \bm{L} \bm{V}^T: \bm{L} \in \mathbb{R}^{d_1 \times r},\bm{R} \in \mathbb{R}^{d_2 \times r}  \right\}.   
$$
The orthogonal projection $\mathcal{P}_{\mathbb{T}_{\bm{X}}}: \domain \rightarrow \mathbb{T}_{\bm{X}}$  has the closed-form expression
$$
\mathcal{P}_{\mathbb{T}_{\bm{X}}}(\bm{Z})= \bm{U} \bm{U}^T \bm{Z} + \bm{Z} \bm{V} \bm{V}^T - \bm{U} \bm{U}^T \bm{Z} \bm{V} \bm{V}^T.
$$
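One way to see that this expression lands in $\mathbb{T}_{\bm{X}}$ is to regroup its terms. In the sketch below, $\bm{L}_{\bm{Z}}$ and $\bm{R}_{\bm{Z}}$ are names introduced only for illustration.

```latex
\begin{align*}
\mathcal{P}_{\mathbb{T}_{\bm{X}}}(\bm{Z})
&= \bm{U}\bm{R}_{\bm{Z}}^T + \bm{L}_{\bm{Z}}\bm{V}^T,
\qquad
\bm{R}_{\bm{Z}} := \bm{Z}^T\bm{U},
\quad
\bm{L}_{\bm{Z}} := (\bm{I}_{d_1}-\bm{U}\bm{U}^T)\bm{Z}\bm{V}, \\
\bm{Z} - \mathcal{P}_{\mathbb{T}_{\bm{X}}}(\bm{Z})
&= (\bm{I}_{d_1}-\bm{U}\bm{U}^T)\,\bm{Z}\,(\bm{I}_{d_2}-\bm{V}\bm{V}^T).
\end{align*}
```

The residual in the second line is orthogonal, in the trace inner product, to every matrix of the form $\bm{U}\bm{R}^T+\bm{L}\bm{V}^T$, and only the two thin products $\bm{Z}^T\bm{U}$ and $\bm{Z}\bm{V}$ involve the full matrix $\bm{Z}$.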
Then the constrained least-squares problem \cref{eq: PhaseLift trace minimization SDP} becomes $\min_{\bm{X}\in\mathbb{M}_r} \mathcal{L}(\bm{X})$.
We solve it using Riemannian gradient descent (RGD) \citep{absil2008optimization,vandereycken_low-rank_2013_RGD}:
\begin{equation}\label{RGD iterative sequence}
    \bm{X}_{t+1} = \mathcal{H}_r\bigl(\bm{X}_t-\mu\mathcal{P}_{\mathbb{T}_{\bm{X}_t}}\mathcal{A}^*(\mathcal{A}(\bm{X}_t) - \bm{b})\bigr), \qquad \forall t\in\mathbb{N},
\end{equation}
where:
\begin{itemize}
    \item $\mathcal{H}_r(\cdot)$ is the hard thresholding operator, which serves as a retraction. It is defined via the $r$-truncated SVD, $\mathcal{H}_r(\bm{Z}):= \sum_{i=1}^r\sigma_i\bm{u}_i\bm{v}_i^T$, where $\bm{Z}=\sum_i\sigma_i\bm{u}_i\bm{v}_i^T$ is an SVD of $\bm{Z}$ with $\sigma_1\geq\sigma_2\geq\cdots$,
    \item $\mu$ is the step size, and
    \item $\mathcal{P}_{\mathbb{T}_{\bm{X}_t}}\mathcal{A}^*(\mathcal{A}(\bm{X}_t) - \bm{b})$ is the Riemannian gradient of $\mathcal{L}(\bm{X})$ at $\bm{X}_t$.  
\end{itemize}
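
To connect the update \cref{RGD iterative sequence} with the loss, the sketch below spells out the Euclidean gradient (using $\bm{b}=\mathcal{A}(\target)$) and, as a heuristic only, what one step with $\mu=1$ does in expectation. The second line assumes $\mathbb{E}[\mathcal{A}^*\mathcal{A}]=\mathcal{I}$ and treats $\bm{X}_t$ as fixed, ignoring the statistical dependence between $\bm{X}_t$ and $\mathcal{A}$.

```latex
\begin{align*}
\nabla\mathcal{L}(\bm{X}_t)
&= \mathcal{A}^*\bigl(\mathcal{A}(\bm{X}_t)-\bm{b}\bigr)
 = \mathcal{A}^*\mathcal{A}(\bm{X}_t-\target), \\
\mathbb{E}\bigl[\bm{X}_t-\mathcal{P}_{\mathbb{T}_{\bm{X}_t}}\mathcal{A}^*\mathcal{A}(\bm{X}_t-\target)\bigr]
&= \bm{X}_t-\mathcal{P}_{\mathbb{T}_{\bm{X}_t}}(\bm{X}_t-\target)
 = \mathcal{P}_{\mathbb{T}_{\bm{X}_t}}(\target).
\end{align*}
```

The last equality uses $\mathcal{P}_{\mathbb{T}_{\bm{X}_t}}(\bm{X}_t)=\bm{X}_t$. In this idealized picture, a unit step moves the iterate to the projection of $\target$ onto the current tangent space before the retraction $\mathcal{H}_r$ is applied.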

The computational cost per iteration of \cref{RGD iterative sequence} is low. Aside from applying $\mathcal{A}$ and $\mathcal{A}^*$, the most expensive operations are $\mathcal{H}_r$ and $\mathcal{P}_{\mathbb{T}_{\bm{X}_t}}$.
Since $\bm{X}_t$ can be stored in a compact SVD form as $\bm{X}_t = \bm{U}_t \bm{\Sigma}_t \bm{V}_t^T$, 
computing $\mathcal{P}_{\mathbb{T}_{\bm{X}_t}}$ requires only $O(r)$ matrix-vector products.
Moreover, in \cref{RGD iterative sequence}, $\mathcal{H}_r$ is applied to a matrix $\bm{W}_t$ in $\mathbb{T}_{\bm{X}_t}$, which has rank at most $2r$. As shown by \citet{RGD}, $\mathcal{H}_r(\bm{W}_t)$ can be computed efficiently using two QR decompositions of tall matrices with $r$ columns, one SVD of a $2r\times 2r$ matrix, and a few matrix-vector products.
Thus, the per-iteration computational cost of RGD is of the same order as that of gradient descent based on factorization or the quotient Riemannian manifolds. Moreover, RGD achieves a more favorable convergence rate that is independent of the condition number of the ground truth matrix and can be arbitrarily small. This results in fewer iterations to reach the target accuracy, as demonstrated in our theoretical results.
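
The reduction to a small SVD mentioned above can be sketched as follows. This is a reconstruction under stated assumptions rather than a verbatim account of \citep{RGD}: the names $\bm{G}_t$, $\bm{Y}_1$, $\bm{Y}_2$, $\bm{Q}_1$, $\bm{Q}_2$, $\bm{R}_1$, $\bm{R}_2$, and $\bm{M}_t$ are introduced here, $\bm{Q}_i\bm{R}_i$ denotes a thin QR factorization, and $\bm{Y}_1,\bm{Y}_2$ are assumed to have full column rank so that $\bm{U}_t^T\bm{Q}_1=\bm{0}$ and $\bm{V}_t^T\bm{Q}_2=\bm{0}$.

```latex
\begin{align*}
\bm{G}_t &:= \mu\,\mathcal{A}^*\bigl(\mathcal{A}(\bm{X}_t)-\bm{b}\bigr), \\
\bm{Y}_1 &:= (\bm{I}_{d_1}-\bm{U}_t\bm{U}_t^T)\bm{G}_t\bm{V}_t = \bm{Q}_1\bm{R}_1,
\qquad
\bm{Y}_2 := (\bm{I}_{d_2}-\bm{V}_t\bm{V}_t^T)\bm{G}_t^T\bm{U}_t = \bm{Q}_2\bm{R}_2, \\
\bm{W}_t &= \bm{U}_t(\bm{\Sigma}_t-\bm{U}_t^T\bm{G}_t\bm{V}_t)\bm{V}_t^T
          - \bm{Y}_1\bm{V}_t^T - \bm{U}_t\bm{Y}_2^T \\
&= \begin{bmatrix}\bm{U}_t & \bm{Q}_1\end{bmatrix}
   \underbrace{\begin{bmatrix}
     \bm{\Sigma}_t-\bm{U}_t^T\bm{G}_t\bm{V}_t & -\bm{R}_2^T \\
     -\bm{R}_1 & \bm{0}
   \end{bmatrix}}_{=:\,\bm{M}_t\,\in\,\mathbb{R}^{2r\times 2r}}
   \begin{bmatrix}\bm{V}_t^T \\ \bm{Q}_2^T\end{bmatrix}.
\end{align*}
```

Since both outer factors have orthonormal columns, an SVD of $\bm{M}_t$ yields an SVD of $\bm{W}_t$, and keeping its top $r$ singular triplets gives $\mathcal{H}_r(\bm{W}_t)$ at a cost of $O(r^3)$ beyond the two QR factorizations.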

%By utilizing this property, a

Because the problem is non-convex, a good initialization $\bm{X}_0$ is also needed. We use the spectral initialization of \citet{jain2013low}, setting $\bm{X}_0 = \mathcal{H}_r(\mathcal{A}^*(\bm{b}))$, where $\mathcal{A}^*: \mathbb{R}^m \rightarrow \domain$ is the adjoint operator of $\mathcal{A}$. Spectral initialization is a natural and common choice since $\mathbb{E}[\mathcal{A}^*(\bm{b})] = \target$ and the operator $\mathcal{H}_r$ extracts the rank-$r$ structure.
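
The identity $\mathbb{E}[\mathcal{A}^*(\bm{b})] = \target$ can be checked from the explicit form of the adjoint implied by \eqref{eq: operator A}. The sketch below assumes Gaussian measurement matrices with i.i.d. $\mathcal{N}(0,1)$ entries and $\bm{b}=\mathcal{A}(\target)$; the generic vector $\bm{y}$ is introduced only to state the adjoint.

```latex
\begin{align*}
\mathcal{A}^*(\bm{y}) &= \frac{1}{\sqrt{m}}\sum_{i=1}^m y_i\bm{A}_i
\quad\text{for } \bm{y}\in\mathbb{R}^m, \\
\mathbb{E}\bigl[\mathcal{A}^*(\bm{b})\bigr]
&= \frac{1}{m}\sum_{i=1}^m
   \mathbb{E}\bigl[\langle\bm{A}_i,\target\rangle\bm{A}_i\bigr]
 = \target .
\end{align*}
```

The last step holds entrywise: the $(j,k)$ entry of $\mathbb{E}[\langle\bm{A}_i,\target\rangle\bm{A}_i]$ only retains the term in which $(\bm{A}_i)_{jk}$ is paired with itself, which contributes the $(j,k)$ entry of $\target$.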

We summarize our algorithm in \cref{alg: RGD}. For simplicity, we write $\mathbb{T}_t$ and $\mathcal{P}_{\mathbb{T}_t}$ for $\mathbb{T}_{\bm{X}_t}$ and $\mathcal{P}_{\mathbb{T}_{\bm{X}_t}}$, respectively.
%$\mathbb{T}_{\target}$, 
\begin{algorithm}
\caption{Riemannian Gradient Descent (RGD) for Low-Rank Matrix Recovery}
\SetKwInput{KwInput}{Input}
\KwInput{Measurement operator $\mathcal{A}: \domain \rightarrow \mathbb{R}^m$, observations $\bm{b} \in \mathbb{R}^m$, target rank $r$, step size $\mu>0$}
\textbf{Stage 1 (Spectral Initialization):}
Define the initialization $\bm{X}_0 \in \domain$ as
$$\bm{X}_0=\mathcal{H}_r(\mathcal{A}^*(\bm{b})).$$
\textbf{Stage 2 (Iteration):}
\For{$t=0,1,2, \ldots$}{
\begin{flalign*}
\bm{W}_t & = \bm{X}_t - \mu \mathcal{P}_{\mathbb{T}_t}\mathcal{A}^*(\mathcal{A}(\bm{X}_t) - \bm{b}), \\
\bm{X}_{t+1}&= \mathcal{H}_r(\bm{W}_t).&
\end{flalign*}
}
\label{alg: RGD}
\end{algorithm}
% If $\sigma_r(\bm{Z}) = \sigma_{r+1}(\bm{Z})$, $\mathcal{H}_r$ is not uniquely defined.
% But during the iteration, all $\bm{W}_t$ are close to $\target$ and always have a positive spectral gap with high probability, which will be clearly shown later. 
%Similar to factorized gradient descent,During the iteration, we need to compute the singular value decomposition $\mathcal{H}_r$, which seems to be computationally expensive since the SVD on the full matrix in $\domain$ needs $O(d_1^3)$ floating point operations (flops) when $d_2$ is proportional to $d_1$. 
% However, $\mathcal{H}_r(\bm{W}_t)$ can be implemented efficiently since $\bm{W}_t$ is at most rank $2r$ and the SVD of $\bm{W}_t$ can be computed from the SVD of a smaller size matrix using $O\left(r^3\right)$ flops.
 %To see this, we denote the intermediate $\bm{G}_t := \mu\mathcal{A}^*\mathcal{A}(\bm{X}_t - \target)$, then
% $$
% \begin{aligned}
% \bm{W}_t &= \bm{X}_t - \mathcal{P}_{\mathcal{T}_t}\bm{G}_t = \bm{U}_t \bm{\Lambda}_t \bm{V}_t^T - ( \bm{U}_t \bm{U}_t^T \bm{G}_t + \bm{G}_t \bm{V}_t \bm{V}_t^T - \bm{U}_t \bm{U}_t^T \bm{G}_t \bm{V}_t \bm{V}_t^T) \\
% & =  \bm{U}_t \bm{\Lambda}_t \bm{V}_t^T -  \bm{U}_t \bm{U}_t^T \bm{G}_t\bm{V}_t \bm{V}_t^T - (\bm{I} - \bm{U}_t \bm{U}_t^T)\bm{G}_t \bm{V}_t \bm{V}_t^T - \bm{U}_t \bm{U}_t^T \bm{G}_t (\bm{I} - \bm{V}_t \bm{V}_t^T) \\
% & = \bm{U}_t (\bm{\Lambda}_t - \bm{U}_t^T \bm{G}_t\bm{V}_t)\bm{V}_t^T -  \bm{Y}_1 \bm{V}_t^T - \bm{U}_t \bm{Y}_2^T.
% \end{aligned}
% $$
% Let \(\bm{Y}_1=\bm{Q}_1 \bm{R}_1\) and \(\bm{Y}_2=\bm{Q}_2 \bm{R}_2\) be the QR factorizations of \(\bm{Y}_1\) and \(\bm{Y}_2\) respectively. Then we have \(\bm{U}_t^T \bm{Q}_1=\bm{0}, \bm{V}_t^T \bm{Q}_2=\bm{0}\) and \(\bm{W}_t\) can be rewritten as
% $$
% \begin{aligned}
%     \bm{W}_t &= \bm{U}_t \bm{\Lambda}_t \bm{V}_t^T -  \bm{Q}_1 \bm{R}_1  \bm{V}_t^T - \bm{U}_t  \bm{R}_2^T \bm{Q}_2^T\\
%      & =\left[\begin{array}{ll}
%         \bm{U}_t & \bm{Q}_1
%         \end{array}\right]\left[\begin{array}{cc}
%         \bm{\Lambda}_t - \bm{U}_t^* \bm{G}_t \bm{V}_t & \bm{R}_2^T \\
%         \bm{R}_1 & 0
%         \end{array}\right]\left[\begin{array}{c}
%         \bm{V}_t^T \\
%         \bm{Q}_2^T
%         \end{array}\right] \\
%         & :=\left[\begin{array}{ll}
%         \bm{U}_t & \bm{Q}_1
%         \end{array}\right] \bm{M}_t\left[\begin{array}{c}
%         \bm{V}_t^T \\
%         \bm{Q}_2^T
%         \end{array}\right],
% \end{aligned}
% $$
% where $\bm{M}_t$ is a $2 r \times 2 r$ matrix. Since $\left[\begin{array}{ll}\bm{U}_t & \bm{Q}_2\end{array}\right]$ and $\left[\begin{array}{ll}\bm{V}_t & \bm{Q}_1\end{array}\right]$ are both orthogonal matrices, the SVD of $\bm{W}_t$ can be obtained from the SVD of $\bm{M}_t$, which can be computed using $O\left(r^3\right)$ floating point operations (flops) instead of $O\left(d_1^3\right)$ flops. 
%The gradient descent stage iteratively updates the optimization variable $\mathbf{X}_t$ by projecting the gradient of the loss function onto the tangent space of $\bm{X}_t$ and then applying the hard thresholding operator to enforce the rank constraint.  
%The algorithm represents the rank-$r$ matrix as an element on the Riemannian manifold embedded in $\domain$, whose dimension is $(d_1+d_2-r)r$. %\citep{vandereycken_low-rank_2013_RGD}
%The algorithm consists of two stages: spectral initialization and gradient descent. 
%The spectral initialization computes the truncated eigendecomposition of the data matrix $\mathbf{D}:=\mathcal{A}^*(\RHS)=\frac{1}{\sqrt{m}} \sum_{i=1}^m y_i \mathbf{A}_i$ and initializes the optimization variable $\mathbf{X}_0$ as the hard thresholding of $\mathbf{D}$. 
%To describe the algorithm more formally, we introduce some notation. Let \(\bm{X} \in \domain\) be a rank-$r$ matrix.
%Let \(\bm{X}= \bm{U} \bm{\Lambda} \bm{V}^T\) be the reduced singular value decomposition with \(\bm{U} \in \mathbb{R}^{d_1 \times r}, \bm{\Lambda}  \in \mathbb{R}^{r \times r}\), $\bm{V} \in  \mathbb{R}^{d_2 \times r}$. We define the tangent space at \(\bm{X}\) as
%    $$
%    \mathcal{T}_{\bm{X}} \domain:=\left\{\bm{Z} \in \mathbb{R}^{d_1 \times d_2}: \bm{Z}=\bm{U} \bm{R}^T + \bm{L} \bm{V}^T \text { with } \bm{R} \in \mathbb{R}^{d_2 \times r},  \bm{L} \in \mathbb{R}^{d_1 \times r}\right\}.
%    $$
%   We use hard thresholding, a type of retraction operator that maps a matrix to the closest matrix of rank \(r\): \(\mathcal{H}_r(\cdot)\) computes the singular value decomposition of a matrix and then sets all but the \(r\) largest singular values to zero, \(\mathcal{H}_r(\bm{Z}):=\bm{U}_r \bm{\Sigma}_r \bm{V}_r^T \quad\) where \(\quad \bm{\Sigma}_r(i, i):= \begin{cases}\bm{\Sigma}(i, i) & i \leq r \\ 0 & i>r .\end{cases}\).
\subsection{Main Result}
The main result of this paper provides a recovery guarantee for Algorithm \ref{alg: RGD} with optimal sample complexity. We first define the condition number of $\mathbf{X}_{\star}$ as
\[
\kappa := \frac{\left\|\mathbf{X}_{\star}\right\|_2}{\sigma_{\min}\left(\mathbf{X}_{\star}\right)},
\]
where $\|\cdot\|_2$ is the spectral norm (also called 2-norm) for matrices, and $\sigma_{\min}\left(\mathbf{X}_{\star}\right) := \sigma_r(\target)$ is the smallest non-zero singular value of $\target$. We call $\mathcal{A}$ a Gaussian measurement operator when the measurement matrices $\{\bm{A}_i\}_{i=1}^{m}$ in \eqref{eq: operator A} have i.i.d. entries drawn from $\mathcal{N}(0,1)$. Our main theorem is stated as follows:
\begin{thm}
    Let $\mathcal{A}$ be a Gaussian measurement operator.
    Let $\mathbf{X}_{\star} \in \domain$ be a rank-$r$ matrix and $\RHS = \mathcal{A}\left(\mathbf{X}_{\star}\right) \in \mathbb{R}^m$. Let $\{\bm{X}_t\}_{t\in\mathbb{N}}$ be the sequence generated by Algorithm \ref{alg: RGD} with step size $\mu = 1$. Then, for any $\rho \in (0,1)$, there exists a constant $C>0$ depending only on $\rho$ such that if the number of measurements $m$ satisfies
    \[
    m \geq C \kappa^2 r (d_1 + d_2),
    \]
    then, with probability at least $1 - 7 \exp(-(d_1 + d_2))$, it holds for all iterations $t \geq 0$ that
    \begin{equation}\label{eq: mainconv}
    \|\bm{X}_t - \target\|_F \leq \sqrt{2r} \rho^{t} \sigma_{\min}\left(\mathbf{X}_{\star}\right).
    \end{equation}
    \label{thm: RGDmain}
\end{thm}
%Most existing theoretical guarantees for non-convex algorithms require a sample complexity of $m = \Omega(\kappa^2 r^2 (d_1 + d_2))$, which is suboptimal with respect to $r$. Although \citep{stoger_non-convex_2024} achieves a sample complexity of $m = \Omega(\kappa^2 r d_1)$, it relies on the stringent assumption of positive semidefinite (PSD) matrices and suffers from slow convergence in ill-conditioned cases. In contrast, o
The proof of the theorem is deferred to \cref{sec: Proof main}. Our result attains optimal sample complexity and high computational efficiency. The key advantages of our approach are:
\begin{itemize}
    %\item The constant $C$ depends only on $\rho$, enabling optimal sample complexity without requiring the PSD assumption in \citep{stoger_non-convex_2024}.
    \item The constant $C$ in \cref{thm: RGDmain} depends only on the convergence rate $\rho$, so the required number of measurements is $m = \Omega(\kappa^2 r (d_1 + d_2))$, which is linear in $r$ and therefore optimal in its dependence on $r$ and the dimensions. Importantly, this result does not require the positive semidefinite (PSD) assumption on $\target$, a key limitation of \citet{stoger_non-convex_2024}, whose analysis relies on the PSD structure to derive a sample complexity of $m = \Omega(\kappa^2 r d_1)$. By contrast, our approach applies to general rectangular matrices, which broadens the class of problems that can be addressed.
    \item The convergence rate $\rho$ in \cref{thm: RGDmain} can be made arbitrarily small by choosing a sufficiently large $C$. Thus, our method achieves $\varepsilon$-accuracy for $\|\bm{X}_t - \target\|_F$ in $O\left(\log \left( \sqrt{r} \sigma_{\min}\left(\mathbf{X}_{\star}\right) \varepsilon^{-1} \right)\right)$ iterations. In contrast, the step size in \citep{stoger_non-convex_2024} is $O((\kappa \|\target\|_2)^{-1})$, leading to a convergence rate of $1 - O(\kappa^{-2})$. This results in $O\left(\kappa^2 \log \left( \sqrt{r} \sigma_{\min}\left(\mathbf{X}_{\star}\right) \varepsilon^{-1} \right)\right)$ iterations to achieve $\varepsilon$-accuracy for $\|\bm{L}_t \bm{L}_t^T - \target\|_F$, where $\bm{L}_t \bm{L}_t^T$ corresponds to $\bm{X}_t$ in our setting. Our method is significantly more efficient, particularly for ill-conditioned matrices.
\end{itemize}
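
The iteration counts quoted above follow by solving \eqref{eq: mainconv} for $t$. In the sketch below, $c\in(0,1)$ is a generic constant standing in for the $O(\kappa^{-2})$ term, and the numerical values in the last line are illustrative assumptions rather than results from this paper.

```latex
\begin{align*}
\sqrt{2r}\,\rho^{t}\,\sigma_{\min}(\target)\le\varepsilon
\quad&\Longleftrightarrow\quad
t \ge \frac{\log\bigl(\sqrt{2r}\,\sigma_{\min}(\target)\,\varepsilon^{-1}\bigr)}{\log(1/\rho)}, \\
\rho = 1-\frac{c}{\kappa^2}
\quad&\Longrightarrow\quad
\frac{1}{\log(1/\rho)} \le \frac{\kappa^2}{c}, \\
\rho=\tfrac12,\ r=8,\ \sigma_{\min}(\target)/\varepsilon = 10^{6}:
\quad& t \ge \frac{\log(4\cdot 10^{6})}{\log 2} \approx 21.9,
\ \text{so } t = 22 \text{ iterations suffice}.
\end{align*}
```

The second line uses $\log(1/\rho) = -\log(1-c\kappa^{-2}) \ge c\kappa^{-2}$, which is where the extra factor $\kappa^2$ in the iteration count of a method with rate $1-O(\kappa^{-2})$ comes from.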



