\documentclass{article}

\usepackage{neurips_2024}

\usepackage{caption}
\usepackage{subcaption}

\usepackage{graphicx}  
\usepackage{nicefrac}   
\usepackage{float}      
\usepackage{amsmath}    
\usepackage{amsthm}     
\usepackage{amsfonts}   
\usepackage{siunitx}    
\usepackage{url}        
\usepackage{xcolor}     
\usepackage{booktabs}   
\usepackage{microtype}  
\usepackage{hyperref}   

\usepackage{amssymb} 

\newcommand{\R}{\mathbb{R}}
\newcommand{\be}{\begin{equation}}
\newcommand{\ee}{\end{equation}}
\def \st {\operatorname*{subject\ to\ }}
\newcommand{\Mcal}{\mathcal{M}}
\newcommand{\Pcal}{\mathcal{P}}
\DeclareMathOperator*{\argmin}{\text{argmin}}
\DeclareMathOperator{\diag}{diag}
\newcommand{\calP}{\mathcal{P}}
\newcommand{\iprod}[2]{\left \langle #1, #2 \right \rangle }
\usepackage[algo2e, vlined,ruled]{algorithm2e}
\newcommand{\grad}{{\rm grad}}

\newtheorem{thm}{Theorem}
\newtheorem{lem}{Lemma}
\newtheorem{cor}{Corollary}
\newtheorem{prop}{Proposition}
\newtheorem{defi}{Definition}
\newtheorem{eg}{Example}
\newtheorem{prob}{Problem}
\newtheorem{obs}{Observation}
\newtheorem{assum}{Assumption}
\newtheorem{cond}{Condition}
\newtheorem{update}{Updating rule}
\newtheorem{fact}{Fact}
\newcommand{\revise}[1]{{\color{blue}#1}}

\title{Retraction-free optimization over the Stiefel manifold with application to the LoRA fine-tuning}

\author{%
  Yuan Zhang \\
  Peking University \\ 
  \texttt{zy1002@stu.pku.edu.cn}
  \And
  Jiang Hu\thanks{Corresponding author} \\
  University of California, Berkeley \\
  \texttt{hujiangopt@gmail.com}
  \And
  Jiaxi Cui \\
  DataTager \\
  \texttt{jiaxicui446@gmail.com}
  \And 
  Lin Lin \\
  University of California, Berkeley \\ \texttt{linlin@math.berkeley.edu} \And
  Zaiwen Wen \\ 
  Peking University\\
  \texttt{wenzw@pku.edu.cn} \And 
  Quanzheng Li \\
  Massachusetts General Hospital and Harvard Medical School\\
\texttt{li.quanzheng@mgh.harvard.edu}
}
% Add the footnote content here
% \renewcommand{\thefootnote}{\fnsymbol{footnote}}


\begin{document}

\maketitle

% \footnotetext[1]{Center for Data Science, Peking University, China}
% \footnotetext[2]{Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA}
% \footnotetext[3]{DataTager}
% \footnotetext[4]{Corresponding author}
% \footnotetext[5]{Beijing International Center for Mathematical Research, Center for Machine
% Learning Research and Changsha Institute for Computing and Digital Economy, Beijing, China}

\begin{abstract}
  Optimization over the Stiefel manifold has played a significant role in various machine learning tasks. Many existing algorithms either use a retraction operator to keep each iterate on the manifold, or solve an unconstrained quadratic penalized problem. The retraction operator in the former corresponds to orthonormalization of matrices and can be computationally costly for large-scale matrices. The latter approach usually requires an unknown, sufficiently large penalty parameter. To address these issues, we propose a retraction-free and penalty parameter-free algorithm, which lands on the manifold. A key component of the analysis is the convex-like property of the quadratic penalty of the Stiefel manifold, which enables us to explicitly characterize the penalty parameter. As an application, we introduce a new algorithm, Manifold-LoRA, which employs the landing technique and a carefully designed step size strategy to accelerate low-rank adaptation (LoRA) in fine-tuning large language models. Numerical experiments on benchmark datasets demonstrate the efficiency of our proposed method. 
\end{abstract}


\section{Introduction}
Optimization over the Stiefel manifold has attracted considerable attention in the context of machine learning, e.g., RNN \cite{arjovsky2016unitary}, batch normalization \cite{cho2017riemannian}, and  distributionally robust optimization \cite{chen2017robust}. The mathematical formulation of this class of problems is:
\be \label{prob} \min_{X \in \R^{d\times r}} \;\; f(X)\;\; \st \;\; X \in {\rm St}(d,r):=\{ X\in \R^{d\times r}: X^\top X = I_r \}, \ee
where $r \leq d$ and $f:\R^{d\times r} \rightarrow \R$ is a continuously differentiable function. The most popular methods for solving \eqref{prob} are retraction-based algorithms, which have been extensively studied in the context of manifold optimization \cite{absil2008optimization,hu2020brief,boumal2023introduction}. Recently, to alleviate the possible computational burden of the retraction operator, some retraction-free methods have been developed in \cite{gao2018new,gao2022orthogonalization,xiao2024dissolving,ablin2022fast}. The ideas in these papers are based on a combination of the manifold geometry and a penalty function for the manifold constraint, which involves an unknown but sufficiently large penalty parameter. For large-scale machine learning applications, retraction-free algorithms are preferred. However, designing retraction-free algorithms with a known penalty parameter for solving \eqref{prob} remains a challenge.

Another motivation for studying retraction-free methods arises from their application to the fine-tuning of large language models (LLMs). Recently, LLMs have revolutionized the field of natural language processing (NLP), achieving unprecedented performance across various applications \cite{radford2019language, qin2023chatgpt}. To tailor pretrained LLMs to specific downstream tasks, the most common approach is full fine-tuning, which requires prohibitively large computational resources because all model weights must be adapted, hindering the deployment of large models. As a result, parameter-efficient fine-tuning (PEFT) has gained widespread attention for requiring few trainable parameters while delivering results comparable or even superior to full fine-tuning. This paradigm involves inserting learnable modules or designating only a small portion of the weights as trainable while keeping the main model frozen \cite{houlsby2019parameter, li2021prefix, zaken2021bitfit}. Among these methods, low-rank adaptation (LoRA) \cite{hu2021lora} has become the de facto standard. It assumes that the change in weights has a ``low intrinsic dimension'', thereby modelling the update $\Delta W \in \R^{d\times m}$ as the product of two matrices $A \in \R^{r\times m}$ and $B\in \R^{d\times r}$ of rank at most a small integer $r$, i.e., $\Delta W = BA$. Since $r \ll d$, the requirements on both storage and computation are significantly reduced. Because of this factorized form, the representation of $\Delta W$ is redundant. Traditional optimization methods for LoRA do not exploit this redundancy, which consequently undermines model performance. Instead, we reformulate LoRA fine-tuning as an optimization problem over the product of Stiefel manifolds and Euclidean spaces, and propose an algorithmic framework called Manifold-LoRA to accelerate the fine-tuning process and enhance model performance. 
Moreover, by exploiting projected gradients and incorporating a parameter-free penalty, the overhead that our method incurs is negligible. Our contributions are as follows:
\begin{itemize}
    \item We first prove the existence of an explicit choice for the penalty parameter by establishing a strong convexity-like condition for the nonconvex penalty problem associated with the Stiefel manifold constraint. Furthermore, for the given penalty parameter and under mild conditions, we prove that the iterates of our proposed retraction-free gradient descent method eventually land on the Stiefel manifold and reach a stationary point of \eqref{prob}. 
    \item Building upon the established landing theory of the retraction-free and penalty parameter-free method and the AdamW framework, we propose a new method, Manifold-LoRA, which employs a carefully designed step size strategy to accelerate the training process of fine-tuning. Compared with the conventional AdamW method, we use the penalized gradient instead of the usual gradient, and the computational overhead is negligible.

    
    \item Numerical experiments are conducted on a wide range of NLP tasks, demonstrating the efficiency of our algorithm. Specifically, compared to the vanilla LoRA, our Manifold-LoRA with half the trainable parameters not only delivers fast convergence but also yields improved generalization. In particular, our method converges twice as fast as baseline methods on several typical datasets, including SQuAD 2.0 and CoLA.
\end{itemize}
\subsection{Related Work} 
\textbf{Optimization over the Stiefel manifold.} Optimization over the Stiefel manifold has attracted considerable attention due to its broad applications. Using a retraction, which generalizes the exponential map, Riemannian gradient descent has been proposed \cite{absil2008optimization,boumal2023introduction,hu2020brief}, where all iterates lie on the manifold. For cases in which the retraction is computationally costly, the authors of \cite{gao2018new} develop a retraction-free algorithm based on the augmented Lagrangian method. More recently, by defining the constraint dissolving operator and adding a sufficiently large penalty term, the authors of \cite{xiao2024dissolving} convert the manifold-constrained problem \eqref{prob} into an unconstrained problem and then apply unconstrained optimization algorithms. In \cite{ablin2022fast}, motivated by the convergence of Oja's flow, a landing flow consisting of the projected gradient and the gradient of the penalty function is developed as a retraction-free method for the square case of the Stiefel manifold, i.e., $d=r$. All of these methods rely on an unknown penalty parameter to ensure convergence. This motivates us to design penalty parameter-free algorithms, which could significantly reduce the need for parameter tuning in practical implementations. 

\textbf{LoRA.} There are numerous variants of LoRA aiming to improve performance or reduce memory usage. AdaLoRA \cite{zhang2023adaptive}, a well-known successor, introduces the idea of adaptively adjusting the rank of different layers by incorporating an additional vector $\boldsymbol{g}$ to serve as the diagonal of a singular value matrix. This approach leverages a revised sensitivity-based importance measure to decide whether to disable entries in vector $\boldsymbol{g}$ and in matrices $A$ and $B$. A similar work, SoRA \cite{ding2023sparse}, adopts the same model architecture as AdaLoRA, but proposes a different way to update vector $\boldsymbol{g}$ after training. This update rule is the proximal gradient of $\mathcal{L}_1$ loss, acting as a post-pruning method. Additionally, a recently emerged method called VeRA \cite{kopiczko2023vera} significantly reduces memory overhead while maintaining competitive performance. Based on the idea that networks with random initialization contain subnetworks that are near-optimal or optimal \cite{frankle2018lottery}, VeRA only uses two frozen low-rank matrices shared by all layers, training scaling vectors unique to each layer. Although LoRA has gained significant popularity and various variants have been developed, the potential for efficient training through leveraging the manifold geometry to reduce redundancy has not been well-explored.

\subsection{Notation} For a matrix $X \in \R^{d\times r}$, we use $\|X\|$ to denote its Frobenius norm. For a square matrix $A \in \R^{d \times d}$, we define ${\rm sym}(A) = \frac{A + A^\top}{2}$ and use ${\rm diag}(A) \in \R^d$ to denote the vector of its diagonal entries. For two matrices $X, Y \in \R^{d\times r}$, we use $\iprod{X}{Y}:=\sum_{i=1}^d \sum_{j=1}^r X_{ij}Y_{ij}$ to denote their Euclidean inner product. For a differentiable function $f:\R^{d\times r} \rightarrow \R$, we use $\nabla f(X)$ to denote its Euclidean gradient at $X$. 

% \textbf{Manifold Learning} 

% \textbf{Low-rank Optimization}

\section{Retraction-free and penalty parameter-free optimization over the Stiefel manifold} \label{sec:landing}
In this section, we focus on the design of retraction-free and penalty parameter-free algorithms for solving problem \eqref{prob}. We will first present the retraction-free algorithm and then show how the penalty parameter can be explicitly determined by characterizing the landscape of the penalty function. 

\subsection{Retraction-free algorithms}
Inspired by the retraction-free algorithms \cite{gao2018new, xiao2024dissolving, ablin2022fast}, we consider the following retraction-free gradient descent method for problem \eqref{prob}:
\be \label{eq:grad-it} X_{k+1} = X_k - \alpha \grad f(X_k) - \mu X_k(X_k^\top X_k - I_r), \ee
where $\alpha, \mu >0$ are step sizes and the projected gradient is $\grad f(X_k) := \nabla f(X_k) - X_k{\rm sym}(X_k^\top \nabla f(X_k))$. Note that the tangent space of ${\rm St}(d,r)$ at $X_k$ is $T_{X_k} {\rm St}(d,r) := \{ \xi \in \R^{d\times r}: X_k^\top \xi + \xi^\top X_k = 0\}$. Then, for $X_k \in {\rm St}(d,r)$, $\grad f(X_k)$ is the projection of the Euclidean gradient $\nabla f(X_k)$ onto the tangent space, i.e., $\grad f(X_k) = \Pcal_{T_{X_k} {\rm St}(d,r)} (\nabla f(X_k))$. Note that the term $X_k(X_k^\top X_k - I_r)$ is exactly the gradient of the following quadratic penalty function
\[ \varphi(X) := \frac{1}{4}\|X^\top X - I\|^2. \] 
As will be shown in our theorem, the use of the projected gradient is essential for landing on the manifold. This differs from the usual penalty method, which optimizes $f(X) + \mu \varphi(X)$ using the update $X_{k+1} = X_k - \alpha \nabla f(X_k) - \mu X_k(X_k^\top X_k - I_r)$ and needs $\mu \rightarrow \infty$ to guarantee feasibility.  
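
To see that $X(X^\top X - I)$ is indeed the gradient of $\varphi$, the following short computation (added here as a reader's check, using only the definition of $\varphi$) may help. For any direction $H \in \R^{d\times r}$, since $X^\top X - I$ is symmetric, the directional derivative is
\[ \mathrm{D}\varphi(X)[H] = \frac{1}{2}\iprod{X^\top X - I}{H^\top X + X^\top H} = \iprod{X^\top X - I}{X^\top H} = \iprod{X(X^\top X - I)}{H}, \]
so that $\nabla \varphi(X) = X(X^\top X - I)$. In particular, $\nabla \varphi$ vanishes on ${\rm St}(d,r)$, which is why the penalty term leaves feasible points unchanged.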


% The landing algorithms for optimization over the Stiefel manifold have been investigated in \cite{ablin2022fast,gao2022optimization}. The analysis therein focuses on the continuous counterpart of the algorithms and is  unclear how to generalize to the practical setting. 
\subsection{Explicit choice for the penalty parameter}
It is known that a large penalty parameter yields better feasibility \cite[Chapter 17]{nocedal1999numerical}. To make the iterative scheme \eqref{eq:grad-it} penalty parameter-free, we need to carefully investigate the landscape of the following optimization problem: 
\be \label{prob:feasi}  \min_{X\in \R^{d\times r}} \;\; \varphi(X). \ee
It can be easily verified that problem \eqref{prob:feasi} is nonconvex and that its optimal solution set is ${\rm St}(d,r)$. The key to obtaining an explicit formula for $\mu$ is to establish a strong convexity-type inequality and to show that the gradient descent method with step size $\mu$ converges linearly.  

For any $X \in \R^{d\times r}$, let us denote by $\bar{X}:= \calP_{{\rm St}(d,r)}(X)$ its projection onto the Stiefel manifold. Let $X = USV^\top$ be the singular value decomposition, where $U\in \R^{d\times r}$ has orthonormal columns, $V \in \R^{r\times r}$ is orthogonal, and $S \in \R^{r\times r}$ is diagonal; then $\bar{X} = UV^\top$. Building on this notation, we demonstrate that problem \eqref{prob:feasi} satisfies the restricted secant inequality (RSI) \cite{zhang2013gradient}, which serves as an alternative to strong convexity in the linear convergence analysis of gradient-type methods. 
\begin{lem} \label{lem:rsi}
    For any $X \in \R^{d\times r}$ with $\| X - \bar{X} \| \leq \frac{1}{8}$, we have
    \be \label{eq:rsi} \iprod{\nabla \varphi(X)}{X - \bar{X}} \geq \| X - \bar{X} \|^2. \ee
\end{lem}
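
For intuition, the inequality \eqref{eq:rsi} can be checked directly in singular-value coordinates. The following is only an illustrative sketch; the complete argument is the one given in the appendix. Write $X = USV^\top$ as above, with singular values $s_1, \ldots, s_r$, so that $X - \bar{X} = U(S - I)V^\top$ and $\nabla \varphi(X) = X(X^\top X - I) = U(S^3 - S)V^\top$. Then
\[ \iprod{\nabla \varphi(X)}{X - \bar{X}} = \sum_{i=1}^r (s_i^3 - s_i)(s_i - 1) = \sum_{i=1}^r s_i(s_i+1)(s_i - 1)^2, \qquad \|X - \bar{X}\|^2 = \sum_{i=1}^r (s_i - 1)^2. \]
If $\|X - \bar{X}\| \leq \frac{1}{8}$, then $|s_i - 1| \leq \frac{1}{8}$ for every $i$, hence $s_i(s_i+1) \geq \frac{7}{8}\cdot\frac{15}{8} = \frac{105}{64} > 1$, and \eqref{eq:rsi} follows term by term.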

With the above RSI, we have the linear convergence of the gradient descent update for \eqref{prob:feasi}, i.e.,
\be \label{eq:grad-feasi} X_{k+1} = X_k - \mu \nabla \varphi(X_k). \ee
\begin{lem}  \label{lem:linear-feasi}
    Let the sequence $\{X_k\}$ be generated by \eqref{eq:grad-feasi} with $\mu = \frac{1}{3}$. Suppose that $\| X_0 - \bar{X}_0 \| \leq \frac{1}{8}$. We have
    \be \label{eq:grad-feasi-linear} \|X_{k+1} - \bar{X}_{k+1}\|^2 \leq \frac{2}{3}\|X_k - \bar{X}_k \|^2. \ee
\end{lem}
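
As a sanity check of the rate (an illustrative computation, not the proof in the appendix), note that the update \eqref{eq:grad-feasi} acts on each singular value separately: if $X_k = USV^\top$, then $X_{k+1} = U\bigl(S - \mu(S^3 - S)\bigr)V^\top$. Each singular value $s$ is therefore mapped to $s^+ = s - \mu(s^3 - s)$, and
\[ s^+ - 1 = (s - 1)\bigl(1 - \mu\, s(s+1)\bigr). \]
For $\mu = \frac{1}{3}$ and $|s - 1| \leq \frac{1}{8}$ we have $s(s+1) \in \bigl[\frac{105}{64}, \frac{153}{64}\bigr]$, so
\[ \frac{13}{64} \leq 1 - \frac{s(s+1)}{3} \leq \frac{29}{64}, \qquad \text{hence} \qquad |s^+ - 1| \leq \frac{29}{64}\,|s - 1|. \]
Since $s^+ > 0$, the projection of $X_{k+1}$ is again $UV^\top$, and summing $(s_i^+ - 1)^2$ over $i$ gives a contraction factor of at most $\bigl(\frac{29}{64}\bigr)^2 < \frac{2}{3}$ for the squared distance, which is consistent with \eqref{eq:grad-feasi-linear}. The new singular values also stay within $\frac{1}{8}$ of $1$, so the same bound applies at the next iteration.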

The proofs of Lemmas \ref{lem:rsi} and \ref{lem:linear-feasi} can be found in Appendix \ref{appendix:proof}.
% \begin{proof}
%     Assume that $\|X_k - \bar{X}_{k} \| \leq 1$. Denote the SVD of $X_k$ by $USV^\top$. Let $s = {\rm diag}(S)$. Then, we have $0 \leq s_i \leq 2$ for any $i$. This implies 
%     \be \label{eq:grad-bound} \|\nabla \varphi(X_k)\|^2 = {\rm tr}((S^3 - S)^2) \leq 16 \|X_k -\bar{X}_k\|^2. \ee
%     Hence, we have
%     \[ \begin{aligned}
%         \|X_{k+1} - \bar{X}_{k+1}\|^2 & \leq  \| X_{k+1} - \bar{X}_{k} \|^2 \\
%         & = \|X_k - \frac{1}{16}\nabla \varphi(X_k) - \bar{X}_k \|^2 \\
%         & = \| X_k -  \bar{X}_k \|^2 - \frac{1}{8}\iprod{X_k -  \bar{X}_k}{\nabla \varphi(X_k)} + \frac{1}{16^2}\| \nabla \varphi(X_k) \|^2 \\
%         & \leq (1 - \frac{1}{8} + \frac{1}{16^2} \times 16) \|X_k - \bar{X}_k\|^2 \\
%         & = \frac{15}{16}\|X_k - \bar{X}_k\|^2,
%     \end{aligned} \]
%     where the first inequality is from $\bar{X}_{k+1} = \argmin_{X \in {\rm St}(d,r)} \|X - X_k\|^2$ and the second inequality is due to Lemma \ref{lem:rsi} and \eqref{eq:grad-bound}.
% \end{proof}
\subsection{Landing on the Stiefel manifold}
Building on the established linear convergence of gradient descent for problem \eqref{prob:feasi}, we are now able to show that the iterates generated by \eqref{eq:grad-it} will land on the Stiefel manifold eventually, and the limiting point is a stationary point of \eqref{prob}, i.e., $\grad f(X_\infty) = 0$. 

% We are going to show the convergence of the gradient descent method \eqref{eq:grad-it} with $\beta = \frac{1}{16}$. 
Let us start with the Lipschitz continuity of $\grad f(X)$. For any $X \in \bar{U}_{{\rm St}(d,r)}(\frac{1}{8})$, we define $\Pcal_{T_X {\rm St}(d,r)}(U) = U - X {\rm sym}(X^\top U)$ for $U \in \R^{d\times r}$. We first have the following quadratic upper bound on $f$ from its twice differentiability and the compactness of ${\rm St}(d,r)$. 
\begin{lem} 
    There exists a constant $L > 0$ such that for any $X, Y\in {\rm St}(d,r)$, the following quadratic upper bound holds:
    \be \label{eq:qub} f(Y) \leq f(X) + \iprod{\grad f(X)}{Y-X} + \frac{L}{2} \|Y-X\|^2. \ee
    In addition, there exists a constant $\hat{L} > 0$ such that for any $X \in {\rm St}(d,r), Y \in \bar{U}_{{\rm St}(d,r)}(\frac{1}{8})$, 
    \be \label{eq:grad-lip} \| \grad f(X) - \grad f(Y) \| \leq \hat{L} \|X - Y\|. \ee
\end{lem}
% \begin{proof}
%     Due to the twice differentiability  of $f$ and the compactness of ${\rm St}(d,r)$, the inequality \eqref{eq:qub} directly follows from \cite[Lemma 2.4]{chen2021decentralized} and \cite[Lemma 4.2]{deng2023decentralized}, where $L:= L_f + D_f$ with $L_f$ being the Lipschitz constant of $\nabla f(X)$ over ${\rm St}(d,r)$ and $D_f:= \max_{X \in {\rm St}(d,r)} \|\nabla f(X)\|$. 

%     For the second argument, we have
%     \[ \begin{aligned}
%         & \| \grad f(X) - \grad f(Y) \| \\
%         \leq & \| \Pcal_{T_{X}{\rm St}(d,r)}(\nabla f(X)) - \Pcal_{T_{X}{\rm St}(d,r)}(\nabla f(Y)) \| + \| \Pcal_{T_{X}{\rm St}(d,r)}(\nabla f(Y)) - \grad f(Y) \| \\
%         \leq & L_f\|X-Y\| + \frac{1}{2}\| X(X^\top \nabla f(Y) + \nabla f(Y)^\top X) - Y(Y^\top \nabla f(Y) + \nabla f(Y)^\top Y) \| \\
%         \leq & L_f\|X -Y\| + \frac{1}{2}\| X((X-Y)^\top \nabla f(Y) + \nabla f(Y)^\top (X-Y))\| \\
%         & + \| (X-Y)(Y^\top \nabla f(Y) + \nabla f(Y)^\top Y) \| \\
%         \leq & L_f \|X-Y\| + \frac{1}{2}(2\hat{D}_f + 3\hat{D}_f) \|X -Y\| \\
%         = & (L_f + \frac{5}{2}\hat{D}_f) \|X - Y\|,
%     \end{aligned} \]
%     where $\hat{D}_f:= \max_{X \in \bar{U}_{{\rm St}(d,r)}(\frac{1}{2})} \|\nabla f(X)\|$,
%     the second inequality is due to the contractive property of $\Pcal_{T_{X}{\rm St}(d,r)}$,  and the last inequality is from the fact that $\|Y\|_2 \leq \frac{3}{2}$ . By setting $\hat{L} = L_f + \frac{5}{2}\hat{D}_f$, we complete the proof.
% \end{proof}


By the linear convergence result in Lemma \ref{lem:linear-feasi}, we have the following bound on the feasibility error.
\begin{lem} \label{lem:err-feasi}
    Let $\{X_k\}$ be the sequence generated by \eqref{eq:grad-it} with $\mu = \frac{1}{3}$ and $\| X_0 - \bar{X}_0 \| \leq \frac{1}{8}$. We have
    \be \label{eq:err-feasi} \|X_{k+1} - \bar{X}_{k+1}\| \leq \frac{2}{3} \| X_k - \bar{X}_k \| + \alpha \|\grad f(X_k)\|. \ee
\end{lem}
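
To see what \eqref{eq:err-feasi} implies over many iterations, one can unroll the recursion. The following is a sketch under an extra assumption made only for illustration: write $e_k := \|X_k - \bar{X}_k\|$ and suppose that $\|\grad f(X_k)\| \leq G$ along the iterates, where $G$ is a hypothetical constant not used elsewhere in this paper. Then
\[ e_K \leq \left(\frac{2}{3}\right)^{K} e_0 + \alpha \sum_{j=0}^{K-1} \left(\frac{2}{3}\right)^{K-1-j} \|\grad f(X_j)\| \leq \left(\frac{2}{3}\right)^{K} e_0 + 3\alpha G, \]
where the last step uses $\sum_{i \geq 0} (2/3)^i = 3$. In words, the initial infeasibility is forgotten at a linear rate, and the remaining feasibility error is controlled by the step size $\alpha$ times the size of the projected gradients.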
% \begin{proof}
%     It follows that
%     \[ \begin{aligned}
%         \|X_{k+1} - \bar{X}_{k+1} \| & \leq \| X_{k+1} - \bar{X}_{k} \| \\
%         & \leq \| X_{k}  - \beta \varphi(X_k) - \bar{X}_{k} \| + \alpha \| \grad f(X_k)\| \\
%         & \leq \frac{15}{16} \| X_k - \bar{X}_k \| + \alpha \| \grad f(X_k)\|. 
%     \end{aligned} \]
%     We complete the proof.
% \end{proof}

The following one-step descent lemma on $f$ is crucial in establishing the convergence. 
\begin{lem} \label{lem:descent}
    Let $\{X_k\}$ be the sequence generated by \eqref{eq:grad-it} with $\mu = \frac{1}{3}$ and $\| X_0 - \bar{X}_0 \| \leq \frac{1}{8}$. We have
    \be\label{eq:descent} 
    \begin{aligned}
    f(\bar{X}_{k+1}) - f(\bar{X}_k) \leq & - (\alpha - (4\hat{L}^2 + 4L + 1) \alpha^2) \|\grad f(X_k)\|^2 + \frac{1}{2}\|X_{k+1} - \bar{X}_{k+1}\|^2 \\ 
    & + \frac{1}{2}\left(4\hat{D}_f + 16\hat{L}^2 + 16 L + 3 \right) \|X_k - \bar{X}_k\|^2. 
    \end{aligned}\ee
\end{lem}
% \begin{proof}
% It follows from \eqref{eq:qub} that
%     \be \label{eq:descent-ex} \begin{aligned} 
%      & f(\bar{X}_{k+1}) - f(\bar{X}_k) \leq  \iprod{\grad f(\bar{X}_k)}{\bar{X}_{k+1} - \bar{X}_k} + \frac{L}{2}\| \bar{X}_{k+1} - \bar{X}_k \|^2 \\
%     \leq & \iprod{\grad f(\bar{X}_k)}{\bar{X}_{k+1} - X_{k+1} + X_k- \bar{X}_k} + \iprod{\grad f(\bar{X}_k)}{X_{k+1} - X_k} \\
%     & + 2L \| X_{k+1} - X_k \|^2 \\
%     \leq &  \iprod{\grad f(\bar{X}_k)}{\bar{X}_{k+1} - X_{k+1}} + \iprod{\grad f(\bar{X}_k)}{X_{k+1} - X_k} \\
%     & + 4L (\alpha^2 \|\grad f(X_k)\|^2 + \beta^2 \|\nabla \varphi(X_k)\|^2) \\
%     = &  \iprod{\grad f(\bar{X}_k) - \grad f(\bar{X}_{k+1}) }{\bar{X}_{k+1} - X_{k+1}} + \iprod{\grad f(X_k) }{X_{k+1} - X_k} \\
%     & + \iprod{\grad f(\bar{X}_k) - \grad f(X_k)}{X_{k+1} - X_k} \\
%     & + 4L (\alpha^2 \|\grad f(X_k)\|^2 + \beta^2 \|\nabla \varphi(X_k)\|^2) \\
%     \leq & 2\hat{L}^2 \|X_{k+1} - X_k\|^2 + \frac{1}{2} \|X_{k+1} - \bar{X}_{k+1}\|^2 - \alpha \|\grad f(X_k)\|^2 \\
%     & - \beta \iprod{\grad f(X_k)}{\nabla \varphi(X_k)} + \frac{1}{2}(\hat{L}^2\|X_k -\bar{X}_k\|^2 + \|X_{k+1} - X_k\|^2) \\
%     & + 4L (\alpha^2 \|\grad f(X_k)\|^2 + \beta^2 \|\nabla \varphi(X_k)\|^2) \\
%     \leq & - \alpha \|\grad f(X_k)\|^2 - \beta \iprod{\nabla f(X_k)}{\calP_{T_{X_k} {\rm St}(d,r)} (\nabla \varphi(X_k))} + \frac{1}{2}\|X_{k+1} - \bar{X}_{k+1}\|^2 \\
%     & + \frac{1}{2}\|X_{k} - \bar{X}_{k}\|^2 + (4\hat{L}^2 + 4L + 1) ( \alpha^2 \| \grad f(X_k) \|^2 + \beta^2 \|\nabla \varphi(X_k)\|^2) \\
%     \leq & - (\alpha - (4\hat{L}^2 + 4L + 1) \alpha^2) \|\grad f(X_k)\|^2 + \frac{1}{2}\|X_{k+1} - \bar{X}_{k+1}\|^2 \\
%     & + (10 \beta \hat{D}_f + \frac{1}{2} + 16(4\hat{L}^2 + 4L + 1)\beta^2) \|X_k - \bar{X}_k\|^2,
%     \end{aligned} \ee
%     where the second inequality is from the 2-Lipschitz continuity of $\Pcal_{{\rm St}(d,r)}$ over $\bar{U}_{{\rm St}(d,r)}(\frac{1}{2})$, the third inequality is due to the facts that $X_k - \bar{X}_k \in N_{\bar{X}_k} {\rm St}(d,r)$ and $\iprod{A}{B} \leq \frac{1}{2}(\|A\|^2 + \|B\|^2)$ for any $A, B \in \R^{n \times d}$, and the last inequality comes from 
%     \[ \|\calP_{T_{X_k} {\rm St}(d,r)} (\nabla \varphi(X_k)) \| = \|X_k(X_k^\top X_k - I)^2\| \leq 10 \|X_k - \bar{X}_k\|^2. \]
%     Plugging $\beta = \frac{1}{16}$ into \eqref{eq:descent-ex} gives \eqref{eq:descent}.
% \end{proof}

From the above lemma, the one-step decrease in $f$ is related to both the gradient norm of $f$ and the feasibility error. To establish convergence, we need both $\grad f(X_k)$ and $\|X_k^\top X_k - I\|$ to converge to 0. The following theorem demonstrates that the retraction-free and penalty parameter-free update \eqref{eq:grad-it} converges. 
% This can be guaranteed by combining Lemmas \ref{lem:linear-feasi} and \ref{lem:descent}. 

\begin{thm}
    Let $\{X_k\}$ be the sequence generated by \eqref{eq:grad-it} with $\mu = \frac{1}{3}$ and $\| X_0 - \bar{X}_0 \| \leq \frac{1}{8}$. If the step size $\alpha < \frac{1}{2c_1}$ for some $c_1$ large enough, then
    we have 
    \be \label{eq:complexity} \min_{k =0, \ldots, K} \|\grad f(X_k)\|^2 \leq \frac{1}{K}, \quad \min_{k=0, \ldots, K} \|X_k^\top X_k - I\|^2 \leq \frac{1}{K}.  \ee
\end{thm}

The proofs of the above lemmas and theorem are presented in Appendix \ref{appendix:proof}.





\section{Accelerate LoRA fine-tuning with landing}
\label{sec: accelerate}
In this section, we will first clarify where the Stiefel manifold constraint comes from in the LoRA fine-tuning. Then, we will apply the above developed retraction-free and penalty parameter-free method to enhance LoRA fine-tuning. 

% we will apply the developed landing theory to accelerate the LoRA fine-tuning. The flow is as follows: We first clarify why the LoRA fine-tuning is an optimization over the Stiefel manfiold and then apply the retraction-free and penalty parameter-free method to train it. 
\subsection{Manifold optimization formulation of LoRA fine-tuning}
% Let us briefly introduce the idea of LoRA \cite{hu2021lora}.\revise{can delete} 
In neural networks, the dense layers perform matrix multiplication, and the weight matrices in these layers usually have a full rank. However, when adapting to a specific task, pre-trained language models have been shown to have a low intrinsic dimension, allowing them to learn efficiently even with a random projection to a smaller subspace.
% Building on this idea, the authors in LoRA \cite{hu2021lora} hypothesize that the updates to the weights during adaptation also have a low intrinsic rank. Denote the pre-trained weight matrix by $W_0$. The update in the LoRA fine-tuning is given by  
% \[  W = W_0 + \Delta W = W_0 + BA, \]
% where $W \in \R^{n\times m}, B \in \R^{d\times r}, A \in \R^{d \times m}$, and the rank $d \ll \min(n,m)$.  
% Due to the low rankness, the computation of matrix-vector multiplication is reduced from $\mathcal{O}(nm)$ of $\Delta W x$ to $\mathcal{O}(d(n+m))$ of $BAx$. During training, $W_0$ is frozen and does not receive gradient updates, while $A$ and $B$ are updated. For the initialization, they use a random Gaussian initialization for $A$ and set $B$ as zero.
One possible drawback of the current LoRA fine-tuning framework is that the low-rank decomposition of $\Delta W$ into the product $BA$ is not unique. Specifically, for any invertible matrix $C$, it holds that $BA = (BC)(C^{-1} A)$. Note that $BC$ shares the same column space with $B$. This suggests optimizing the subspace generated by $B$ instead of $B$ itself. Numerous studies in the field of low-rank optimization, e.g., \cite{boumal2011rtrmc,dai2011subspace,dai2012geometric}, investigate the manifold geometry of the low-rank decomposition and develop efficient algorithms. However, such geometry has not been explored in LoRA fine-tuning. 

To address such redundancy (i.e., the non-uniqueness of $BA$ representations), we regard $B$ as the basis through the manifold constraint and $A$ as the coordinate of $\Delta W$ under $B$. Hence, the optimization problem can be formulated as 
\be \label{prob:man} \min_{A,B} \quad L(BA), \quad \st \quad B \in {\rm St}(d,r) {\rm ~or~} B \in {\rm Ob}(d,r), \ee
where ${\rm Ob}(d,r):=\{B \in \R^{d\times r}: {\rm diag}(B^\top B) =\mathbf{1}\}$. Compared to the Stiefel manifold ${\rm St}(d,r)$, the oblique manifold ${\rm Ob}(d,r)$ necessitates that the matrix $B$ has unit norms in its columns, without imposing requirements for orthogonality between the columns. Problem \eqref{prob:man} is an optimization problem over the product of manifolds and Euclidean spaces. 
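
The constraints in \eqref{prob:man} do not shrink the set of attainable updates $\Delta W$. As a short illustration (added here, relying only on the standard thin QR factorization), any $B \in \R^{d\times r}$ can be written as $B = QR$ with $Q \in {\rm St}(d,r)$ and $R \in \R^{r\times r}$ upper triangular, so that
\[ BA = (QR)A = Q(RA), \]
i.e., the same $\Delta W$ is represented by the pair $(Q, RA)$ with $Q$ on the Stiefel manifold. The remaining ambiguity is only an orthogonal change of basis, $Q(RA) = (QP)(P^\top RA)$ for any orthogonal $P \in \R^{r\times r}$. Similarly, if all columns of $B$ are nonzero and $D$ denotes the diagonal matrix of their norms, then $BA = (BD^{-1})(DA)$ with $BD^{-1} \in {\rm Ob}(d,r)$.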

% To remove the influence of the rank $r$, they give a step size for $\Delta W$ and do the update
% \[ W = W_0 + \frac{\alpha}{r} BA, \]
% where $\alpha$ is a constant. 
% \begin{figure}[!htp]
%     \centering
%     \includegraphics[width = 0.4\textwidth]{fig/LoRA-exp.png}
% \end{figure}


% Although the best rank $r$ is generally  unknown, the numerical results show that a fixed and small $r$ can give satisfactory results compared with other adapter methods.

\subsection{Manifold-LoRA}
% In the last decades, manifold optimization has been a hot topic, and finds great applications in machine learning, signal/image processing, and computational sciences. Let us first connect LoRA with manifold optimization.\revise{spoekn} Basically, LoRA wants to update the parameter $W$ in a low rank space, which is represented by two smaller scaled matrices $B,A$. For an update $\Delta W$ with rank $r$, this kind of representation may not be unique, namely, for any $C = {\rm diag}(c_1, c_2, \ldots, c_r)$ with $c_i \ne 0$, 
% \[ BA = BCC^{-1}A = \tilde{B} \tilde{A}, \]
% where $\tilde{B} = BC \in \R^{d\times r}$ and $\tilde{A} = C^{-1}A \in \R^{r \times k}$. Motivating by the orthogonal dictionary learning \cite{tovsic2011dictionary}, we can regard the matrix $B$ as the basis, hence is with unit column norms, and $A$ as the representation of $\Delta W$ under $B$. In this way, the new optimization model takes the form
% \be \label{prob:man} \min_{A,B} \quad L(A,B), \quad \mbox{s.t.} \quad B \in {\rm St}(d,r) {\rm ~or~} B \in {\rm Ob}(d,r), \ee
% where ${\rm St}(d,r):=\{B \in \R^{d\times r}: B^\top B = I\}$ and ${\rm Ob}(d,r):=\{B \in \R^{d\times r}: {\rm diag}(B^\top B) =1\}$. 

% To solve \eqref{prob:man}, there are two type of approaches: retraction-based and retraction-free algorithms. The former one is to use the well-developed retraction operator from the manifold geometry to keep every iterate is strictly feasible. For the latter, since the retraction operator is not used, the iterate may not be feasible, but will asymptotically feasible from a good initialization in the continuous limit. One significant benefit of retraction-free method is to avoid possibly prohibitive computations of the retraction operator (e.g., the projection to a manifold).
% presenting its exact convergence in the discrete setting.
The retraction-free method is well-suited to address \eqref{prob:man}, simultaneously minimizing the loss function $L(BA)$ and the constraint violation. To control the constraint violation, we use the quadratic penalties $R_s(B):=\|B^\top B - I\|^2$ and $R_o(B):=\|{\rm diag}(B^\top B) - \mathbf{1}\|^2$ for the Stiefel manifold and the oblique manifold, respectively. As shown by the landing theory in Section \ref{sec:landing}, we should use the projected gradient of the loss instead of its Euclidean gradient. For the Stiefel manifold and the oblique manifold, the respective projected gradients are
\be \label{eq:grad1} \grad_B L(BA)= \nabla_B L(BA) - B{\rm sym}(B^\top \nabla_B L(BA)) \ee
and 
\be \label{eq:grad2} \grad_B L(BA) = \nabla_B L(BA) - B{\rm diag}({\rm diag}(B^\top \nabla_B L(BA))), \ee
where ${\rm sym}(X):= (X + X^\top) /2$. 
Thus, the update directions used by our retraction-free method for $A$ and $B$ are $\nabla_A L(BA)$ and $\grad_B L(BA) + \mu \nabla R_s(B)$ (or $\grad_B L(BA) + \mu \nabla R_o(B)$), respectively, where $\mu$ is the penalty parameter. 
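
For concreteness, the penalty gradients can be written out explicitly. The following is a short sketch, assuming that $\|\cdot\|$ in $R_s$ and $R_o$ denotes the Frobenius (respectively Euclidean) norm and writing $b_j$ for the $j$-th column of $B$ and $G := \nabla_B L(BA)$ (both introduced here for illustration only):
\[ \nabla R_s(B) = 4 B (B^\top B - I), \qquad \nabla R_o(B) = 4 B \, {\rm diag}\big({\rm diag}(B^\top B) - \mathbf{1}\big), \]
so that the $j$-th column of $\nabla R_o(B)$ is $4(\|b_j\|^2 - 1) b_j$. Both penalty gradients vanish when $B$ lies on the corresponding manifold. Moreover, when $B^\top B = I$, the direction in \eqref{eq:grad1} satisfies
\[ B^\top \grad_B L(BA) + \big(\grad_B L(BA)\big)^\top B = \big(B^\top G + G^\top B\big) - 2\,{\rm sym}(B^\top G) = 0, \]
i.e., it lies in the tangent space of ${\rm St}(d,r)$ at $B$.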

Note that $B$ and $A$ represent the basis and the coordinates of $\Delta W$, respectively. This results in different magnitudes and different Lipschitz constants of their gradient functions. In fact, let $X = BA$. It follows from the chain rule that
\[ \nabla_A L(BA) = B^\top\nabla_X L(X), \quad \nabla_B L(BA) = \nabla_X L(X) A^\top. \]
Then, since $\nabla_A L(BA_1) - \nabla_A L(BA_2) = B^\top \left(\nabla_X L(BA_1) - \nabla_X L(BA_2)\right)$ and $\|B(A_1 - A_2)\| \leq \|B\|_2 \|A_1 - A_2\|$ (and analogously for $B$), 
\[ \begin{aligned}
    \|\nabla_A L(BA_1) - \nabla_A L(BA_2) \| & \leq \|B\|_2^2 L_g \|A_1 - A_2\|, \\
    \|\nabla_B L(B_1A) - \nabla_B L(B_2A) \| & \leq \|A\|_2^2 L_g \|B_1 - B_2\|,
\end{aligned}
   \]
   where $L_g$ is the Lipschitz constant of $\nabla_X L(X)$ and $\|\cdot \|_2$ denotes the matrix $\ell_2$ norm (i.e., the largest singular value). Note that the step size of gradient-type algorithms should generally be proportional to the reciprocal of the Lipschitz constant \cite{nocedal1999numerical,bottou2018optimization}. Hence, we schedule the learning rates for the two matrices based on their respective $\ell_2$ norms. Having prepared the above, we incorporate our manifold-accelerated technique into the AdamW optimizer \cite{loshchilov2018decoupled} to enhance LoRA fine-tuning, as presented in Algorithm~\ref{alg:manlora}.

% Due to the asymmetric in updating for $A$ and $B$, resulting in distinct Lipschitz constants, we schedule the learning rates for the two matrices based on their respective Frobenius \revise{$L_2$} norms. Having prepared the above, we incorporate the AdamW optimizer with our manifold-accelerated technique to enhance the training during fine-tuning, as presented in Algorithm~\ref{alg:manlora}.

% In this work, we are focusing on the retraction-free method for solving \eqref{prob:man}. Moreover, its exact convergence in the discrete setting will also be presented. The retraction-free method jointly minimize the loss function $L(A,B)$ and the constraint violation. To control the feasibility, we use the quadratic penalties $R_s(B):=\|B^\top B - I\|^2$ and $R_o(B):=\|{\rm diag}(B^\top B) - 1\|^2$ for the Stiefel manifold and oblique manifold, respectively. Another key component in establishing the landing theory is the use of Riemannian gradient of the loss part instead of Euclidean gradient. For the Stiefel manifold and the oblique manifold, the respective Riemannian gradients are 
% \[  \grad_B L(A,B)= \nabla_B L(A,B) - B{\rm sym}(B^\top \nabla_B L(A,B)) \]
% and 
% \[  \grad_B L(A,B) = \nabla_B L(A,B) - B{\rm diag}({\rm diag}(B^\top \nabla_B L(A, B))), \]
% where ${\rm sym}(X):= (X + X^\top) /2$. 

% Having the above preparation, the gradients of our retraction-free method for $A$ and $B$ are $\nabla_A L(A,B)$ and $\grad_B L(A,B) + \lambda \nabla_B R_s(B)({\rm ~or~}\nabla R_o(B))$. \revise{In consideration of different Lipschitz constant for $A$ and $B$ induced by the norms themselves, choose different step size.}


% We use the stochastic version of these gradients and apply the AdamW method, presenting in Algorithm \ref{alg:manlora}.
% This is an unconstrained optimization problem. To solve it, we can employ the popular gradient based stochastic optimization methods, e.g., SGD and AdamW. The key component of implementing these algorithms is the expression of the gradient. For our problem \eqref{eq:man-lora}, one natural way is to use the usual Euclidean gradient. Since $B$ will approximately lie in the manifold, we can also the Riemannian gradient instead of the Euclidean one for $B$, namely,
% \[ \nabla_B L(A,B) - B{\rm sym}(B^\top \nabla_B L(A,B))  + 4 \lambda B(B^\top B - I),  \]
% where 


% we need the gradients of $L$ with respect to $A$ and $B$, which is the same as in LoRA. Once the gradients are obtained, we perform the usual gradient descent on $A$ but Riemannian gradient descent on $B$ since the Riemannian manifold constraint ${\rm Ob}(d,r)$ is involved.

% To solve \eqref{prob:man}, we can use Riemannian stochastic optimization over manifold. Specifically, in the $k$-th iteration,


\begin{algorithm2e}[t]
\small
\caption{Manifold-LoRA}
\SetAlgoLined
\DontPrintSemicolon  % To not print semicolon at the end of each line
\SetKwInput{KwInput}{Input} % Set the Input
\KwInput{Initial point $A_0, B_0$, $\mu \in \mathbb{R}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $upper\_bound \geq lower\_bound > 0$, $\epsilon = 10^{-8}$, $\gamma > 0$, $\lambda \in \mathbb{R}$, and $k = 0$.}
\While{Stopping conditions not met}{
    \For{$C \in \{A, B\}$}{
        \eIf{$C = B$}{
            Set $g(C_k)$ according to \eqref{eq:grad1} or \eqref{eq:grad2} using the stochastic estimate of $\nabla_B L(B_kA_k)$ \tcp*{Projected gradient for matrix $B$}
        }{
            Set $g(C_k)$ to be the stochastic estimate of $\nabla_A L(B_kA_k)$\;
        }
        % Compute $g_t(C_k)$ based on $\nabla_{C_k} L(C_k)$\;
        % \If{$C = B$}{
        %     $g_t(C_k) \gets g_t(C_k) + \mu \nabla_{C_k} R_s(C_k) (\text{or \;} \nabla_{C_k} R_o(C_k)) $ \tcp*{Riemannian gradient for matrix B}
        % }
        $m(C_k) \gets \beta_1 m(C_k) + (1 - \beta_1) g(C_k)$\;
        $v(C_k) \gets \beta_2 v(C_k) + (1 - \beta_2) g^2(C_k)$\;
        $\hat{m}(C_k) \gets \frac{m(C_k)}{1 - \beta_1^{k+1}}$\;
        $\hat{v}(C_k) \gets \frac{v(C_k)}{1 - \beta_2^{k+1}}$\;
        $\eta(C_k) \gets clip(\text{norm}_{C_k}, upper\_bound, lower\_bound)$ \tcp*{Scheduling step size of matrices $A$ and $B$}
        $C_{k+1} \gets C_k - \eta(C_k) \left(\hat{m}(C_k) / \left(\sqrt{\hat{v}(C_k)} + \epsilon\right)\right) - \lambda C_k$\;
        \If{$C = B$}{
            $C_{k+1} \gets C_{k+1} - \mu \nabla R_s(C_{k+1})({\rm ~or~}\nabla R_o(C_{k+1})) $ \tcp*{Apply penalty gradient for matrix $B$}
        }
    }
    $k \gets k + 1$\;
}
\label{alg:manlora}
\end{algorithm2e}
% \subsection{Adding sparsity regularization on $A$}
% Note that the matrix $A$ can be explained as the coefficient of $\Delta W$ under the basis $B$. For a large $r$, this basis maybe redundant and the associated coefficient of $\Delta W$ may be sparse. If a small $r$ is chosen, different columns of the desired $\Delta W$ may not lie in the rank $r$ space. Motivated by this, we can set a relative large $r$ and add a sparsity regularizer on $A$ for the rank adaptivity. In addition, such sparsity of $A$ also leads to efficient matrix-vector multiplication $Ax$. 

% \subsection{Manifold-regularized LoRA}
% When the hard constraint, e.g., the Stiefel manifold constraint for $B$, is added to the parameter, keeping feasibility of the algorithmic iterates is generally difficult. To achieve similar function with cheap computational cost, we consider the following penalized formulation of the manifold constraint, 
% \be \label{eq:man-lora} \min \tilde{L}(A,B) := L(A,B) + \lambda \| B^\top B - I \|_F^2.  \ee
% This is an unconstrained optimization problem. To solve it, we can employ the popular gradient based stochastic optimization methods, e.g., SGD and AdamW. The key component of implementing these algorithms is the expression of the gradient. For our problem \eqref{eq:man-lora}, one natural way is to use the usual Euclidean gradient. Since $B$ will approximately lie in the manifold, we can also the Riemannian gradient instead of the Euclidean one for $B$, namely,
% \[ \nabla_B L(A,B) - B{\rm sym}(B^\top \nabla_B L(A,B))  + 4 \lambda B(B^\top B - I),  \]
% where ${\rm sym}(X):= (X + X^\top) /2$. 

% \subsection{Second-order acceleration}

% \subsection{Soft constraint on the spectrum of $B$ and row-sparsity constraint on $A$}

% As the hard constraint for $B$, i.e., the oblique manifold, may be costly for algorithmic design, an alternative approach is to use the spectrum constraint on $B$, namely, changing the loss function to
% \[ L(A,B) + \lambda \| B^\top B - I \|_F^2,  \]
% where $\lambda$ is a hyperparameter. With this regularization term, $B$ is expected to yield the properties of an orthogonal basis of the low-dimension subspace. Then, we may further add the row-sparsity regularization on $A$, which can be explained as the coefficients of $\Delta W$ under the basis $B$. Specifically, 

% For $A = [a_1, \ldots, a_r]$, $\|A\|_{2,1}:=\sum_{i=1}^r \|a_i\|_2$. 

% Note that the $\ell_{2,1}$ norm can be seen as the $\ell_1$ norm on the $\ell_2$ norms of $r$ rows of $A$. By adding the $\ell_{2,1}$ norm on $A$, we construct the loss as
% \be \label{eq:loss}  L(A,B) + \lambda \| B^\top B - I \|_F^2 + \mu \|A\|_{2,1}, \ee
% where $\mu$ is a hyperparameter to control the sparsity of $A$. 

% To minimize \eqref{eq:loss}, one can perform the following update
% \[ \begin{aligned}
%     A_{k+1} & = {\rm prox}_{\mu \|\cdot \|_{2,1}}(A_k - \nabla_A L(A_k, B_k)), \\
%     B_{k+1} & = B_k - \eta_k (\nabla_B L(A_k,B_k) - B_k{\rm sym}(B_k^\top \nabla_B L(A_k,B_k))  + 4 \lambda B_k(B_k^\top B_k - I)).
% \end{aligned} \]
% where 
% \[ \operatorname{prox}_{\mu \|\cdot\|_{2,1}}\left(a_j\right)=\left(1-\frac{\mu}{\max \left\{\left\|a_j\right\|_2, \mu \right\}}\right) a_j \quad \forall j=1, \ldots, r. \]


\section{Experiments}
\label{section:exp}
In this section, we present the experimental results and analyze them in detail. The discussion is structured around two principal areas: (1) the performance gain over other mainstream fine-tuning methods and the accelerated convergence achieved by our manifold-constrained optimization approach; (2) the convergence of the matrix $B$ onto the manifold, illustrated by heat maps of $B^\top B$. 

\textbf{Baselines} We compare our approach against several baseline methods, including full fine-tuning, Adapter \cite{houlsby2019parameter}, BitFit \cite{zaken2021bitfit}, and LoRA \cite{hu2021lora}. The variants of the Adapter method are excluded from the baselines, as their performance is relatively similar.

\textbf{Implementation Details} Our code is based on PyTorch \cite{paszke2019pytorch}, Hugging Face Transformers \cite{wolf-etal-2020-transformers}, and OpenDelta \cite{hu2023opendelta}, an open-source plug-and-play library for parameter-efficient fine-tuning. The bottleneck dimension for the Adapter is set to 16 or 32, so that the number of trainable parameters aligns closely with that of the LoRA method, and the new layers are inserted into the attention layer and the feed-forward layer. The update of LoRA is scaled by a hyper-parameter $\alpha$, which is usually set to 16 or 32 and left untuned \cite{hu2021lora, yang2020feature}. The exponential moving average parameters $\beta_1$ and $\beta_2$ of AdamW \cite{loshchilov2017decoupled} are set to their default values of 0.9 and 0.999, respectively. All experiments are conducted on NVIDIA A800 GPUs. More details are presented in Appendix~\ref{appendix:hyper-params}.

\subsection{Natural language understanding}
We first evaluate our method with the backbone model DeBERTaV3-base \cite{he2021debertav3} on the GLUE benchmark \cite{wang2018glue}, which contains nine sub-datasets, including MNLI \cite{williams2017broad}, SST-2 \cite{socher2013recursive}, CoLA \cite{warstadt2019neural}, QQP \cite{wang2018glue}, QNLI \cite{rajpurkar2016squad}, RTE \cite{bentivogli2009fifth}, MRPC \cite{dolan2005automatically}, and STS-B \cite{wang2018glue}. 

Experimental results on the GLUE benchmark are recorded in Table \ref{tab:glue}. Our method attains the best average score and the best result on most tasks. Notably, on the RTE and STS-B datasets, both the sphere-constrained (i.e., oblique manifold-constrained) and the Stiefel-constrained variants show a clear performance gain even with only half the trainable parameters of the LoRA baseline, i.e., Sphere$_{r=8}$ and Stiefel$_{r=8}$ beat LoRA$_{r=16}$. In addition, with the help of manifold geometry, the fine-tuning process is significantly accelerated compared to the vanilla AdamW optimizer and achieves a lower training loss, as shown in Figure \ref{fig:glue-images}. In particular, on the CoLA dataset presented in Figure \ref{fig:sub-1}, our approach reaches the same training loss as the standard AdamW optimizer in nearly half the number of epochs.

\subsection{Question Answering}
We conduct an evaluation on two question answering datasets: SQuAD v1.1 \cite{rajpurkar2016squad} and SQuAD v2.0 \cite{rajpurkar2018know}. Manifold-LoRA is used to fine-tune DeBERTaV3-base for these tasks, which are treated as sequence labeling problems that predict the probability of each token being the start or end of an answer span. 
 
The main experimental results are presented in Table \ref{tab:squad}. For LoRA and our algorithms, new layers are inserted into $W_q, W_k, W_v, W_o, FC_1, FC_2$. Notably, both manifold-regularized LoRA variants consistently outperform all baseline fine-tuning methods. Additionally, we plot the training loss, evaluation exact match, and evaluation F1 scores against epochs in Figure \ref{fig:squadv2}. We conclude that the proposed Manifold-LoRA method achieves a $2\times$ speed-up in training epochs compared to AdamW, while simultaneously improving model performance. We also illustrate the heat map of $B^\top B$ in Figure \ref{fig:squadv2-heatmap}, which indicates that the matrix $B$ eventually lands on the manifold. This supports our assertion that landing on the manifold enhances the performance of LoRA.

\subsection{Natural Language Generation}
The E2E NLG Challenge \cite{novikova2017e2e}, introduced by Novikova et al., provides a dataset for training end-to-end, data-driven natural language generation systems and is widely used in data-to-text evaluations. The E2E dataset comprises approximately 42,000 training examples, 4,600 validation examples, and 4,600 test examples, all from the restaurant domain. We test our method on the E2E dataset using the GPT-2 Medium and Large models, following the experimental setup of LoRA \cite{hu2021lora}. For LoRA, we set the hyperparameters to match those specified in the original paper. 

The results on the E2E dataset are recorded in Table~\ref{tab:e2e}, where we focus on comparing LoRA and Manifold-LoRA. The results indicate that our proposed algorithm matches or outperforms LoRA on every reported metric. Moreover, as shown in Figure~\ref{fig:e2e-heatmap}, the matrix $B$ resides on the manifold even at the early training stage, validating the feasibility of our method.

\begin{table}[!t]
  \centering
  \small
  \caption{Results with DeBERTaV3-base on GLUE benchmark. We denote the best results in \textbf{bold}.}
  \label{tab:glue}
  \resizebox{\textwidth}{!}{
  \begin{tabular}{
  p{0.6cm}
  S[table-format=3.2]  % Updated this line
  S[table-format=2.2]@{/}S[table-format=2.2]
  S[table-format=2.2]
  S[table-format=2.2]
  S[table-format=2.2]@{/}S[table-format=2.2]
  S[table-format=2.2]
  S[table-format=2.2]
  S[table-format=2.2]
  S[table-format=2.2]
  S[table-format=2.2]
}
   \toprule
    {Method} & { \# Params} & \multicolumn{2}{c}{MNLI} & {SST-2} & {CoLA} & \multicolumn{2}{c}{QQP} & {QNLI} & {RTE} & {MRPC} & {STS-B} & {All} \\
    {} & {} & {m} & {mm} & {Acc} & {Mcc} & {Acc} & {F1} & {Acc} & {Acc} & {Acc} & {Corr} & {Ave.} \\
    \midrule
    \small{Full FT} &  184.42M & 90.45 & \textbf{90.60} & 95.48 & 68.17 & \textbf{91.99} & \textbf{89.12} & 93.60 & 79.28 & 88.93 & 90.92 & 87.85 \\
    \small{Adapter} & 0.61M & 90.13 & 90.16 & 94.86 & 69.37 & 91.38 & 88.46 & 93.54  & 81.87 & 89.12 & 91.52 & 88.06\\
    \small{BitFit} & 0.06M & 87.08 & 86.39 & 94.88 & 69.11 & 87.96 & 84.35 & 92.19 & 76.52 & 87.06 & 90.96 & 85.65 \\
    \small{LoRA\(_{r=8}\)} & 0.30M & 90.20 & 90.08 & 94.93 & 68.14 & 90.78 & 87.68 & 93.85 & 80.15 & 90.40 & 90.29 & 87.60 \\
    \small{LoRA\(_{r=16}\)} & 0.59M & 90.44 & 90.12 & 95.41 & 68.19 & 90.92 & 87.77 & 94.00 & 80.58 & 90.20 & 90.34 & 87.74\\
    \small{Sphere\(_{r=8}\)} & 0.30M & 90.37 & 90.09 & 95.48 & 69.55 & 91.25 & 88.34 & 94.02 & 82.44 & 91.55 & 91.26 & 88.44 \\
    \small{Sphere\(_{r=16}\)} & 0.59M & \textbf{90.52} & 90.19 & 95.64 & \textbf{70.14} & 91.46 & 88.65 & \textbf{94.29} & 82.16 & \textbf{91.67} & \textbf{91.59} & \textbf{88.63} \\
    \small{Stiefel\(_{r=8}\)} & 0.30M & 90.25 & 89.99 & 95.46 & 69.85 & 91.44 & 88.60 & 94.09 & \textbf{83.16} & 91.18 & 91.22 & 88.52\\
    \small{Stiefel\(_{r=16}\)} & 0.59M & 90.26 & 90.28 & \textbf{95.76} & 68.92 & 91.71 & 89.00 & 94.10 & 82.16 & 91.10 & 91.51 & 88.48 \\
    \bottomrule
\end{tabular}
}
\end{table}

\begin{figure}[!htp]
\small
\centering
% First row of images
\begin{subfigure}[t]{0.30\textwidth}
    \centering
    \includegraphics[width=\linewidth, height=4cm, keepaspectratio]{glue/cola_train_loss.pdf}
    \caption{CoLA train loss}
    \label{fig:sub-1}
\end{subfigure}
\hfill
\begin{subfigure}[t]{0.30\textwidth}
    \centering
    \includegraphics[width=\linewidth, height=4cm, keepaspectratio]{glue/qqp_loss.pdf}
    \caption{QQP train loss}
    \label{fig:sub-2}
\end{subfigure}
\hfill
\begin{subfigure}[t]{0.30\textwidth}
    \centering
    \includegraphics[width=\linewidth, height=4cm, keepaspectratio]{glue/stsb_train_loss.pdf}
    \caption{STS-B train loss}
    \label{fig:sub-3}
\end{subfigure}
\hfill
\vspace{0.5em}
% Second row of images
\begin{subfigure}[t]{0.30\textwidth}
    \centering
    \includegraphics[width=\linewidth]{glue/cola_eval_mcc.pdf}
    \caption{CoLA evaluation Matthews correlation}
    \label{fig:sub-4}
\end{subfigure}
\hfill
\begin{subfigure}[t]{0.30\textwidth}
    \centering
    \includegraphics[width=\linewidth]{glue/qqp_eval_accs.pdf}
    \caption{QQP evaluation accuracy}
    \label{fig:sub-5}
\end{subfigure}
\hfill
\begin{subfigure}[t]{0.30\textwidth}
    \centering
    \includegraphics[width=\linewidth]{glue/stsb_eval_pearson.pdf}
    \caption{STS-B evaluation Pearson correlation}
    \label{fig:sub-6}
\end{subfigure}
\hfill
\caption{Both the sphere-constrained and the Stiefel-constrained Manifold-LoRA converge faster and attain a lower training loss than the LoRA method within the same number of optimization steps on three distinct datasets: CoLA, QQP, and STS-B.}
\label{fig:glue-images}
\end{figure}

\begin{figure}[!htp]
\centering
\small
\begin{subfigure}[t]{0.32\textwidth}
    \centering
    \includegraphics[width=\linewidth, height=4cm, keepaspectratio]{squad/squad2_loss.pdf}
    \caption{SQuADv2.0 Train Loss}
    \label{fig:squad-1}
\end{subfigure}
\hfill
\begin{subfigure}[t]{0.32\textwidth}
    \centering
    \includegraphics[width=\linewidth, height=4cm, keepaspectratio]{squad/squad2_match.pdf}
    \caption{SQuADv2.0 Eval Exact Match}
    \label{fig:squad-2}
\end{subfigure}
\hfill
\begin{subfigure}[t]{0.32\textwidth}
    \centering
    \includegraphics[width=\linewidth, height=4cm, keepaspectratio]{squad/squad2_f1.pdf}
    \caption{SQuADv2.0 Eval F1}
    \label{fig:squad-3}
\end{subfigure}
\caption{The figures compare the training loss, evaluation exact match, and evaluation F1 metrics against the number of epochs for the SQuADv2.0 dataset.}
\label{fig:squadv2}
\end{figure}

% to ensure that the matrix $B$ lands on the manifold after certain epochs in Figure \ref{fig:squadv2-heatmap}.
\begin{table}[!t]
\centering
\small
\caption{Results with DeBERTaV3-base on SQuAD v1.1 and SQuADv2.0. We report EM/F1. The best results in each setting are shown in \textbf{bold}.}
\begin{tabular}{cccc}
\toprule
Methods & Params & SQuADv1.1 & SQuADv2.0 \\
\midrule
Full FT & 184M & 86.30 / 92.85 & 84.30 / 87.58 \\ 
Adapter\(_{r=16}\) & 0.61M & 87.46 / 93.41 & 85.30 / 88.23 \\
Adapter\(_{r=32}\) & 1.22M & 87.53 / 93.51 & 85.42 / 88.36 \\
BitFit & 0.07M & 80.26 / 88.79 & 74.21 / 87.19 \\
LoRA\(_{r=8}\) & 1.33M & 87.90 / 93.88 & 85.56 / 88.52\\
LoRA\(_{r=16}\) & 2.65M & 87.94 / 93.75 & 85.90 / 88.81\\
Sphere\(_{r=8}\) & 1.33M & 88.51 / \textbf{94.25} & 86.33 / 89.20 \\
Sphere\(_{r=16}\) & 2.65M & 88.32 / 94.03 & 86.15 / 89.03 \\
Stiefel\(_{r=8}\) & 1.33M & \textbf{88.68} / 94.23 & 86.35 / 89.09 \\
Stiefel\(_{r=16}\) & 2.65M & 88.25 / 94.04 & \textbf{86.41} / \textbf{89.22} \\ 
\bottomrule
\end{tabular}
\label{tab:squad}
\end{table}

\begin{figure}[!htp]
\small
\centering
\begin{subfigure}[t]{0.75\textwidth}
    \centering
    \includegraphics[width=\linewidth, height=4cm, keepaspectratio]{squad/squad2-Stefiel.pdf}
    % \caption{}
    \label{fig:squad2-Stiefel}
\end{subfigure}

\begin{subfigure}[t]{0.75\textwidth}
    \centering
    \includegraphics[width=\linewidth, height=4cm, keepaspectratio]{squad/squad2-sphere.pdf}

    \label{fig:squad2-sphere}
\end{subfigure}
\caption{The heat map of $B^\top B$ with the Stiefel manifold (the first and second rows) and the oblique manifold (the third and fourth rows) at the end of training on the SQuADv2.0 dataset.}
\label{fig:squadv2-heatmap}
\end{figure}

\begin{table}[!t]
\small
  \centering
  \caption{GPT-2 medium (M) and large (L) models were evaluated on the E2E NLG Challenge. * denotes results from previously published works.}
  \label{tab:e2e}
  \begin{tabular}{ccccccc}
  \toprule
    Model  &  Parameters & BLEU & NIST & MET & ROUGE-L & CIDEr \\
    \midrule
    GPT-2 M (FT)* & 354.92M & 68.2 & 8.62 & 46.2 & 71.0 & 2.47\\
    GPT-2 M (Adapter\textsuperscript{L})* & 0.37M & 66.3 & 8.41 & 45.0 & 69.8 & 2.40 \\
    GPT-2 M (Adapter\textsuperscript{L})* & 11.09M & 68.9 & 8.71 & 46.1 & 71.3 & 2.47 \\
    GPT-2 M (Adapter\textsuperscript{H})* & 11.09M & $67.3_{\pm .6}$  & $8.50_{\pm .07}$ & $46.0_{\pm .2}$ & $70.7_{\pm .2}$ & $2.44_{\pm .01}$ \\
    GPT-2 M (FT\textsuperscript{Top2})* & 25.19M & 68.1 & 8.59 & 46.0 & 70.8 & 2.41 \\
    GPT-2 M (PreLayer)* & 0.35M & 69.7 & 8.81 & 46.1 & 71.4 & 2.49\\
    GPT-2 M (LoRA) & 0.35M & 68.9  & 8.69 & 46.5 & 71.5 & 2.51\\
    % GPT-2 M (LoRA, r=8) & 0.70M & $69.4$  & $8.73$ & $46.6$ & $71.6$ & $2.52$\\
    GPT-2 M (Stiefel) & 0.35M & 70.1 & 8.82 & \textbf{46.8} & \textbf{71.7} & \textbf{2.53} \\
    % GPT-2 M(Stiefel, r=8) & 0.70M & 70.0 & 8.80 & 46.8 & \textbf{72.2} & 2.52 \\
    GPT-2 M (Sphere) & 0.35M & \textbf{70.3} & \textbf{8.83} & 46.7 & \textbf{71.7} & 2.52 \\
    % GPT-2 M(Sphere, r=8) & 0.70M & 70.0 & \textbf{8.81} & \textbf{46.9} & 71.8 & \textbf{2.53} \\
    \midrule
    GPT-2 L (FT)* & 774.03M & 68.5 & 8.78 & 46.0  & 69.9 & 2.45\\
    GPT-2 L (Adapter\textsuperscript{L})* & 0.88M & $69.1_{\pm .1}$  & $8.68_{\pm .03}$ & $46.3_{\pm .0}$ & $71.4_{\pm .2}$ & $2.49_{\pm .0}$ \\
    GPT-2 L (Adapter\textsuperscript{L})* & 23.00M & $68.9_{\pm .3}$  & $8.70_{\pm .04}$ & $46.1_{\pm .1}$ & $71.3_{\pm .2}$ & $2.45_{\pm .02}$ \\
    GPT-2 L (PreLayer)* & 0.77M & 70.3 & 8.85 & 46.2 & 71.7 & 2.47 \\
    GPT-2 L (LoRA) & 0.77M & 70.1  & 8.82 & 46.7 & 72.0 & 2.53 \\
    % GPT-2 L (LoRA, r=8) & 1.54M & $69.5$  & $8.77$ & $46.7$ & $71.4$ & 2.51 \\
    GPT-2 L (Stiefel) & 0.77M & 70.4 & 8.86 & \textbf{46.8} & 72.1 & 2.53 \\
    GPT-2 L (Sphere) & 0.77M & \textbf{70.9} & \textbf{8.92} & \textbf{46.8} & \textbf{72.5} & \textbf{2.55} \\
  \bottomrule
  \end{tabular}
\end{table}
\begin{figure}
    \begin{subfigure}[t]{0.499\textwidth}
        \centering
        \includegraphics[scale=0.22]{e2e/GPT-2-L-Stefiel.pdf}
        \label{fig:e2e-Stiefel}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.499\textwidth}
        \centering
        \includegraphics[scale=0.22]{e2e/GPT-2-L-sphere.pdf}
        \label{fig:e2e-sphere}
    \end{subfigure}
    \caption{The heat map of $B^\top B$ with the Stiefel manifold (left) and the oblique manifold (right) on the E2E dataset.}
    \label{fig:e2e-heatmap}
\end{figure}
% \section{Convergence analysis}
% We are concerned with optimization over the Stiefel manifold:
% \be \label{prob} \min_{X \in \R^{d\times r}} \;\; f(X)\;\; \st \;\; X \in {\rm St}(d,r):=\{ X\in \R^{d\times r}: X^\top X = I_d \}, \ee
% where $d \leq n$ and $f:\R^{d\times r} \rightarrow \R$ is a twice continuously differentiable function. 

% \subsection{Quadratic penalty of the Stiefel manifold}

\section{Conclusion}
\label{section:conlusion}
Optimization over the Stiefel manifold has been widely used in machine learning tasks. In this work, we develop a retraction-free and penalty parameter-free gradient method, and prove that the generated iterates eventually land on the manifold and achieve optimality simultaneously. We then apply this landing theory to avoid the possible redundancy of LoRA fine-tuning in LLMs. Specifically, we reformulate LoRA fine-tuning as an optimization problem over the Stiefel manifold and propose a new algorithm, Manifold-LoRA, which incorporates a careful analysis of step sizes to enable fast training using the landing properties. Extensive experimental results demonstrate that our approach not only accelerates the training process but also yields significant performance improvements.

Our study suggests several potential directions for future research. Although the established landing theory focuses on the Stiefel manifold, extending this theory to general manifolds is one potential direction. Additionally, evaluating the performance of Manifold-LoRA on LLMs with billions of parameters would be valuable. Due to the heterogeneity of different layers, incorporating adaptive ranks for $\Delta W$ across different layers is another possible direction. This may be achievable by adding sparsity regularization to the coordinate matrix $A$.



% It should be noted that our experiments did not extend to large language models with billions of parameters, such as LLaMa-2. Additionally, one can introduce adaptive rank for $\Delta W$ across different layers, the implementation could potentially be enhanced by incorporating sparsity regularization on matrix $A$.


% Building upon our landing theory, we introduce Manifold-LoRA, which constrains the matrix $B$ within a manifold without incurring additional computational costs associated with retraction. In the update of the matrix $B$, a Riemannian gradient is utilized in place of the conventional Euclidean gradient. 
% Furthermore, we dynamically adjust the learning rates of the two matrices based on their 2-Norms. Extensive experimental results have clearly demonstrated that our approach not only accelerates the training process during fine-tuning but also yields significant performance improvements.

% Different step sizes for $A$ and $B$.

% \section{Limitations and future work}
% In this section, we briefly discuss possible directions for future improvements. The current landing theory is focusing on the Stiefel manifold. Extending such theory to general manifolds is one future direction. It should be noted that our experiments did not extend to large language models with billions of parameters, such as LLaMa-2. Additionally, one can introduce adaptive rank for $\Delta W$ across different layers, the implementation could potentially be enhanced by incorporating sparsity regularization on matrix $A$. 

% {\bf Discussions.} Introduce adaptive rank for $\Delta W$ across different layers, this could be possibly achieved by adding certain sparsity regularization on $A$.

% The possibility of applying manifold regularization to other fine-tuning methods. 

% add limitations

% It should be noted that the experiments of our approach did not extend to large language models with billions parameters, such as LLaMa-2.

% Also, vision task not inclued in our experiments

% Several limitations merit future research. Firstly, the absence of closed-form solutions for the projection operator $\Pcal_{\Mcal}$ particularly for certain manifolds, necessitates exploring methods to calculate projection approximately. Additionally, the selection of step sizes relies on the proximal smoothness constant $\gamma$, underscoring the need for estimating $\gamma$ for specific manifolds. Furthermore, designing algorithms for partial participation and devising corresponding client-drift correction mechanisms require further investigation.
% \section{}
\newpage
\bibliographystyle{plain}

\bibliography{ref.bib}

\newpage
\appendix
\section{Proximal smoothness}
The notion of proximal smoothness, as introduced by \cite{clarke1995proximal}, refers to the property of a closed set whereby the nearest-point projection becomes a singleton when the point is close enough to the set. This property facilitates algorithmic and theoretical advancements by endowing nonconvex sets with convex-like structural attributes. Specifically, for any positive real number $\gamma$, we define the $\gamma$-tube around $\mathcal{M}$ as $
 U_{\mathcal{M}}(\gamma): = \{x:{\rm dist}(x,\mathcal{M}) <  \gamma\}.$ 
We say a closed set $\mathcal{M}$ is $\gamma$-proximally smooth if the projection operator $\Pcal_{\mathcal{M}}(x):=\argmin_{y \in \Mcal} \|y -x\|^2$ is a singleton whenever $x\in U_{\Mcal}(\gamma)$. 

Obviously, any closed and convex set is proximally smooth for arbitrary $\gamma \in (0, \infty)$. According to \cite[Corollary 4.6]{clarke1995proximal}, a closed set $\Mcal$ is convex if and only if it is proximally smooth with a radius of $\gamma$ for every $\gamma > 0$.
It is worth noting that the Stiefel manifold is $1$-proximally smooth. By following the proof in \cite[Theorem 4.8]{clarke1995proximal}, we have
\be \label{eq:lip-proj-alpha}
\left\| \Pcal_{{\rm St}(d,r)} (x) -\Pcal_{{\rm St}(d,r)} (y)\right\| \leq 2 \|x - y\|,~~ \forall x,y \in \bar{U}_{{\rm St}(d,r)}(\frac{1}{2} ), 
\ee
where $\bar{U}_{{\rm St}(d,r)}(\frac{1}{2}):=\{x: {\rm dist}(x,{\rm St}(d,r)) \leq \frac{1}{2}\}$ is the closure of $U_{{\rm St}(d,r)}(\frac{1}{2})$. For comparison, for any closed convex set $\Mcal \subset \R^{d\times r}$, the projection operator $\Pcal_{\Mcal}$ is 1-Lipschitz continuous over $\R^{d\times r}$. 
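
As a concrete illustration, the projection onto the Stiefel manifold can be written in closed form. The following sketch relies on standard facts about the polar decomposition and assumes that $\|\cdot\|$ is the Frobenius norm; the symbols $U$, $S$, $V$, and $s$ are introduced here for illustration. Let $X \in \R^{d\times r}$ have the thin SVD $X = USV^\top$ with $s = \diag(S)$. Then
\[ {\rm dist}(X, {\rm St}(d,r)) = \|s - \mathbf{1}\|_2, \qquad \Pcal_{{\rm St}(d,r)}(X) = UV^\top \quad \text{whenever } s_i > 0 \text{ for all } i. \]
In particular, if ${\rm dist}(X, {\rm St}(d,r)) < 1$, then $|s_i - 1| < 1$ and hence $s_i > 0$ for every $i$, so the projection is unique. This is consistent with the $1$-proximal smoothness stated above.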
% Additionally,  the inequality \eqref{eq:normal-bound} holds with $\iprod{v}{y-x} \leq 0$. Therefore, the inequalities \eqref{eq:lip-proj-alpha} and \eqref{eq:normal-bound} can be considered as  generalizations from the closed convex set to the proximally smooth set.
\section{Proofs} 
\label{appendix:proof}
\textbf{Proof of Lemma 1}
\begin{proof}
    Denote the SVD of $X$ by $X = USV^\top$, so that $\bar{X} = UV^\top$. Then, it holds that ${\rm dist}(X, {\rm St}(d,r)) = \|X - \bar{X}\| =  \|s -1 \|_2$, where $s = \diag(S)$. Furthermore, we have
    \[ \begin{aligned}
        \iprod{\nabla \varphi(X)}{X - \bar{X}} & = \iprod{USV^\top(VS^2V^\top - I)}{USV^\top - UV^\top} \\
        & = \iprod{U(S^3-S)V^\top}{U(S-I)V^\top} \\
        & = {\rm tr}((S^3 - S) (S - I)) \\
        & \geq \frac{3}{2}\|s-1\|_2^2 = \frac{3}{2}\|X - \bar{X}\|^2,
    \end{aligned} \]
    where the inequality follows from $\min_{i} s_i(s_i+1) \geq \frac{105}{64} \geq \frac{3}{2}$, which holds because $s_i \geq \frac{7}{8}$ whenever $\|X - \bar{X}\| \leq \frac{1}{8}$.
    This completes the proof.
\end{proof}
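For completeness, the trace in the last step of the proof above can be expanded entrywise. The following is a sketch under the assumption $\|X - \bar{X}\| \leq \frac{1}{8}$, so that $s_i \geq \frac{7}{8}$ for all $i$:
\[ \begin{aligned}
    {\rm tr}((S^3 - S)(S - I)) & = \sum_{i=1}^r s_i (s_i + 1)(s_i - 1)^2 \\
    & \geq \tfrac{7}{8} \cdot \tfrac{15}{8} \sum_{i=1}^r (s_i - 1)^2 = \tfrac{105}{64} \|s - 1\|_2^2 \geq \tfrac{3}{2} \|s - 1\|_2^2.
\end{aligned} \]
\par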
\textbf{Proof of Lemma 2}
\begin{proof}
    Assume that $\|X_k - \bar{X}_{k} \| \leq \frac{1}{8}$. Denote the SVD of $X_k$ by $USV^\top$. Let $s = {\rm diag}(S)$. Then, we have $\frac{7}{8} \leq s_i \leq \frac{9}{8}$ for any $i$. This implies 
    \be \label{eq:grad-bound} \|\nabla \varphi(X_k)\|^2 = {\rm tr}((S^3 - S)^2) \leq 6 \|X_k -\bar{X}_k\|^2. \ee
    Hence, we have
    \[ \begin{aligned}
        \|X_{k+1} - \bar{X}_{k+1}\|^2 & \leq  \| X_{k+1} - \bar{X}_{k} \|^2 \\
        & = \|X_k - \frac{1}{3}\nabla \varphi(X_k) - \bar{X}_k \|^2 \\
        & = \| X_k -  \bar{X}_k \|^2 - \frac{2}{3}\iprod{X_k -  \bar{X}_k}{\nabla \varphi(X_k)} + \frac{1}{9}\| \nabla \varphi(X_k) \|^2 \\
        & \leq (1 - 1 + \frac{2}{3}) \|X_k - \bar{X}_k\|^2 \\
        & = \frac{2}{3}\|X_k - \bar{X}_k\|^2,
    \end{aligned} \]
    where the first inequality is from $\bar{X}_{k+1} = \argmin_{X \in {\rm St}(d,r)} \|X - X_{k+1}\|^2$ and the second inequality is due to Lemma \ref{lem:rsi} and \eqref{eq:grad-bound}.
\end{proof}
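The constant in \eqref{eq:grad-bound} and the resulting contraction factor can be checked entrywise. The following is a sketch under the standing assumption $\frac{7}{8} \leq s_i \leq \frac{9}{8}$:
\[ \begin{aligned}
    (s_i^3 - s_i)^2 & = s_i^2 (s_i + 1)^2 (s_i - 1)^2 \leq \left( \tfrac{9}{8} \cdot \tfrac{17}{8} \right)^2 (s_i - 1)^2 = \tfrac{23409}{4096} (s_i - 1)^2 \leq 6 (s_i - 1)^2, \\
    1 - \tfrac{2}{3} \cdot \tfrac{3}{2} + \tfrac{1}{9} \cdot 6 & = \tfrac{2}{3}.
\end{aligned} \]
Summing the first line over $i$ gives \eqref{eq:grad-bound}. The second line combines the lower bound $\iprod{\nabla \varphi(X_k)}{X_k - \bar{X}_k} \geq \frac{3}{2} \|X_k - \bar{X}_k\|^2$ with \eqref{eq:grad-bound} to produce the factor $\frac{2}{3}$.
\par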
\textbf{Proof of Lemma 3}
\begin{proof}
    Due to the twice differentiability  of $f$ and the compactness of ${\rm St}(d,r)$, the inequality \eqref{eq:qub} directly follows from \cite[Lemma 2.4]{chen2021decentralized} and \cite[Lemma 4.2]{deng2023decentralized}, where $L:= L_f + D_f$ with $L_f$ being the Lipschitz constant of $\nabla f(X)$ over ${\rm St}(d,r)$ and $D_f:= \max_{X \in {\rm St}(d,r)} \|\nabla f(X)\|$. 

    For the second claim, we have
    \[ \begin{aligned}
        & \| \grad f(X) - \grad f(Y) \| \\
        \leq & \| \Pcal_{T_{X}{\rm St}(d,r)}(\nabla f(X)) - \Pcal_{T_{X}{\rm St}(d,r)}(\nabla f(Y)) \| + \| \Pcal_{T_{X}{\rm St}(d,r)}(\nabla f(Y)) - \grad f(Y) \| \\
        \leq & L_f\|X-Y\| + \frac{1}{2}\| X(X^\top \nabla f(Y) + \nabla f(Y)^\top X) - Y(Y^\top \nabla f(Y) + \nabla f(Y)^\top Y) \| \\
        \leq & L_f\|X -Y\| + \frac{1}{2}\| X((X-Y)^\top \nabla f(Y) + \nabla f(Y)^\top (X-Y))\| \\
        & + \frac{1}{2}\| (X-Y)(Y^\top \nabla f(Y) + \nabla f(Y)^\top Y) \| \\
        \leq & L_f \|X-Y\| + \frac{1}{2}(2\hat{D}_f + 3\hat{D}_f) \|X -Y\| \\
        = & (L_f + \frac{5}{2}\hat{D}_f) \|X - Y\|,
    \end{aligned} \]
    where $\hat{D}_f:= \max_{X \in \bar{U}_{{\rm St}(d,r)}(\frac{1}{8})} \|\nabla f(X)\|$,
    the second inequality is due to the nonexpansiveness of $\Pcal_{T_{X}{\rm St}(d,r)}$, and the last inequality is from the fact that $\|Y\|_2 \leq \frac{3}{2}$. By setting $\hat{L} = L_f + \frac{5}{2}\hat{D}_f$, we complete the proof.
\end{proof}
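The third inequality in the proof above rests on an add-and-subtract identity, written out here for convenience with the shorthand $G := \nabla f(Y)$ (a symbol introduced only for this remark):
\[
    X(X^\top G + G^\top X) - Y(Y^\top G + G^\top Y) = X\big( (X-Y)^\top G + G^\top (X-Y) \big) + (X-Y)(Y^\top G + G^\top Y).
\]
Applying the triangle inequality to the right-hand side yields the two terms that are bounded in the subsequent step.
\par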
\textbf{Proof of Lemma 4}
\vspace{-3.7em}
\begin{proof}
    
    It follows that
    \[ \begin{aligned}
        \|X_{k+1} - \bar{X}_{k+1} \| & \leq \| X_{k+1} - \bar{X}_{k} \| \\
        & \leq \| X_{k}  - \mu \nabla \varphi(X_k) - \bar{X}_{k} \| + \alpha \| \grad f(X_k)\| \\
        & \leq \frac{2}{3} \| X_k - \bar{X}_k \| + \alpha \| \grad f(X_k)\|. 
    \end{aligned} \]
    We complete the proof.
\end{proof}
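The factor $\frac{2}{3}$ in the last step can also be verified directly from the SVD. The sketch below assumes $\mu = \frac{1}{3}$ and $\|X_k - \bar{X}_k\| \leq \frac{1}{8}$, and writes $X_k = USV^\top$ with singular values $s_i \in [\frac{7}{8}, \frac{9}{8}]$; the diagonal matrix $D$ is notation introduced only for this remark:
\[ \begin{aligned}
    X_k - \tfrac{1}{3} \nabla \varphi(X_k) - \bar{X}_k & = U D V^\top, \qquad D_{ii} = (s_i - 1)\left( 1 - \tfrac{1}{3} s_i (s_i + 1) \right), \\
    \tfrac{13}{64} \leq 1 - \tfrac{1}{3} s_i (s_i + 1) & \leq 1 - \tfrac{1}{3} \cdot \tfrac{105}{64} = \tfrac{29}{64} \leq \tfrac{2}{3}.
\end{aligned} \]
Hence $\|X_k - \frac{1}{3} \nabla \varphi(X_k) - \bar{X}_k\| \leq \frac{29}{64} \|X_k - \bar{X}_k\| \leq \frac{2}{3} \|X_k - \bar{X}_k\|$, while the preceding step is the triangle inequality applied to an update of the form $X_{k+1} = X_k - \alpha\, \grad f(X_k) - \mu \nabla \varphi(X_k)$.
\par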
\textbf{Proof of Lemma 5}
\begin{proof}
It follows from \eqref{eq:qub} that
    \be \label{eq:descent-ex} \begin{aligned} 
     & f(\bar{X}_{k+1}) - f(\bar{X}_k) \leq  \iprod{\grad f(\bar{X}_k)}{\bar{X}_{k+1} - \bar{X}_k} + \frac{L}{2}\| \bar{X}_{k+1} - \bar{X}_k \|^2 \\
    \leq & \iprod{\grad f(\bar{X}_k)}{\bar{X}_{k+1} - X_{k+1} + X_k- \bar{X}_k} + \iprod{\grad f(\bar{X}_k)}{X_{k+1} - X_k} \\
    & + 2L \| X_{k+1} - X_k \|^2 \\
    \leq &  \iprod{\grad f(\bar{X}_k)}{\bar{X}_{k+1} - X_{k+1}} + \iprod{\grad f(\bar{X}_k)}{X_{k+1} - X_k} \\
    & + 4L (\alpha^2 \|\grad f(X_k)\|^2 + \mu^2 \|\nabla \varphi(X_k)\|^2) \\
    = &  \iprod{\grad f(\bar{X}_k) - \grad f(\bar{X}_{k+1}) }{\bar{X}_{k+1} - X_{k+1}} + \iprod{\grad f(X_k) }{X_{k+1} - X_k} \\
    & + \iprod{\grad f(\bar{X}_k) - \grad f(X_k)}{X_{k+1} - X_k} \\
    & + 4L (\alpha^2 \|\grad f(X_k)\|^2 + \mu^2 \|\nabla \varphi(X_k)\|^2) \\
    \leq & 2\hat{L}^2 \|X_{k+1} - X_k\|^2 + \frac{1}{2} \|X_{k+1} - \bar{X}_{k+1}\|^2 - \alpha \|\grad f(X_k)\|^2 \\
    & - \mu \iprod{\grad f(X_k)}{\nabla \varphi(X_k)} + \frac{1}{2}(\hat{L}^2\|X_k -\bar{X}_k\|^2 + \|X_{k+1} - X_k\|^2) \\
    & + 4L (\alpha^2 \|\grad f(X_k)\|^2 + \mu^2 \|\nabla \varphi(X_k)\|^2) \\
    \leq & - \alpha \|\grad f(X_k)\|^2 - \mu \iprod{\nabla f(X_k)}{\calP_{T_{X_k} {\rm St}(d,r)} (\nabla \varphi(X_k))} + \frac{1}{2}\|X_{k+1} - \bar{X}_{k+1}\|^2 \\
    & + \frac{1}{2}\|X_{k} - \bar{X}_{k}\|^2 + (4\hat{L}^2 + 4L + 1) ( \alpha^2 \| \grad f(X_k) \|^2 + \mu^2 \|\nabla \varphi(X_k)\|^2) \\
    \leq & - (\alpha - (4\hat{L}^2 + 4L + 1) \alpha^2) \|\grad f(X_k)\|^2 + \frac{1}{2}\|X_{k+1} - \bar{X}_{k+1}\|^2 \\
    & + (6 \mu \hat{D}_f + \frac{1}{2} + 16(4\hat{L}^2 + 4L + 1)\mu^2) \|X_k - \bar{X}_k\|^2,
    \end{aligned} \ee
    where the second inequality is from the 2-Lipschitz continuity of $\Pcal_{{\rm St}(d,r)}$ over $\bar{U}_{{\rm St}(d,r)}(\frac{1}{8})$, the third inequality is due to the facts that $X_k - \bar{X}_k \in N_{\bar{X}_k} {\rm St}(d,r)$ and $\iprod{A}{B} \leq \frac{1}{2}(\|A\|^2 + \|B\|^2)$ for any $A, B \in \R^{d \times r}$, and the last inequality comes from 
    \[ \|\calP_{T_{X_k} {\rm St}(d,r)} (\nabla \varphi(X_k)) \| = \|X_k(X_k^\top X_k - I)^2\| \leq 6 \|X_k - \bar{X}_k\|^2. \]
    Plugging $\mu = \frac{1}{3}$ into \eqref{eq:descent-ex} gives \eqref{eq:descent}.
\end{proof}
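The identity and bound for the projected penalty gradient used in the last step can be unpacked as follows. This is a sketch assuming the tangent-space formula $\calP_{T_X {\rm St}(d,r)}(G) = G - \frac{1}{2} X (X^\top G + G^\top X)$ that appears in the proof of Lemma 3, together with $\nabla \varphi(X) = X(X^\top X - I)$ and singular values $s_i \in [\frac{7}{8}, \frac{9}{8}]$ of $X$. Since $X^\top \nabla \varphi(X) = X^\top X (X^\top X - I)$ is symmetric,
\[ \begin{aligned}
    \calP_{T_X {\rm St}(d,r)}(\nabla \varphi(X)) & = X(X^\top X - I) - X X^\top X (X^\top X - I) = - X (X^\top X - I)^2, \\
    \|X (X^\top X - I)^2\| & = \Big( \sum_{i=1}^r s_i^2 (s_i + 1)^4 (s_i - 1)^4 \Big)^{1/2} \leq \tfrac{9}{8} \left( \tfrac{17}{8} \right)^2 \sum_{i=1}^r (s_i - 1)^2 \\
    & = \tfrac{2601}{512} \|X - \bar{X}\|^2 \leq 6 \|X - \bar{X}\|^2.
\end{aligned} \]
\par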
\textbf{Proof of Theorem}
\begin{proof}
    Applying \cite[Lemma 2]{xu2015augmented} to \eqref{eq:err-feasi} yields
    \be \label{eq:feasi-grad}  \sum_{k=0}^K \|X_k - \bar{X}_{k}\|^2 \leq 18 \alpha^2 \sum_{k=0}^K \|\grad f(X_k)\|^2 + 4.  \ee
    Then, summing \eqref{eq:descent} over $k=0, \ldots, K$ gives 
    \be \label{eq:descent2} 
    \begin{aligned}
        & f(\bar{X}_{K+1}) - f(\bar{X}_0) \\
        \leq & - (\alpha - (4\hat{L}^2 + 4L + 1) \alpha^2) \sum_{k=0}^K \|\grad f(X_k)\|^2 \\
        & + \frac{1}{2}\left(4\hat{D}_f + 16\hat{L}^2 + 16 L + 3 \right) \sum_{k=0}^{K+1} \|X_k - \bar{X}_k\|^2 \\
        \leq & - (\alpha - (4\hat{L}^2 + 4L + 1) \alpha^2 -  9(4\hat{D}_f + 16\hat{L}^2 + 16 L + 3 )\alpha^2 ) \sum_{k=0}^K \|\grad f(X_k)\|^2 \\
        & + \frac{1}{2} \left(4\hat{D}_f + 16\hat{L}^2 + 16 L + 3 \right)(18 \alpha^2 \|\grad f(X_{K+1})\|^2 + 4).
    \end{aligned}
    \ee
    Define $c_1 = 148 \hat{L}^2 + 148 L + 36 \hat{D}_f + 28$ and $c_2 = (9 \hat{D}_f^2 + 2)(4\hat{D}_f + 16\hat{L}^2 + 16 L + 4)$. Then, we have
    \[
    \alpha(1 - c_1 \alpha) \sum_{k=0}^K \|\grad f(X_k)\|^2 \leq f(\bar{X}_0) - f(\bar{X}_{K+1}) + c_2. 
    \]
    Therefore, for any $\alpha \leq \frac{1}{2c_1}$, taking $K \rightarrow \infty$ gives $\sum_{k=0}^\infty \|\grad f(X_k)\|^2 < \infty$, since $f$ is bounded below on the compact set ${\rm St}(d,r)$. Then, by \eqref{eq:feasi-grad}, $\sum_{k=0}^\infty \|X_k - \bar{X}_{k}\|^2 < \infty$. Together, these lead to \eqref{eq:complexity}. 
\end{proof}
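
The constant $c_1$ in the proof above collects the two $\alpha^2$ contributions, and the final inequality yields a sublinear bound on the smallest gradient norm. The following is a sketch in which $f^* := \min_{X \in {\rm St}(d,r)} f(X)$ is notation introduced here (finite by compactness) and $0 < \alpha \leq \frac{1}{2c_1}$, so that $\alpha(1 - c_1 \alpha) \geq \frac{\alpha}{2}$:
\[ \begin{aligned}
    c_1 & = (4\hat{L}^2 + 4L + 1) + 9(4\hat{D}_f + 16\hat{L}^2 + 16L + 3) = 148 \hat{L}^2 + 148 L + 36 \hat{D}_f + 28, \\
    \min_{0 \leq k \leq K} \|\grad f(X_k)\|^2 & \leq \frac{1}{K+1} \sum_{k=0}^K \|\grad f(X_k)\|^2 \leq \frac{2 \left( f(\bar{X}_0) - f^* + c_2 \right)}{\alpha (K+1)}.
\end{aligned} \]
This is only one way to read off a rate from the summability argument; the precise statement is the one in \eqref{eq:complexity}.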

\section{Hyperparameters}
\label{appendix:hyper-params}
\begin{table}[htbp]
  \centering
  \caption{Hyperparameter setup of Manifold-LoRA for question answering tasks.}
  \label{tab:squad-hyper}
  \begin{tabular}{l l l l }
    \toprule
    Method & Hyperparameter & SQuADv1.1 & SQuADv2.0  \\
    \midrule
    & Warmup Ratio & \multicolumn{2}{c}{0.06} \\
    & LR Schedule & \multicolumn{2}{c}{Linear} \\
    & Weight Decay & \multicolumn{2}{c}{0.1} \\
    & $\beta_1$ & \multicolumn{2}{c}{0.9} \\
    & $\beta_2$ & \multicolumn{2}{c}{0.999} \\
    & Batch Size & \multicolumn{2}{c}{64} \\
    & Learning Rate & \multicolumn{2}{c}{3e-3} \\
    & Epochs & \multicolumn{2}{c}{4} \\
    \midrule
    Sphere(r=8) & $\mu$ & 0.85 & 0.85 \\
    & Lower & 0.25 & 0.25 \\
    & Upper & 0.75 & 0.5 \\
    \midrule
    Sphere(r=16) & $\mu$ & 0.9 & 0.85 \\
    & Lower & 0.25 & 0.25 \\
    & Upper & 0.5 & 0.5 \\
    \midrule
    Stiefel(r=8) & $\mu$ & 0.85 & 0.85 \\
    & Lower & 0.25 & 0.25 \\
    & Upper & 0.5 & 0.5 \\
    \midrule
    Stiefel(r=16) & $\mu$ & 0.9 & 0.85 \\
    & Lower & 0.25 & 0.25 \\
    & Upper & 0.5 & 0.5 \\
    \bottomrule
  \end{tabular}
\end{table}
\begin{table}[htbp]
  \centering
  \caption{Hyperparameter configurations of Manifold-LoRA for the GLUE benchmark.}
  \label{tab:glue_hyperparams}
  \begin{tabular}{cccccccccc}
    \toprule
     Method & Hyperparameter & MNLI & SST-2 & CoLA & QQP & QNLI & RTE & MRPC & STS-B \\
    \midrule
    & Warmup Ratio & \multicolumn{8}{c}{0.06} \\
    & LR Schedule & \multicolumn{8}{c}{Linear} \\
    & Max Sequence Length & \multicolumn{8}{c}{256}\\
    & Weight Decay & \multicolumn{8}{c}{0.1}  \\
    & $\beta_1$ & \multicolumn{8}{c}{0.9} \\
    & $\beta_2$ &\multicolumn{8}{c}{0.999} \\
    & Batch Size &\multicolumn{8}{c}{32} \\
    & LoRA Layer & \multicolumn{8}{c}{$W_q, W_v$} \\
    & Epochs & 7 & 24 & 25 & 5 & 5 & 50 & 30 & 25 \\
    & Learning rate & 5e-4 & 8e-4 & 5e-4 & 5e-4 & 1.2e-3 & 1.2e-3 & 1e-3 & 2.2e-3 \\
    \midrule
    
    Sphere(r=16) & $\mu$ & 1 & 0.9 & 0.8 & 0.9 & 0.95 &  1.2 & 0.85 & 0.9 \\
    & Lower & 0.25 & 0.25 & 0.5 & 0.5 & 0.5 & 0.5 & 1 & 1  \\
    & Upper & 2 & 2 & 2 & 4 & 2 & 2 & 4 & 4  \\
    \midrule
    Sphere(r=8) & $\mu$ & 0.95 & 0.95 & 1 & 0.9 & 1 &  0.9 & 0.85 & 1 \\
    & Lower & 2 & 0.5 & 1 & 0.5 & 0.5 & 0.25 & 2 & 1  \\
    & Upper & 8 & 2 & 8 & 2 & 2 & 0.5 & 4 & 8  \\
    \midrule
    Stiefel(r=16) & $\mu$ & 0.8 & 0.85 & 0.95 & 0.9 & 0.95 &  1.2 & 0.8 & 1 \\
    & Lower & 2 & 0.5 & 2 & 0.5 & 0.5 & 0.5 & 1 & 1  \\
    & Upper & 8 & 1 & 8 & 4 & 1 & 2 & 4 & 16  \\
    \midrule
    Stiefel(r=8) & $\mu$ & 0.8 & 0.95 & 0.95 & 0.9 & 0.85 &  0.9 & 1 & 1 \\
    & Lower & 2 & 0.5 & 2 & 0.5 & 0.5 & 0.25 & 1 & 1  \\
    & Upper & 8 & 2 & 8 & 2 & 2 & 1 & 4 & 16  \\
    \bottomrule
  \end{tabular}
\end{table}
% \begin{figure}
%     \begin{subfigure}[t]{0.45\textwidth}
%         \centering
%         \includegraphics[width=\linewidth, ]{e2e/GPT-2-L-Stefiel.pdf}
%         \caption{heat map of $B^\top B$ at the 1000 step with Stiefel constrained}
%         \label{fig:e2e-Stiefel}
%     \end{subfigure}
%     \hfill
%     \begin{subfigure}[t]{0.45\textwidth}
%         \centering
%         \includegraphics[width=\linewidth, ]{e2e/GPT-2-L-sphere.pdf}
%         \caption{heat map of $B^\top B$ at the 1000 step with sphere constrained}
%         \label{fig:e2e-sphere}
%     \end{subfigure}
%     \caption{The figure demonstrates that, in the early stages of training, the matrix $B^\top B$y validating the feasibility and efficacy of our proposed method.}
%     \label{fig:e2e-heatmap}
% \end{figure}
% \begin{figure}
%     \centering
%     \includegraphics
%     [width=\textwidth]{glue/mrpc-sphere.pdf}
%     \caption{The heat map of $B^\top B$ with sphere constrained on MRPC dataset. The elements along the diagonal of $B^\top B$ are nearly 1}
%     \label{fig:mrpc-sphere}
% \end{figure}
% \begin{figure}
%     \centering
%     \includegraphics
%     [width=0.9\textwidth]{glue/stsb-Stefiel.pdf}
%     \caption{Elements of $B^\top B$ at the last iteration. We see that $B$ is approximately orthogonal since $B^\top B \approx I$.}
%     \label{fig:stsb-Stiefel}
% \end{figure}

\begin{table}[htbp]
  \centering
  \caption{Hyperparameter setup of Manifold-LoRA for the E2E benchmark.}
  \label{tab:e2e-hyper}
  \begin{tabular}{l l l l}
    \toprule
    Method & Hyperparameter & GPT-2(M) & GPT-2(L)  \\
    \midrule
    & Warmup Steps & \multicolumn{2}{c}{500} \\
    & LR Schedule & \multicolumn{2}{c}{Linear} \\
    & Weight Decay & \multicolumn{2}{c}{0.01} \\
    & $\beta_1$ & \multicolumn{2}{c}{0.9} \\
    & $\beta_2$ & \multicolumn{2}{c}{0.999} \\
    & LoRA dropout & \multicolumn{2}{c}{0} \\ 
    & Batch Size & \multicolumn{2}{c}{8} \\
    & Learning Rate & \multicolumn{2}{c}{2e-4} \\
    & Epochs & \multicolumn{2}{c}{5} \\
    \midrule
    Sphere(r=4) & $\mu$ & 1 & 0.9 \\
    & Lower & 0.5 & 0.5 \\
    & Upper & 2 & 2 \\
    \midrule
    % Sphere(r=8) & $\mu$ & 1 & 1 \\
    % & Lower & 0.5 & 0.5 \\
    % & Upper & 2 & 2 \\
    % \midrule
    Stiefel(r=4) & $\mu$ & 1 & 1.1 \\
    & Lower & 0.5 & 0.5 \\
    & Upper & 4 & 2 \\
    % \midrule
    % Stiefel(r=8) & $\mu$ & 1 & 0.9 \\
    % & Lower & 0.5 & 1 \\
    % & Upper & 2 & 1 \\
    \bottomrule
  \end{tabular}
\end{table}
\newpage
\input{checklist}
\end{document}

