\section{Pareto Navigation Gradient Descent for OPT-in-Pareto} \label{sec: algo}
%We discuss how to solve the OPT-in-Pareto problem. For simplicity, we consider the standard form of OPT-in-Pareto in \eqref{equ: main_problem} and the generalization to \eqref{equ: main_problem_multi} is straightforward. We start with discussing the second order approach, i.e. manifold gradient descent, that can be applied to solve OPT-in-Pareto. We then introduce the proposed first order algorithm named 
We now introduce our main algorithm, Pareto Navigation Gradient Descent (\PNG), which provides a practical approach to OPT-in-Pareto. 
%a first order algorithm that approximately solve \eqref{equ: main_problem} and \eqref{equ: main_problem_multi} 
%by using only gradient information of $F$ and $\{\ell_\i\}$. 
%The idea is to iteratively update $\cc$ in a way that ensures that 
%1) $\cc$
%it moves $\cc$ towards the Pareto set when it is far way from it and 2) it decreases $F$ when $\cc$ is in close to the Pareto set. 
For convenience, we focus on the single-point problem in \eqref{equ: main_problem} in the presentation. 
The generalization to the multi-point problem in \eqref{equ: main_problem_multi} is straightforward.  
We first introduce the main idea and then present the theoretical analysis in Section~\ref{sec: theory}. 

%Our method performs iterative updates of form 
We consider the general incremental updating rule of the form 
$$
\cc_{\k+1} \gets \cc_\k - \xi v_\k,
$$
where $\xi$ is the step size 
and $v_\k$ is an update direction that we shall choose to 
achieve the following desiderata in balancing the descent of $\{\ell_\i\}$ and $F$: 

i) When $\cc_\k$ is far away from the Pareto set, we want to choose $v_\k$ to give Pareto improvement to $\cc_\k$, moving it towards the Pareto set. The amount of Pareto improvement may depend on how far $\cc_\k$ is from the Pareto set.

ii) If the directions that yield Pareto improvement are not unique, we want to choose the Pareto improvement direction that decreases $F(\cc)$ most. 
%Meanwhile, we want to optimize $\cc$. 

iii) When $\cc_\k$ is very close to the Pareto set,  
e.g., having a small  $g(\cc)$, 
%\ep$ for some tolerance parameter $\ep$, 
we want to fully optimize $F(\cc)$.  


%sufficiently close to 
% (i.e., simutanesouly )


%Our idea is based on a simple intuition inspired by MGD: if we move $\th$ towards the direction that decreases the losses of all the tasks, we eventual end up with $\th \in \P_\epsilon$. It suggests that the task losses $\{\ell_\i\}$ can be a desirable proxy to move $\th$ towards $\P_\epsilon$. Specifically, initializing with $\th_0$, at step $\k$, we update the parameter $\th_{\k}$ by $\th_{\k+1}=\th_{\k}-\xi d(\th_{\k},\alpha_\k)$ along an update direction $d(\th_{\k},\alpha_\k)$, which should balance between moving $\th_\k$ towards $P_\epsilon$, and minimizing the criterion $F$ with an additional parameter $\alpha_\k$. We formulate the choice of $d(\th, \alpha)$ into the following sub-problem:
%such that the loss decreases fastest and $\th_{\k+1}$ is still within $P_\epsilon$. Ideally, one way to obtain $d(\th)$ is by solving the following sub-problem each iteration: 
%of $\th$ when it is at boundary directly using the task losses
%rather than $g(\th)$. This leads to the following updating direction:
We achieve the desiderata above by using the $v_\k$ that solves the following optimization problem: 
\begin{align} \label{opt: relax}
& v_\k =\argmin_{v \in \mathbb{R}^n}\left \{  \frac12\left\Vert \nabla F(\cc_\k)-v \right\Vert ^{2} \right\}
\\
\nonumber
&\text{s.t.  }
\nabla_{\th}\ell_\i(\cc_\k) \tt v
\ge \phi_\k,  ~~~~\forall \i\in [m], 
\end{align}
where we want $v_\k$ to be as close to $\nabla F(\cc_\k)$ as possible (and hence to decrease $F$ the most), subject to the constraint that the decreasing rates $\nabla_{\th}\ell_\i(\cc_\k) \tt v_\k$ of all losses $\ell_\i$ 
are lower bounded by a \emph{control parameter} $\phi_\k$. 
A positive $\phi_\k$ enforces that $\nabla_{\th}\ell_\i(\cc_\k) \tt v_\k$ is positive for all $\ell_\i$, 
hence ensuring a Pareto improvement when the step size is sufficiently small. The magnitude of $\phi_\k$ controls 
how much Pareto improvement we enforce, so we may want to gradually decrease $\phi_\k$ as we move closer to the Pareto set. 
%If $\phi_\k$ is positive (and large) when $\th$ is away from $\P_\epsilon$ so that all $\ell_\i$ are decreased monotonically to move $\th$ towards $\P_\epsilon$; and is small when $\th$ is in the interior of $\P_\epsilon$ so that $d(\th)$ can be made close to $\nabla F(\th)$ to decrease $F(\th)$.\qq{do we need to include alpha in $\phi(\th,\alpha)$ and $d(\theta,\alpha)$?}
%\qq{use $\phi_\i$ and $$}
%The particular choice of $\phi$ can be flexible.
%
In fact, varying $\phi_\k$ interpolates between vanilla gradient descent on $F$ and MGD on $\{\ell_\i\}$: 

i) If $\phi_\k = -\infty$, we have $v_\k = \dd F(\cc_\k)$, and the update is pure gradient descent on $F$ that ignores $\{\ell_\i\}$. 

%reduces to standard gradient descent on $F$). We allow different $\alpha_\k$ at different steps.

ii) If $\phi_\k \to +\infty$, 
 then 
 $v_\k$ approaches, up to rescaling, the MGD direction of 
 $\{\ell_\i\}$ in \eqref{equ: update mgd}, 
 without considering $F$. 
%$d(\th,\alpha)/\alpha $ approaches to the updating direction of; 

In this work, we propose to choose $\phi_\k$ based on the minimum gradient norm $g(\cc_\k)$ in \eqref{equ: pareto stationary} as a surrogate indication of Pareto local optimality. In particular, we consider the following simple design: 
\begin{align}\label{equ:phi}
\phi_\k = \begin{cases} 
-\infty & \text{if $g(\cc_\k) \leq \ep$},  \\
%\begin{}-\infty\mathbb{I}\{g(\th)\le r\epsilon\}+
\alpha_\k g(\cc_\k) & \text{if $g(\cc_\k) >  \ep$},
%\mathbb{I}\{g(\th)>r\epsilon\}.
\end{cases} 
\end{align}
where $\ep$ is a small tolerance parameter 
and $\alpha_\k$ is a positive hyper-parameter.  
%Here we turn off 
When $g(\cc_\k) > \ep$, 
we set $\phi_\k$ to be proportional to $g(\cc_\k)$, so that the enforced Pareto improvement scales with how far $\cc_\k$ is from the Pareto set. 
When $g(\cc_\k) \leq \ep$, 
we set $\phi_\k =-\infty$, which ``turns off'' the control and hence fully optimizes $F(\cc)$. 
%when 

%where $r \in (0,1)$ and $\alpha > 0$. 
%When $g(\th) > r\epsilon$, 
%it imposes a constraint of $\left\langle \nabla_{\th}\ell_\i(\th),d\right\rangle \ge \alpha g(\th) > r\epsilon$, ensuring that all $\ell_\i$ are decreased so that $\theta$ moves towards the Pareto set; 
%when $g(\th)\leq r \epsilon$, it turns off the constraints with $\phi(\th)=-\infty$, which allows us to take $d(\th, \alpha)= \nabla F(\theta)$ and hence fully optimize $F$.  Here $r$ is a ``buffering" parameter, which ensures that the final solution approximately solves \eqref{equ: relax_problem}. Intuitively, we only allow the algorithm to move $\th$ freely (i.e., $\phi(\th,\alpha)=-\infty$) when $\th$ is inside $\P_\epsilon$ and is far from the boundary. See Section \ref{sec: theory} for more details. $\alpha $ controls the decreasing rate of $\ell_\i(\th)$ when $g(\th) > r \epsilon$. In fact, it controls the trade-off between purely optimizing $F$ vs. MGD (when $\alpha = \infty$, $d(\th,\alpha)/\alpha$ approaches to the MGD direction; 
%and $\alpha = -\infty$ reduces to standard gradient descent of $F$). We allow different $\alpha_\k$ at different steps.

\iffalse 
\begin{figure}
\begin{centering}
\includegraphics[scale=0.31]{fig/Pngrad.pdf}
\par\end{centering}
\caption{Typical Trajectory} \label{fig: illustration}
\end{figure}
\fi 

\begin{algorithm*}[t]
\begin{algorithmic}[1]
\State{Initialize $\theta_0$; choose the
    step size $\xi$ and the control function $\phi$ in \eqref{equ:phi} (including the threshold $\ep >0$ and the descending rate $\{\alpha_\k\}$).}
\For{iteration $\k$}
    \vspace{-0.4cm}
    \bbb\label{equ:vk00}   \theta_{\k+1} \gets \theta_\k - \xi v_\k, && 
 v_\k = \nabla F(\theta_\k) + \textstyle{\sum}_{\i=1}^m \lambda_{\i,\k} \nabla \ell_\i(\theta_\k) ,
    \vspace{-0.2cm}
    \eee 
    where $\lambda_{\i,\k} =0,~\forall \i\in[m]$ if $g(\th_\k) \leq \ep$, and $\{\lambda_{\i,\k}\}_{\i=1}^m$ is the solution of (\ref{equ: dual}) with $\phi_\k=\alpha_\k g(\th_\k)$ when  $g(\th_\k) > \ep$.
\EndFor
\end{algorithmic}\caption{Pareto Navigation Gradient Descent}\label{alg:main}
\end{algorithm*}

In practice, the optimization in \eqref{opt: relax} can be solved efficiently via its dual form, as follows. 
%Theorem \ref{thm: sol} shows how to compute the solution of (\ref{opt: relax}) based on Lagrangian duality theory.
\begin{theorem} \label{thm: sol}
The solution $v_\k$ of \eqref{opt: relax},  
if it exists, 
has the form 
\bbb \label{equ:vk0}
v_\k = \dd F(\cc_\k) + \sum_{\i=1}^m \lambda_{\i,\k} \dd \ell_\i(\cc_\k),
\eee 
with $\{\lambda_{\i,\k}\}_{\i=1}^m$ 
the solution of the following dual problem 
\bbb \label{equ: dual}
\max_{\lambda\in\RRplus^m}-\frac{1}{2}\left\Vert \nabla F(\cc_\k)+\sum_{\i=1}^{m}\lambda_{\i}\nabla\ell_\i(\cc_\k)\right\Vert ^{2}+\sum_{\i=1}^{m}\lambda_\i\phi_\k. 
\eee 
\end{theorem}
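For intuition, the form \eqref{equ:vk0} and the dual \eqref{equ: dual} can be recovered from a standard Lagrangian argument. The following is only a sketch: it assumes that $\phi_\k$ is finite and that \eqref{opt: relax} is feasible, and the symbol $J$ is introduced here purely for illustration. With multipliers $\lambda_\i \geq 0$ for the constraints in \eqref{opt: relax}, define
\begin{align*}
J(v,\lambda) = \frac12\left\Vert \dd F(\cc_\k) - v\right\Vert^2 - \sum_{\i=1}^m \lambda_\i \left( \dd \ell_\i(\cc_\k) \tt v - \phi_\k\right).
\end{align*}
Setting the gradient of $J$ with respect to $v$ to zero gives $v = \dd F(\cc_\k) + \sum_{\i=1}^m \lambda_\i \dd \ell_\i(\cc_\k)$, which is the form in \eqref{equ:vk0}. Substituting this $v$ back into $J$ yields
\begin{align*}
\min_{v} J(v,\lambda) = -\frac12\left\Vert \dd F(\cc_\k) + \sum_{\i=1}^m \lambda_\i \dd \ell_\i(\cc_\k)\right\Vert^2 + \sum_{\i=1}^m \lambda_\i \phi_\k + \frac12\left\Vert \dd F(\cc_\k)\right\Vert^2,
\end{align*}
which agrees with the objective of \eqref{equ: dual} up to the last term, a constant that does not depend on $\lambda$.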

The optimization in \eqref{equ: dual} can be solved efficiently for a small $m$ (e.g., $m\leq 10$), which is the case for typical applications. 
%We summarize 
%\textbf{Practical Implementation}
We include the details of the practical implementation in Algorithm~\ref{alg:main}.
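
As an illustration, consider the special case $m=1$ with a finite $\phi_\k$ and $\dd \ell_1(\cc_\k) \neq 0$ (a simplification made here only for exposition). The dual \eqref{equ: dual} is then a concave quadratic in a single scalar $\lambda_{1,\k}\geq 0$, whose maximizer is
\begin{align*}
\lambda_{1,\k} = \max\left\{0,~ \frac{\phi_\k - \dd \ell_1(\cc_\k) \tt \dd F(\cc_\k)}{\left\Vert \dd \ell_1(\cc_\k)\right\Vert^2}\right\},
\end{align*}
so that $\dd \ell_1(\cc_\k) \tt v_\k = \max\{\dd \ell_1(\cc_\k) \tt \dd F(\cc_\k),~ \phi_\k\}$. In words, $v_\k = \dd F(\cc_\k)$ whenever the gradient of $F$ already decreases $\ell_1$ at a rate of at least $\phi_\k$; otherwise, the smallest multiple of $\dd \ell_1(\cc_\k)$ that meets the constraint with equality is added.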


%Intuitively, the criterion objective $F$ navigates the algorithm to find a desired model in Pareto set and thus we name our algorithm Navigated Pareto Optimization (NPO).
% \paragraph{Remark}



%Here $r \in (0,1)$ is a "buffering" parameter indicating whether $\th$ is close to the boundary of $\P_\epsilon$. Intuitively, we only allow the algorithm to move $\th$ freely (i.e., $\phi(\th,\alpha)=-\infty$) when $\th$ is in the interior of $\P_\epsilon$ and is far away enough from the boundary so that $\th$ is still in $\P_\epsilon$ after updating. $\alpha $ controls the decreasing rate of $\ell_\i(\th)$ when $g(\th) > r \epsilon$. In fact, it controls the trade-off between purely optimizing $F$ vs. MGD (when $\alpha = \infty$, $d(\th,\alpha)/\alpha$ approaches to the updating direction of; and $\alpha = -\infty$ reduces to standard gradient descent on $F$). We allow different $\alpha_\k$ at different steps.

%We want to point out that control the movement of $\th$ based on $\left\langle \nabla_{\th}\ell_\i(\th),d\right\rangle$ is not guaranteed to ensure the final solution $\th\in\P_\epsilon$. But it ensures that the final solution is within a slightly different set that is as good as $\P_\epsilon$ in terms of the performances on each task. We defer the detail to Section \ref{sec: theory}.

%\textbf{Avoid Hessian Computation by Relaxation}We avoid Hessian computation by expanding the feasible set, allowing $\th$ to be at a region that is outside but close to $\P$. Specifically, we consider minimizing $F$ within the $\epsilon$-envelop of the Pareto stationary set $\P_{\epsilon}$:
%\begin{eqnarray}\label{equ: relax_problem}{\min}_{\th\in \P_\epsilon}F(\th),\ \ \ \  \P_{\epsilon}:=\left\{ \th:\ {g(\th)}\le\epsilon\right\}.\end{eqnarray}
%This relaxation expands the constraint space from a manifold with a strictly smaller dimension than the original parameter space to a subset with the same dimension as the parameter space, eventually getting rid of calculating the tangent space and thus Hessian computation. This slight relaxation has two advantages: 1. In the manifold gradient descent, ensuring $\th_\k \in \P$ for every step seems too restrictive, since, we only require the model at convergence is within $\P$. Allowing $\th$ to leave $\P$ during the optimization can be beneficial to the optimization; 2. The accuracy loss caused by the relaxation is fully controllable and can even be negligible\footnote{
%Actually, in practice, any gradient-based approach using finite order Taylor approximation with non-infinitesimal learning rate always has a higher-order error term, which makes it almost impossible to find a model in $\P_0 = \P$.
%} since we can choose or dynamically adjust $\epsilon$ during the optimization\footnote{As a starting point, for this paper, we consider a fixed $\epsilon$ for simplicity.} to ensure $\epsilon$ is small enough at the convergence.

%\textbf{A Proxy Control for Solving the Constraint Optimization}Solving the constraint optimization in \eqref{equ: relax_problem} is still challenging. The main difficulty is in that the constraint set $\P_\epsilon$ is defined based on a complicated function $g(\th)$ implicitly defined by solving another constraint optimization in \eqref{equ: pareto stationary}. Accessing its geometric property (i.e., finding the projection on the constraint set) and gradient information (i.e., calculating $\nabla_\th g(\th)$) is difficult and expensive. As a consequence, classical techniques such as projected gradient descent, penalty method, Lagrange multipliers based approach or sequential quadratic programming are no more applicable. We consider a control-type algorithm for the problem. Intuitively, we aim to update the parameter $\th$ such that
%\begin{enumerate}
    %\item[1:] When $\th$ is outside $\P_\epsilon$, we find an updating direction that decreases $F$ most (or increases $F$ least if we can not decrease it) among all the directions that push $\th$ to move towards $\P_\epsilon$.
    
    %\item[2:] When $\th$ is in the interior of $\P_\epsilon$, we choose an updating direction that decreases $F$ most.
    
    %\item[3:] When $\th$ is close to the boundary of $\P_\epsilon$, we find an updating direction that decreases $F$ most among all the directions that do not move $\th$ outside $\P_\epsilon$.
%\end{enumerate}
%Note that the updating direction in bullet points 2 and 3 can be a 0 displacement, i.e. a termination if all feasible directions can not decrease $F$. As discussed before, finding directions that pushes $\th$ towards $\P_\epsilon$ or does not move $\th$ outside of $\P_\epsilon$ is difficult, as it is hard to access $\nabla_\th g(\th)$. A different mechanism other than the gradient information of $g(\th)$ is needed.


%\textbf{Solve OPT-in-Pareto by Manifold Gradient Descent} By viewing $\P$ as a manifold on the original parameter space $\Theta$, a natural and straightforward solution is to deploy manifold gradient descent \citep{bonnabel2013stochastic} to solve OPT-in-Pareto, in which we move $\th$ within the manifold, toward the steepest direction decreases $F$. Starting with $\th\in \P$, manifold gradient descent iteratively updates $\th$ along the direction of the projection of $\nabla F(\th)$ on tangent space $T(\th)$ at $\th$, i.e., at step $\k$, we update $\th_\k$ via
%\[\th_{\k+1}=\th_\k-\xi\text{Proj}_{T(\th_\k)}(\nabla F(\th_\k)).\]
%With infinitesimal learning rate $\xi$, we ensure that $\th_\k\in P$ for any $\k$. 
%The key issue of manifold gradient descent is the computation of the tangent space $T(\th_\k)$, the calculation of which requires the Hessian matrix w.r.t. the losses of the tasks \citep{hillermeier2001generalized}. Despite several techniques such as Krylov subspace iteration \citep{ma2020efficient} or conjugate gradient descent \citep{koh2017understanding} can be applied to reduce the computational cost, computing Hessian is still quite expensive in deep learning.
% Notice that $P$ can be viewed as a manifold in model space, and thus
% a natural while expensive algorithm is to apply gradient descent on
% manifold. Start with $\th_{0}\in P$ and at step $\k$, we first calculate
% the Tangent space $T(\th_{\k})$ of $P$ and then the direction $d(\th_{\k})$
% for updating can be obtained by solving the following problem: 
% \[
% \max_{\left\Vert d\right\Vert =1}\left\langle d,\nabla F(\th_{\k})\right\rangle \ \text{s.t.}\ d\in T(\th_{\k}).
% \]
% We then update the parameter by $\th_{\k+1}\leftarrow\th_{\k}-\xi d(\th_{\k})$
% with learning rate $\xi$. The key issue of this algorithm lies in
% the difficulty of calculating the Tangent space, which involves the
% calculating of Hessian matrix. Although there are several ways such
% as {[}xxxx{]} that can reduce the computational cost, it is still
% inefficient and not likely scalable.

\section{Theoretical Properties} \label{sec: theory}

We theoretically quantify how {\PNG} 
is guaranteed to i) move the solution towards the Pareto set (Theorem~\ref{thm:odeell}); 
and ii) optimize $F$ in a neighborhood of the Pareto set (Theorem~\ref{thm:odef}). 
%We study the theoretical property of our method.  
%The goal is to answer two questions for general non-convex $\{\ell_\i\}$ and $F$:  
%1) how does PNGD moves $\cc_\k$ towards the Pareto local optimal set, and 2) how it optimizes $F(\cc)$ in a neighborhood of the Pareto set in the sense made precise below. 
%
%
To simplify the result and highlight the intuition, we focus on the continuous-time limit of {\PNG},  
which yields a differential equation $\df  \cc_\tim = - v_\tim \df \tim$ with $v_\tim$ defined in \eqref{opt: relax}, 
where $t\in\RRplus$ is the continuous integration time. 
%\red{A similar but more technical analysis of the  discrete time can be found in Appendix. }

%\subsubsection{Descent of $\ell_\i$}
\begin{assumption}\label{asm:basic}
Let $\{\cc_t\colon t\in\RRplus\}$ be a solution of $\df \cc_t = -v_t \df t$ with $v_t$ in \eqref{opt: relax}, $\phi_t$ in \eqref{equ:phi}, $\ep>0$, and $\alpha_t \geq 0$ for all $t\in \RRplus$. 
Assume $F$ and $\L$ are continuously differentiable on 
$\RR^\dimcc$, and lower bounded with 
$F\true\defeq \inf_{\cc\in \RR^\dimcc}F(\cc) > -\infty$ and $
\ell_\i\true \defeq \inf_{\cc\in \RR^\dimcc}\ell_\i(\cc)  > -\infty$. 
Assume $\sup_{\cc\in\RR^\dimcc}\norm{\dd F(\cc)}\leq c$. 
%Assume the solution of \eqref{opt:relax} exists 
\end{assumption}

Technically, 
$\df \cc_t = -v_t \df t$ is a piecewise smooth dynamical  system whose solution should be taken in the Filippov sense using the notion of {differential inclusion} \citep{bernardo2008piecewise}.   
%The solution also exists
%There always exists a 
The solution always exists %for \red{bounded velocity fields}
under mild regularity conditions 
although it may not be unique. 
%The solution may not be unique by 
Our results below apply to all  solutions. 
%The solution of  $\df \cc_t = -v_t \df t$ also exists but may not be unique. 
%results with discrete time  
%follows analogously and is included in the Appendix.  
%We investigate the descent and convergence property of PNG. For simplicity, we consider a fixed step size scheme (i.e., $\xi$ = const) and we assume $\th_0 \in \P_\epsilon$. To obtain $\th_0 \in \P_\epsilon$, any multitask learning algorithm might be applied including the linear scalairzation, MGD and the proposed PNG with $\alpha_\k$ uniformly lower bounded by a strictly positive value. Generally, we are able to find $\th_0 \in \P_\epsilon$ within $\mathcal{O}(1/\epsilon)$ steps. Generalizing the result to different step size scheme is straightforward.
%\red{To justify PNG, we seek answers to the following questions: 1.) In what sense PNG decreases $F$; 2.) whether PNG converges in an good rate; 3.) what property the solution returned by PNG has.} 
%Our analysis focus on general non-convex functions and hence we use the gradient norm as the criterion of local optimality. 

%To quantify the theoretical properties of our method, we need to define a proper notion of neighborhood of $\P$ and a criterion for (approximate) local optimality of $F$ inside the neighborhood of $\P$. 

\subsection{Pareto Optimization} 
We 
now show that the algorithm converges to the vicinity of the Pareto set, 
quantified by a notion of Pareto closure. 
%to the Pareto set . 
%We give an quick overview of the result. 
For $\epsilon\geq 0$, 
let $\P_\epsilon$   
%We define 
be the set of Pareto $\epsilon$-stationary points: 
$\P_{\epsilon} = \{\cc\in \RR^\dimcc \colon ~ g(\cc) \leq \epsilon\}$. %Further, for $u\geq 0$, 
%We define 
The Pareto closure of a set $\P_{\epsilon}$, denoted by $\overline\P_\epsilon$, is the set of points 
that perform no worse than at least one point in $\P_{\epsilon}$, that is, 
%Our results involve the following notion of $(\epsilon, u)$-closure of $\P$: 
\begin{align*}
\overline{\P}_{\epsilon}:= \cup_{\cc\in \P_{\epsilon}} \overline{\{\cc\}}, && 
\overline{\{\cc\}} = \{\cc'\in \RR^\dimcc\colon ~~ \L(\cc') \preceq \L(\cc)\}.
%\left\{ \th:\ \exists\th'\in \P_{\epsilon}\ ~~\text{s.t.}\ ~~ %\forall \i\in[m],\ 
%\L(\cc) \preceq \L(\cc')\}.
%\overline{\P}_{\epsilon}:=\left\{ \th:\ \exists\th'\in \P_{\epsilon}\ ~~\text{s.t.}\ ~~ %\forall \i\in[m],\ 
%\L(\cc) \preceq \L(\cc')\}.
%\max_{\i\in[m]}\ell_\i(\th)-\ell_\i(\th')\leq u\right\}.
\end{align*}
%In particular, ${\P}_{\epsilon, 0}$ is the set of points $\cc$ such that $\L(\cc) \preceq \L(\cc')$ for some point $\cc'$ in $\P_\epsilon$.
%that Pareto dominate (or equivalent to)  at least one point in $\P_\epsilon$. 
%that performs no worse than some parameter in $\P_\epsilon$, which is a natural set to consider.
Therefore, 
$\overline\P_\epsilon$ is at least as good as $\P_\epsilon$ 
in terms of Pareto efficiency.
%( better)
%there is no loss in terms of objectives $\L$ if we replace $\P_\epsilon$ with $\overline\P_\epsilon$. 
%if $\P_\epsilon$ is a good proxy of the Pareto set, 
%so is $\overline\P_{\epsilon}$. % for small $u$. 
%As shown in \red{XXX}, a key property of our method is that it guarantees to enter $\P_{\epsilon,u}$ and stays within it afterwards for some small $\epsilon$ and $u$ that depends on the step size $\xi$ and the control parameters $e,\alpha_\k$ in \eqref{equ:phi}. 
%In addition, in \red{XXX}, we show that our method minimize $F(\cc)$ inside $\overline \P_{\epsilon}$ in the sense of \red{XXX}. 
%can be shown %$\bar{\P}_{\epsilon, u}$ gives some extra freedom allowing the loss of tasks to be sightly larger.
%to be the set of points st
%that is, $$

\begin{theorem}[Pareto Improvement on $\L$]  \label{thm:odeell}
 Under Assumption~\ref{asm:basic}, 
assume $\cc_0\not\in\P_\ep$ and let $t_\ep$ be the first time at which $\cc_{t_\ep} \in \P_\ep$. Then, for any time $t < t_\ep,$ 
\bb 
\frac{\df}{\df t} \ell_\i( \cc_t) \leq -\alpha_t g(\cc_t)~~~\forall \i\in[m],\qquad 
\min_{s\in[0,t]} g(\cc_s) \leq \frac{\min_{\i\in[m]}(\ell_\i(\cc_0)-\ell_\i\true)}{\int_0^t \alpha_s \df s}.
\ee 
Therefore,  the update yields Pareto improvement on $\L$ when $\cc_t \not\in \P_\ep$ and  $\alpha_t g(\cc_t)>0$.

%Therefore, 
Further, 
if $\int_0^\infty \alpha_s \df s = +\infty$, 
then for any $\epsilon>\ep$,  there exists a finite time $t_\epsilon \in \RRplus$ at which the solution enters $\P_{\epsilon}$ and stays within $\overline \P_\epsilon$ afterwards, that is, we have $\cc_{t_\epsilon} \in \P_\epsilon$ and $\cc_t\in \overline \P_\epsilon$ 
%$\{\cc_{t}\colon ~ t\geq t_\epsilon\} \subseteq \overline \P_\epsilon $. 
for any $t \geq t_\epsilon$. 
%then for any $t$ 
%1) We achieve Pareto improvement on $\L$ when $g(\cc) > \epsilon$, and hence it reaches $\P_{\epsilon}$ within  $O(1/\epsilon)$ steps.
%2) Once it reaches $\P_{\epsilon}$, it stays within $\P_{\epsilon,u}$ for $u=xxx$ afterwards. 
\end{theorem}
Here we guarantee that $\cc_t$ must enter
 $\P_\epsilon$ at some time (in fact infinitely often), 
but it is not confined to $\P_\epsilon$. 
%but may leave it. 
%for at least 
%Note that $\cc_t$ may leave $\P_\epsilon$ %, $\epsilon >\ep$  after it first enters $\P_\epsilon$.
%We emphasis that
On the other hand, 
$\cc_t$ does not leave $\overline \P_\epsilon$ after it first enters $\P_\epsilon$ thanks to the Pareto improvement property. 
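
The first part of Theorem~\ref{thm:odeell} follows from a short calculation, sketched here informally by treating $\cc_t$ as differentiable in $t$. For $t<t_\ep$ we have $\phi_t = \alpha_t g(\cc_t)$, so the constraint in \eqref{opt: relax} gives, for every $\i\in[m]$,
\begin{align*}
\frac{\df}{\df t}\ell_\i(\cc_t) = -\dd\ell_\i(\cc_t) \tt v_t \leq -\phi_t = -\alpha_t g(\cc_t).
\end{align*}
Integrating over $[0,t]$ and using $\alpha_s \geq 0$ and $\ell_\i(\cc_t)\geq \ell_\i\true$, we obtain
\begin{align*}
\min_{s\in[0,t]} g(\cc_s) \int_0^t \alpha_s\df s \leq \int_0^t \alpha_s g(\cc_s)\df s \leq \ell_\i(\cc_0) - \ell_\i(\cc_t) \leq \ell_\i(\cc_0)-\ell_\i\true,
\end{align*}
and taking the minimum over $\i\in[m]$ gives the stated bound on $\min_{s\in[0,t]} g(\cc_s)$.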

\subsection{Criterion Optimization}
We now show that {\PNG} finds a local optimum of $F$ inside the Pareto closure $\overline \P_{\epsilon}$ in an approximate sense. 
We first show 
that a fixed point $\cc$ of the algorithm around which $F$ and $\L$ are locally convex must be a local optimum of $F$ in the Pareto closure of $\{\cc\}$, and then quantify the convergence of the algorithm. 
%a decay on $\norm{v_t}$.  
\begin{theorem}[PNG Finds Local Optimum]\label{lem:dfjijgfgfgfgifjg} 
Under Assumption~\ref{asm:basic}, the following hold.

i) If $\cc_t \not\in \P_\ep$  
is a fixed point of the algorithm, that is, $\frac{\df \theta_t}{\df t} = -v_t = 0$, and $F$, $\L$ are convex in a neighborhood of $\cc_t$, then 
$\cc_t$ is a local minimum of $F$
in the Pareto closure $\overline{\{\theta_t\}}$, 
that is, there exists a neighborhood of $\cc_t$ in which there exists no point $\cc'$ such that $F(\cc') < F(\cc_t)$ and $\L(\cc') \preceq \L(\cc_t)$. 

ii) If $\cc_t\in \P_\ep$, we have 
$v_t = \dd F(\cc_t)$, 
and hence a fixed point with $\frac{\df \theta_t}{\df t} = -v_t = 0$ is an unconstrained local minimum of $F$ when $F$ is locally convex around $\cc_t$. 
\end{theorem}
% is a necessary condition of $\cc_t$ being an unconstrained local minimum of $F$. 
\begin{theorem}[Convergence] \label{thm:odef}
Let  $\epsilon > \ep$ and assume $g_{\epsilon} \defeq \sup_{\cc} \{g(\cc) \colon ~\cc\in \overline \P_\epsilon\}<+\infty$ and  $\sup_{t\geq0}\alpha_t<\infty$.   
Under Assumption~\ref{asm:basic}, 
when we initialize from $\cc_0 \in \P_\epsilon$, % for $\forall \epsilon > \ep$, 
we have 
$$
%\min_{s\in[0,t]}\norm{v_s}^2 \leq 
\min_{s\in[0,t]}\norm{ \frac{\df\cc_s}{\df s} }^2 \leq 
\frac{F(\cc_0)-F\true}{t} + \frac{1}{t}\int_{0}^t 
\alpha_s \left (\alpha_s {g_\epsilon}    + 
c \sqrt{g_\epsilon}  \right ) 
%\sqrt{g(\cc_s)}
\df s. 
%F(\cc_%t)
$$
In particular, 
if $\alpha_t = \alpha$ is constant, then 
$\min_{s\in[0,t]}\norm{\df \theta_s/\df s}^2  = \bigO\left (1/t + \alpha \sqrt{g_\epsilon} \right ).$

If  
$%A^\gamma \defeq 
\int_0^\infty\alpha_t^\gamma\df t <+\infty$ for some $\gamma \geq 1$, we have 
$\min_{s\in[0,t]}\norm{\df \theta_s/\df s}^2  = \bigO(1/t + \sqrt{g_\epsilon}/t^{1/\gamma}).$
\end{theorem}
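The $t^{-1/\gamma}$ rate above can be traced to H\"older's inequality; the following is an informal sketch, with $\bar\alpha \defeq \sup_{t\geq 0}\alpha_t$ and $A_\gamma \defeq \int_0^\infty \alpha_s^\gamma \df s$ being shorthand introduced here. For $\gamma \geq 1$,
\begin{align*}
\frac{1}{t}\int_0^t \alpha_s \df s \leq \frac{1}{t}\, t^{1-1/\gamma} \left(\int_0^t \alpha_s^\gamma \df s\right)^{1/\gamma} \leq \frac{A_\gamma^{1/\gamma}}{t^{1/\gamma}},
\end{align*}
and therefore the integral term in Theorem~\ref{thm:odef} satisfies
\begin{align*}
\frac{1}{t}\int_0^t \alpha_s\left(\alpha_s g_\epsilon + c\sqrt{g_\epsilon}\right)\df s \leq \left(\bar\alpha g_\epsilon + c\sqrt{g_\epsilon}\right)\frac{A_\gamma^{1/\gamma}}{t^{1/\gamma}}.
\end{align*}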


%It is clear that the choce%
Combining the results in Theorems~\ref{thm:odeell} and \ref{thm:odef}, we see that the choice of the sequence $\{\alpha_t\colon t\in \RRplus \}$ controls the trade-off between how fast we decrease $\L$ and how fast we decrease $F$: a large $\alpha_t$ yields faster descent on $\L$ but slower descent on $F$. 
%
Theoretically, 
using a sequence that satisfies $\int_0^\infty\alpha_t\df t =+\infty$ and $\int_0^\infty \alpha_t^\gamma \df t<+\infty$ for some $\gamma>1$ allows us to ensure that both $\min_{s\in[0,t]} g(\cc_s)$ and $\min_{s\in[0,t]} \norm{\df \theta_s/\df s}^2$ converge to zero. 
%
If we use a  constant sequence $\alpha_\tim =\alpha$, 
it introduces an $\bigO(\alpha \sqrt{g_\epsilon})$ term that does not vanish as $t\to +\infty$. However, 
%In practice, however, we find that it works sufficiently well by using a constant sequence $\alpha_\tim =\alpha$.  
%\paragraph{Remark} 
we can expect that $g_\epsilon$ is  small when $\epsilon$ is small
for well-behaved functions. %\red{ In appendix, we show that $g_\epsilon = \bigO(\epsilon)$ for functions that is Lipschitz smooth and satisfies Polayk Lojasiewicz conditions.} \qq{remove} 
In practice, we find that a constant $\alpha_\tim$
 works sufficiently well. 
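
As a concrete illustration (an example chosen here for exposition, not a schedule prescribed by the analysis), take $\alpha_t = 1/(1+t)$. Then
\begin{align*}
\int_0^t \alpha_s \df s = \log(1+t) \to +\infty, && \int_0^\infty \alpha_s^\gamma \df s = \frac{1}{\gamma-1} < +\infty ~~\text{for every $\gamma>1$},
\end{align*}
so both integrability conditions hold. Since $\alpha_t \leq 1$, the integral term in Theorem~\ref{thm:odef} is at most $(g_\epsilon + c\sqrt{g_\epsilon})\log(1+t)/t$, which vanishes as $t\to+\infty$, while the bound on $\min_{s\in[0,t]} g(\cc_s)$ in Theorem~\ref{thm:odeell} decays like $1/\log(1+t)$.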
 

\iffalse 

\iffalse 
Assume $\cc_t$ is a fixed point of the algorithm which satisfies $v_t = 0$, then 

i) If $\cc_t\in \P_\ep$, we have $v_t = \dd F(\cc_t)=0$.

ii) If $\cc_t \not\in \P_\ep$ and $\dd F(\cc_t)\neq 0$, 
then $\cc_t$ is a local minimum of $F$ inside the Pareto closure of $\cc_t$, that is, there exists a neighborhood of $\cc_t$ in which there exists no point $\cc'$ such that $F(\cc') < F(\cc_t)$ and $\L(\cc') \preceq \L(\cc_t)$.  
%$\phi_t < 0$ (including $\phi_t=-\infty$), 
%If $v_t =0$, then we have 
%\end{theorem}
\begin{proof}
The case when $\cc_t \in \P_\ep$ is obvious. We focus on the case when $\cc_t \not\in \P_e$. In this case, we have 
$$
v_t = \dd F(\cc_t) + \sum_{\i =1}^m \lambda_{\i, t} \dd \ell_i(\cc_t). 
$$
If the conclusion does not hold, then there exists a sequence $\{\cc_{t,j}\}_{j=1}^\infty$, such that $F(\cc_{t,j}) < F(\cc_{t})$ and $\L(\cc_{t,j}) \preceq \L(\cc_t)$, and $\lim_{j\to \infty}\cc_{t,j} = \cc_t$. 
This means that 
$$
F(\cc_{t,j}) 
$$
\end{proof}
\fi 
%\begin{} 
Before we proceed, we introduce several assumptions.
\begin{assumption}[Boundness and Smoothness] \label{asm: bound}
Suppose $F(\th)\geq 0$ for all $\forall \th \in \Theta$.  
There exists a constant $c\in(0,\infty)$ such that 
$$
\sup_{\th\in\Theta}\left  ( 
\norm{\nabla F(\th)}, ~~~
F(\cc),~~~
\sup_{\i\in[m]}\norm{\nabla\ell_\i(\th)} \right) \leq c. 
$$
In addition, there exists a constant $L\in(0,\infty)$, such that 
%$\dd F$ and 
%Suppose there exist constan%ts $c,L<\infty$ such that $\sup_{\th\in\Theta}\left\Vert \nabla %F(\th)\right\Vert \le c$,
%$\sup_{\th\in\Theta,\i\in[m]}\left\Vert \nabla\ell_\i(\th)\right\Vert \le c$,
%$\forall\th\in\Theta$, $0\le F(\th)\le c$ and for any $\th,\th'\in\Theta$,
$\norm{ \nabla F(\th)-\nabla F(\th')} \le L\norm{ \th-\th'}  $,
and $\norm{  \nabla\ell_\i(\th)-\nabla\ell_\i(\th')} \le L\left\Vert \th-\th'\right\Vert $ for $\forall \i\in[m]$.
\end{assumption}

\iffalse 
\begin{assumption} \label{asm: bound}
Suppose there exist constants $c,L<\infty$ such that $\sup_{\th\in\Theta}\left\Vert \nabla F(\th)\right\Vert \le c$,
$\sup_{\th\in\Theta,\i\in[m]}\left\Vert \nabla\ell_\i(\th)\right\Vert \le c$,
$\forall\th\in\Theta$, $0\le F(\th)\le c$ and for any $\th,\th'\in\Theta$,
$\left\Vert \nabla F(\th)-\nabla F(\th')\right\Vert \le L\left\Vert \th-\th'\right\Vert $,
$\max_{\i\in[m]}\left\Vert \nabla\ell_\i(\th)-\nabla\ell_\i(\th')\right\Vert \le L\left\Vert \th-\th'\right\Vert $.
\end{assumption}
\fi 

\begin{assumption}[Step Size] \label{asm: lr}
%Suppose that
The step size $\xi$ is no larger than 
$%\displaystyle 
\min\left (\frac{2(1-\sqrt{r})\sqrt{\epsilon}}{cL},\frac{1}{L},
%\frac{r\epsilon}{c\sqrt{2Lc}}
{\color{red}\frac{(r\epsilon)^2}{2Lc^3}}
\right )$, for the $r$ and $\epsilon$ in \eqref{equ:phi}. 
%$\$
\qq{I do not understand this. check later.}

\end{assumption}

\begin{assumption} \label{asm: a}
Suppose that $\sum_{k=0}^{\infty}\alpha_{\k}<\infty$ and $\alpha_\k \geq 0$.
\end{assumption}

\textbf{Remark on the Assumptions}
Assumption \ref{asm: bound} is a standard assumption on the boundedness of $F$ and gradient as well as the smoothness of the objectives. Assumption \ref{asm: lr} shows the requirement of the learning rate. Its dependence on $r$ and $\epsilon$ is due to our use of $r\epsilon$ to control the behavior of dynamics. Assumption \ref{asm: a} is mainly required for the convergence of the algorithm.

%Before we proceed to introduce the result, we give 
Our results involve 
the following notion of $(\epsilon, u)$-closure of $\P$: 
\begin{eqnarray}
\bar{\P}_{\epsilon,u}:=\left\{ \th:\ \exists\th'\in \P_{\epsilon}\ \text{s.t.}\ \forall \i\in[m],\ \ell_\i(\th)\le\ell_\i(\th')+u\right\}.
\end{eqnarray}
Specifically, $\bar{\P}_{\epsilon, 0}$ is the set of models that performs no worse than some parameter in $\P_\epsilon$, which is a natural set to consider. $\bar{\P}_{\epsilon, u}$ gives some extra freedom allowing the loss of tasks to be sightly larger.
%

\begin{theorem} \label{thm: converge}
Under Assumption \ref{asm: bound}, \ref{asm: lr} and assuming $\th_0\in\P_\epsilon$, we have:
\[
F(\th_{\k+1})-F(\th_{\k})\le-(\xi-\frac{\xi^{2}L}{2})\left\Vert d(\th_{\k},\alpha_{\k})\right\Vert ^{2}+\xi\alpha_{\k}g(\th_{\k})\mathbb{I}\{g(\th_{\k})>r\epsilon\}\sum_{\i\in[m]}\lambda_\i(\th_{\k},\alpha_{\k}).
\]
And for any $k\in\mathbb{N}\cup\{+\infty\}$, we have $\th_{\k}\in\bar{\P}_{\epsilon,\xi Lc}$.
\end{theorem}

\textbf{Remark on the Descent Property}
When $g(\th_\k)\le r\epsilon$, we are simply doing gradient descent on $F$ and thus the decrease on $F$ is simply proportional to the $\ell_2$ norm of the gradient. When $g(\th_\k) > r\epsilon$, an additional term may appear due to the fact that we are pushing $\th$ towards Pareto set and we expect an increase of $F$ if the gradient of $\ell_\i$ and the gradient of $F$ are conflicting strongly with each other. Notice that when $\alpha_\k=0$, we still ensure that $F$ strictly decreases before converge.

\textbf{Remark on the Controlled Trajectory}
PNG does not control the dynamics by directly minimizing $g(\th)$, which involves the computation of Hessian matrix and thus, unlike traditional constraint optimization, the model might move outside $\P_\epsilon$ and thus \emph{the solution returned by PNG might not be in $\P_\epsilon$. However, PNG successfully controls the trajectory such that at every step, the model is within a good set $\bar{\P}_{\epsilon, \xi L c}$ that is no worse than $\P_\epsilon$ with a slight extra error $\xi L c$ allowed.} This extra loss $\xi L c$ is due to the accumulation of higher order error in each steps and can be controlled by adjusting the learning rate.

\begin{theorem} \label{thm: converge2}
Under the assumptions in Theorem \ref{thm: converge} and further assume Assumption \ref{asm: a} holds, we have
for any $K\in\mathbb{N}\cup\{+\infty\}$,
$$
\min_{k\in[K]} \left\Vert d(\th_{\k},\alpha_{\k})\right\Vert ^{2}\le
\frac{C}{\xi K}, 
%\left \right)$
$$
where $C = 2(F(\th_{0})-F(\th_{\k})+\frac{2c^{3}\xi}{r\epsilon}\sum_{k=0}^{\infty}\alpha_{\k}) < +\infty$. And $\{\th_{\k}\}_{k=0}^{\infty}$ is a convergent sequence.

Besides, denote the limit $\th_{\infty}:=\lim_{k\to\infty}\th_{\k}$. If $g(\th_\infty)\le r\epsilon$, $\nabla F(\th_\infty)=0$; else, there exists $\lambda^{\infty}$ with $\lambda_\i^{\infty}\ge0$
such that $\nabla F(\th_{\infty})+\sum_{\i=1}^{m}\lambda_\i^{\infty}\nabla\ell_\i(\th_{\infty})=0$.
Here $\lambda_\i^{\infty}=\lim_{k\to\infty}\lambda_\i(\th_{\k},\alpha_{\k})$,
$\forall \i\in[m]$.
\end{theorem}
\textbf{Remark on the Convergence Property}
Theorem \ref{thm: converge2} shows that the algorithm converges with $\mathcal{O}(k^{-1})$ rate. And the model either converges to a stationary point of $F$ or a point that satisfies the Karush–Kuhn–Tucker condition of the problem (\ref{opt: relax}) with $\alpha=0$ (intuitively, we cannot further decrease the value of $F$ without increasing any loss $\ell_\i$).
\fi 

