%The inner product for the vector $\ell_2$ space is denoted as $\left\langle \cdot,\cdot\right\rangle$. 
%Given some linear space $S$, we denote the projection of a vector $v$ onto $S$ as $\text{Proj}_S(v)$. 
%m{\color{red}We define $(\cdot)_{+}=\max(\cdot,0)$ and given some vector $v\in\mathbb{R}^{n}$, we define $(v)_{+}=\left[(v_{1})_{+},...,(v_{n})_{+}\right]^{\top}$.}



\section{Background on Multi-objective Optimization} 
\label{sec: background}
%multi-task Learning}
%\subsection{Pareto Optimality} 
We introduce the background on multi-objective optimization (MOO) and Pareto optimality. 
%\textbf{Notation}
For notation, we denote by $[m]$ the integer set $\{1,2,\ldots,m\}$, and by
$\RRplus$ the set of non-negative real numbers. 
Let $\C^m = \left\{\omega\in \RRplus^m:~\sum_{\i=1}^m \omega_\i = 1\right\}$ be the probability simplex.
%
We denote by $\left\Vert \cdot\right\Vert$ the 
Euclidean norm. 

%
Let $\th \in \RR^\dimcc$ be a parameter of interest (e.g., the weights in a deep neural network). 
Let $\L(\cc)=[\ell_1(\cc),\ldots, \ell_m(\cc)]$ be a vector of $m$ objective functions that we want to minimize. 
%
%Because the different objectives may conflict with each other, it is typically impossible to find a $\cc$ that simultaneously optimize all the objectives. 
%The goal of MOO is to find solutions on the Pareto set of the objectives. 
%
%
%Consider a deep learning model with parameter $\th$ within parameter space $\Theta\subseteq\mathbb{R}^{n}$ for solving $m$ tasks. In the multi-task learning (MLT) setting, we have $m$ different tasks, each of which is associated with a loss function $\ell_\i(\theta),~~\forall \i\in [m]$ and the goal is to find a good $\th$ that gives good performance on all the tasks. Different from standard learning problems with only one task, tasks in MTL can be conflicting with each other, and thus performing well on one task might degrade the model performance on the others. The optimality of the solution to a MTL problem is characterized by the Pareto set.
%\paragraph{Pareto Set} 
%We write $\()$
%^Denote $\L(\th)=[\ell_{1}(\th),...,\ell_{m}(\th)]^{\top}$. We
For two parameters $\th,\th'\in \RR^\dimcc$,
we write $\L(\cc) \succeq \L(\cc')$ if $\ell_\i(\cc) \geq \ell_\i(\cc')$ for all $\i \in [m]$; 
and write  $\L(\cc) \succ \L(\cc')$  if  
$\L(\cc) \succeq \L(\cc')$ and $\L(\cc) \neq \L(\cc')$. 
We say that $\th$  is Pareto dominated (or Pareto improved)  by $\th'$ if  $\L(\cc) \succ \L(\cc')$.  
%(denoted as $\L(\th_{1})\dsqe\L(\th_{2})$) iff $\th_{1}$ performs 
%no better than $\th_{2}$ in any task and there is at least onetask that $\th_{2}$ performs strictly better than $\th_{1}$, i.e.,
%\begin{align*} 
%\L(\th_{1})\dsqe\L(\th_{2}) 
%&&\iff &&
%\ell_\i(\th_{1})\ge\ell_\i(\th_{2}),\ \forall \i\in[m]\ \text{and}\ ~~~  \L(\cc_1) \neq \L(\cc_2). 
%\exists \i\in[m],\ \ell_\i(\th_{1})>\ell_\i(\th_{2}).
%\end{align*}
%
%A point $\theta$ is said to be
We say that $\cc$ is Pareto optimal on a set $\Theta\subseteq \RR^\dimcc$, denoted as $\theta\in \mathrm{Pareto}(\Theta)$, if
%if no point in $\Theta$ dominates  $\cc$, that is,
%it is 
there exists no $\theta' \in \Theta$ such that $\L(\th)\succ\L(\th')$, that is, no point in $\Theta$ Pareto dominates $\th$. 
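
As a minimal illustration, consider a toy example that we assume here purely for exposition: $\dimcc = 1$, $m=2$, $\ell_1(\cc) = \cc^2$ and $\ell_2(\cc) = (\cc-1)^2$. Then
\begin{align*}
\L(2) = [4,\, 1] \succ [1,\, 0] = \L(1),
\end{align*}
so $\cc = 2$ is Pareto dominated by $\cc' = 1$. In contrast, for any $\cc \in [0,1]$, moving to $\cc' > \cc$ strictly increases $\ell_1$ and moving to $\cc' < \cc$ strictly increases $\ell_2$, so no $\cc'$ dominates $\cc$. One can check that $\mathrm{Pareto}(\RR) = [0,1]$ in this toy example.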
%
%We say that $\cc$ is local Pareto optimal (in $\RR^\dimcc$) if there exists a neighborhood $\mathcal N_\theta$ of $\cc$, such that $\cc$ is Pareto optimal on $\mathcal N_\cc$. 
%A point $\cc$ 

The Pareto global optimal set  $\P^{**} \defeq \mathrm{Pareto}(\RR^{\dimcc})$  
is the set of points $\cc$ that are Pareto optimal on the whole domain $\RR^\dimcc$.  
The Pareto local optimal set %(which we simply call Pareto set sometimes) 
of $\L$, denoted by $\P^{*}$, 
%%is defined as 
is the set of points each of which is Pareto optimal on some neighborhood of itself: 
%all Pareto optimal models: 
\begin{align*}
\P^{*}:=\{\th\in\RR^\dimcc: ~~
& \text{there exists a neighborhood $\mathcal N_\cc$ of $\cc$, }
\\
& \text{such that $\cc\in \mathrm{Pareto}(\mathcal N_{\cc})$} \}. 
\end{align*}
The (local or global) Pareto front 
%is defined accordingly by 
is the set of objective vectors 
achieved  by the Pareto optimal points, e.g., %$\cc$ in $\P^*$. 
the local Pareto front is 
$\mathcal F^* = \{\L(\th):\th\in\P^*\}$. 
%
Because finding global Pareto optima 
is intractable for the non-convex objectives in deep learning, 
%objective functions in deep learning are almost always non-convex, making it intractable to access or achieve global Pareto optimality, 
we focus on Pareto local optimal sets in this work; 
in the rest of the paper, terms like ``Pareto set'' and ``Pareto optimum'' refer to the Pareto local optimal set and Pareto local optima by default. 
%to mean Pareto local optimal by default. 
%refer 
%\red{and drop the term ``local'' for convenience. }
%refer Pareto local optimal as use Pareto optimal for concern . 

\paragraph{Pareto Stationary Points} 
Similar to the case of single-objective optimization, 
Pareto local optimality implies a notion of Pareto stationarity, defined as follows. Assume $\L$ is differentiable on $\RR^\dimcc$. A point $\cc$ is called Pareto stationary if there exists a set of non-negative weights $\omega_1,\ldots, \omega_m$ with $\sum_{\i=1}^m \omega_\i = 1$ such that $\cc$ is a stationary point of the $\omega$-weighted linear combination of the objectives, $\ell_{\omega}(\cc)\defeq \sum_{\i=1}^m \omega_\i \ell_\i(\cc)$, that is, $\nabla\ell_{\omega}(\cc) = 0$. 
%One can show that if $\cc$ is Pareto optimal on $\RR^\dimcc$, then it must be Pareto stationary.  
Equivalently, the set of Pareto stationary points, denoted by $\P$, 
can be characterized by
\begin{align}\label{equ: pareto stationary}
    \P&:=\left\{ \th\in\RR^\dimcc:g(\th)=0\right\},
    \\
    \nonumber
    g(\th)&:=\min_{\omega\in\C^m}\left\Vert\sum_{\i=1}^m \omega_\i\nabla\ell_\i(\th)\right\Vert^2, 
\end{align}
%  \bbb 
%  \label{equ: pareto stationary}
% \P:=\left\{ \th\in\Theta:g(\th)=0\right\}, &
%   %g(\th):=\min_{\omega\in\C^m}\norm{\sum_{\i=1}^m \omega_\i\nabla\ell_\i(\th)}^2, &&
%  g(\th):=\min_{\omega\in\C^m}||\sum_{\i=1}^m \omega_\i\nabla\ell_\i(\th)||^2, 
%  %~~~~~~~~
%  %\C^m = \left\{\omega\in \RRplus^m,~~\sum_{\i=1}^m \omega_\i = 1\right\},
% \eee 
where $g(\cc)$ is the minimum squared gradient norm of $\ell_{\omega}$ among all $\omega$ in the probability simplex $\C^m$. 
Because $g(\th)$ can be computed in practice (it is the optimal value of a convex quadratic program over $\C^m$), 
it provides a practical way to assess Pareto local optimality: being a Pareto stationary point is a necessary condition for being a Pareto local optimum.
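
To give a concrete sense of how $g$ behaves, consider a toy example that we assume here only for illustration: $\dimcc=1$, $m=2$, $\ell_1(\cc)=\cc^2$ and $\ell_2(\cc)=(\cc-1)^2$. Writing $\omega = [\omega_1,\, 1-\omega_1]$ with $\omega_1\in[0,1]$, we have
\begin{align*}
g(\cc) &= \min_{\omega_1\in[0,1]} \left( 2\omega_1 \cc + 2(1-\omega_1)(\cc-1) \right)^2
= \min_{\omega_1\in[0,1]} 4\left(\cc - (1-\omega_1)\right)^2 \\
&= \begin{cases}
4\cc^2, & \cc<0 \quad (\text{attained at } \omega_1 = 1),\\
0, & 0\le \cc\le 1 \quad (\text{attained at } \omega_1 = 1-\cc),\\
4(\cc-1)^2, & \cc>1 \quad (\text{attained at } \omega_1 = 0).
\end{cases}
\end{align*}
Hence the Pareto stationary set of this toy example is the interval $[0,1]$. Here every stationary point also happens to be Pareto optimal; in general, stationarity is only a necessary condition.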
%and $\cc$ is Pareto stationary 

%\red{remove this paragraph:}
%Because the loss functions $\L$ are non-convex in deep learning applications, it is difficult to find the Pareto optimal set (corresponding to global minimizers). 
%We instead focus on  local descent algorithms which can only guarantee to find points that are Pareto stationary.  \qq{we "do not focus on Pareto stationary points"; we focus on finding local optimal (whose concept we did not introduce) with gradient descent, but our analysis only give gurantee in terms statoinary.}

%Pareto optimal set can be viewed as `global minimizer' of the multi-taskproblem, which, however, is usually difficult to obtain using gradientbased algorithm due to the non-convexity of the optimization in deep learning. We instead consider the `local minimizer' of the multi-task problem,which can be characterized by the following Pareto stationary set$\P$ \citep{desideri2012multiple}: \begin{align} \label{equ: pareto stationary}\P:=\left\{ \th\in\Theta:g(\th)=0\right\} ,\ \ \ \ g(\th):=\min_{\lambda\in\C^m}\Vert \textstyle{\sum}_{\i\in[m]}\lambda_\i\nabla\ell_\i(\th)\Vert^2, \end{align}
%{where $\C^m$ is the probability simplex, that is,  $\C^m:=\{\lambda \in\mathbb{R}^{m}:\lambda_\i\ge0,\ \forall \i\in[m]\ \text{and}\ \sum_{t=1}^{m}\lambda_\i=1\}$}. The Pareto front $:=\{\L(\th):\th\in\P\}$ is defined accordingly by examing the losses given by the models in $\P$.
% It can be shown that, for any $\th\in P$, there exists
% $\epsilon>0$ such for for any $\th'$ such that $\left\Vert \th'-\th\right\Vert \le\epsilon$,
% $\th$ is not strictly dominated by $\th'$ \qq{should not we need additional high order information to ensure this?}.

%\subsubsection{Preliminary: Representative Algorithms in MTL}
%\subsubsection{Finding a Single Point on Pareto Set}

\paragraph{Finding  Pareto Optimal Points}  
%We review two representative algorithms that is able to converge to some model that is in $\P$.
%\textbf{Linear Scalarization}
%To find a Pareto optimal point, 
A main focus of the MOO literature is to find one or a set of Pareto optimal points. 
The simplest approach is \emph{linear scalarization}, 
which minimizes $\ell_{\omega}$ for some weight $\omega$ in $\C^m$ (chosen, e.g., by the user). 
%to learn a model $\th \in \P$ is linear
%scalarization, which minimizes the weighted average of the losses %with some preference coefficients $\{\omega_\i\}$. 
%\begin{align} \label{equ:linear}
%\ell(\th)=\textstyle{\sum}_{t=1}^{m}\omega_\i\ell_\i(\th). 
%\end{align}%$\sum_\i\lambda_\i \ell_\i$. 
%Obviously, the optimum of \eqref{equ:linear} for any $\{\omega_\i\}$lies in $P$.
However, %this requires knowledge 
%on how to pick $\omega$, and more importantly, can only find the points on 
%Pareto optimal points on the %linear scalarization is only able to converge to model producing losses $\L(\th)$ that are on 
%linearis
linear scalarization can only find Pareto points that lie on 
the \emph{convex envelope} of the Pareto front \citep[see e.g.,][]{boyd2004convex}, and hence does not give a complete profile of the Pareto front when the objective functions (and hence their Pareto front) are non-convex.
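
To illustrate this limitation, consider a hypothetical front that we assume here for exposition only: $m=2$ and the Pareto front is the concave curve $\{[t,\, 1-t^2]: t\in[0,1]\}$. Along this curve, the scalarized objective with weight $\omega=[\omega_1,\omega_2]\in\C^2$ is
\begin{align*}
h_\omega(t) = \omega_1 t + \omega_2 (1-t^2), \qquad h_\omega''(t) = -2\omega_2 \le 0,
\end{align*}
so $h_\omega$ is concave in $t$ and its minimum over $[0,1]$ is attained at an endpoint, with $h_\omega(0)=\omega_2$ and $h_\omega(1)=\omega_1$. The minimizer of the scalarized objective over the front is therefore $[0,\,1]$ when $\omega_2<\omega_1$ and $[1,\,0]$ when $\omega_1<\omega_2$. No choice of $\omega$ makes an interior point such as $[1/2,\,3/4]$ the minimizer, even though every point on the curve is Pareto optimal.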


%Given some user specified preference vector $\omega\in\C^{m}$,the linear scalarization algorithm combines the losses of the multipletasks into a single loss using a weighted sum, i.e., \[\ell(\th)=\sum_{t=1}^{m}\omega_\i\ell_\i(\th).\] And thus at convergence, we have $\left\Vert \ell(\th)\right\Vert =\left\Vert \sum_{t=1}^{m}\omega_\i\ell_\i(\th)\right\Vert =0$,which ensures $\th$ at convergent belongs to $P$. Despite its simplicity, 
%when the Pareto front is non-convex, the 
%linear scalarization can only find the points on the convex envelop of the Pareto front, and hence does not provide a full characterization of Pareto optimality. See \cite{boyd2004convex} for more details.

\emph{Multiple gradient descent (MGD)} \citep{desideri2012multiple} %overcomes the profiling issue of linear scalarization. 
is a gradient-based algorithm that can 
converge to a Pareto local optimum that lies on either the convex or non-convex parts of the Pareto front, depending on the initialization.  %, although does not provide a ma
%Multiple gradient descent (MGD) \citep{desideri2012multiple} is  another typical algorithm that
%converges to the Pareto stationary set. 
%MGD is an iterative local descent algorithm that 
MGD starts from some initialization $\cc_0$ and updates $\cc$ at the $\k$-th iteration by 
\begin{align}
\label{equ: update mgd} 
\cc_{\k+1} & \gets \cc_\k - \xi v_\k,
\\
\nonumber
v_\k & \defeq 
\argmax_{v\in \RR^\dimcc}
\left\{ 
 \min_{\i\in[m]} \dd \ell_\i(\cc_\k)\tt v 
-\frac{1}{2}\norm{v} ^{2}
\right\},
\end{align}
where $\xi$ is the step size  and 
$v_\k$ is 
an 
update direction that maximizes 
the \emph{worst} descent rate among all objectives, since 
%ensures \emph{all} the losses $\{\ell_\i\}$ are simultaneously decreased. This is because 
%\begin{align} \label{equ: update mgd}
%d(\th):=\arg\max_{d\in \mathbb{R}^n} \ \min_{\i\in[m]}\left\langle d,\nabla\ell_\i(\th)\right\rangle-1/2\left\Vert d\right\Vert ^{2},
%\end{align}
$%\min_{\i\in[m]} %\nabla\ell_\i(\th_\k)\tt v 
 %\approx 
 \nabla\ell_\i(\th_\k)\tt v  \approx 
 %\min_\i
 (\ell_\i(\theta_\k) - \ell_\i(\theta_\k-\xi v))/\xi$
 approximates the descent rate of objective $\ell_\i$ 
 %when following 
 %along %
 when following direction $v$.  
 %when $\xi$ is small. 
 %denotes the
 %\emph{worst} decreasing rate among all losses $\{\ell_\i\}$ when following direction $v$. 
% Using the strong Lagrange duality, the solution of the above problem
% is 
% \begin{align*}
% d(\th) & =\sum_{t=1}^{m}\lambda_\i(\th)\nabla\ell_\i(\th),\ \lambda(\th) =\arg\min_{\lambda\in\C^{m}}\left\Vert \sum_{t=1}^{m}\lambda_\i\nabla\ell_\i(\th)\right\Vert .
% \end{align*}
% Here $\lambda(\th)$ can be obtained by standard quadratic program
% solver.
%
When using a sufficiently small step size $\xi$, MGD is guaranteed to yield a \emph{Pareto improvement} (i.e., decreasing all the objectives) on $\cc_\k$ whenever $v_\k \neq 0$ (in particular, $v_\k = 0$ whenever $\cc_\k$ is Pareto local optimal); this is because $v=0$ attains objective value zero in \eqref{equ: update mgd}, so the maximizer always satisfies $\min_{\i\in[m]} \dd \ell_\i(\cc_\k)\tt v_\k\geq \frac{1}{2}\norm{v_\k}^{2} \geq 0$, with strict inequality when $v_\k \neq 0$. 

Using Lagrange strong duality, the solution of \eqref{equ: update mgd} can be written as 
\begin{align} \label{equ:mgd_dual}
v_\k & = \sum_{\i=1}^m \omega_{\i,\k} \dd \ell_\i(\cc_\k),
\\
\nonumber
\text{where }
  \{\omega_{\i,\k}\}_{\i=1}^m
  & =\arg\min_{\omega\in\C^{m}}\norm{ 
  \dd_\cc \ell_{\omega} (\cc_\k) }.
\end{align}
It is easy to see from \eqref{equ:mgd_dual} 
that the set of fixed points of MGD (which satisfy $v_\k=0$) 
coincides with the Pareto stationary set $\P$. 
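
The passage from \eqref{equ: update mgd} to \eqref{equ:mgd_dual} can be sketched as follows, using the shorthand $u_\i := \nabla\ell_\i(\cc_\k)$ (our notation for this sketch only). Since the minimum of finitely many numbers equals the minimum of their convex combinations, and since the objective is concave in $v$ and linear in $\omega$ over the compact convex set $\C^m$, a standard minimax argument gives
\begin{align*}
\max_{v\in\RR^\dimcc} \min_{\i\in[m]} \left\{ u_\i^\top v - \frac{1}{2}\left\Vert v \right\Vert^2 \right\}
&= \max_{v\in\RR^\dimcc} \min_{\omega\in\C^m} \left\{ \Big(\sum_{\i=1}^m \omega_\i u_\i\Big)^{\top} v - \frac{1}{2}\left\Vert v \right\Vert^2 \right\} \\
&= \min_{\omega\in\C^m} \frac{1}{2} \Big\Vert \sum_{\i=1}^m \omega_\i u_\i \Big\Vert^2,
\end{align*}
where the inner maximization over $v$ is attained at $v = \sum_{\i=1}^m \omega_\i u_\i$. As a small worked instance (with gradients chosen by us for illustration), take $m=2$, $u_1 = [1,\,0]^\top$ and $u_2 = [0,\,1]^\top$. Then $\left\Vert \omega_1 u_1 + (1-\omega_1) u_2 \right\Vert^2 = \omega_1^2 + (1-\omega_1)^2$ is minimized at $\omega_1 = 1/2$, which gives $v_\k = [1/2,\,1/2]^\top$ and $u_1^\top v_\k = u_2^\top v_\k = 1/2 = \left\Vert v_\k \right\Vert^2$, so both objectives decrease at the same first-order rate.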

A key disadvantage of MGD, however, 
is that the Pareto point 
that it converges to 
depends on the initialization and other algorithm configurations in a rather implicit and complicated way. 
It is difficult to explicitly control MGD to make it converge to points with specific properties. 


%Approximation of MGD in deep learning that reduces the computational cost includes \citet{NEURIPS2018_432aca3a}. Although MGD is able to converge to both convex and non-convex parts of the Pareto front, there is no explicit way to control which point in Pareto stationary set it will land to and thus it is generally hard to apply MGD to find a model $\th \in \P$ that satisfies some user criterion.

% Different from linear scalarization, MGD is able to converge to any $\th\in P$\qq{1) MGD does not gurantee to converge any points, so I would say something like "can converge to both convex and non-convex parts"; 2) we should make it clear that the final points it converges to the initilaization; and there is no explicit way to control which point it will land to -- this is a rather non-standard thing that people say MGD the first time would wonder; recall what you thought when you first see MGD.}. 
% \qq{overall, i feel the discussion here is a bit brief, but depends on space.}
% \paragraph{Pareto Stationary set Exploration}

% Besides learning a model $\th\in P$, we can also perturb $\th$ such
% that it travels on $P$. It has been theoretically shown in {[}xxx{]}
% and empirically verified in {[}xxx{]} that by perturbing $\th$ along
% any direction $d$ that belongs to the tangent space of $\th$. As
% shown by {[}xx{]}, the tangent space $T(\th)$ is
% \begin{align*}
% T(\th) & =\text{colspan}\left\{ H(\th)^{-1}\nabla\ell_\i(\th)\right\} _{t=1}^{m},\\
% H(\th) & :=\sum_{t=1}^{m}\lambda_\i^{*}\nabla^{2}\ell_\i(\th),\\
% \lambda^{*} & =\arg\min_{\lambda\in\C^{m}}\left\Vert \sum_{t=1}^{m}\lambda_\i\nabla\ell_\i(\th)\right\Vert .
% \end{align*}
