\section{Optimization In Pareto Set} \label{sec: problem}


The Pareto set typically contains an infinite number of points. 
%contains a collection of different candidate models with potential values. 
%However, in practice, we might be only interested in finding special Pareto points 
%that satisfy some user-defined criterion. 
%This  can be formulated as  an
In the \emph{optimization in Pareto set} (OPT-in-Pareto) problem,
we are given an extra criterion function $F(\theta)$ in addition to the objectives $\L$, and 
we want to minimize $F$ in the Pareto set of $\L$, that is, 
%information, we can not decide which 
%for users with different preferences. 
%We consider the problem of \emph{Optimization within Pareto Set} (OPT-in-Pareto), of which the goal is to find a model within the Pareto set that minimizes an given user specified criterion objective $F$:
\begin{eqnarray} \label{equ: main_problem}
\min_{\th\in \P^*}F(\th).
%\min_{\cc}F(\th),~~~~~s.t.~~~~~ \th\in \P^*. 
\end{eqnarray}
%
%\textbf{Singleton Preference}A natural application of \eqref{equ: main_problem} is to quantify the preference of $\th$ based on loss-related criterion $F$. The choice of $F$ can be flexible. 
For example, 
one can find the Pareto point whose loss vector $\L(\cc)$ is the closest to a given reference point $r\in \RR^m$ by choosing 
$F(\th) = \norm{\L(\th) - r }^2$. We can also design $F$ to encourage $\L(\cc)$ to be proportional to $r$, i.e., $\L(\cc)\propto r$; a constrained variant of this problem was considered in 
\citet{mahapatra2020multi}. % considered the called exact Pareto optimization (EPO). 
%which finds a Pareto point that strictly satisfies $\L(\cc)\propto u$ (assuming it is feasible). 
% by defining $F$ to be $F(\cc) = $
%given any template vector $v$ and we want to find $\th \in \P$ such that its performance (i.e., $\L$) is close to $v$, we may choose $F(\th) = ||\L(\th) - v||^2$.
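As a concrete illustration of the reference-point criterion, write $\L = (\ell_1,\ldots,\ell_m)$ for the individual objectives (a notational assumption made here) and assume each $\ell_i$ is differentiable. The chain rule then gives
\[
\nabla F(\th) = 2\sum_{i=1}^m \big(\ell_i(\th) - r_i\big)\,\nabla \ell_i(\th),
\]
so $\nabla F$ is a combination of the objective gradients weighted by the signed residuals $\ell_i(\th) - r_i$: objectives above their reference value are pushed down, and those below it are pushed up.
For the proportionality goal, one possible choice, given here only as an illustration, is
\[
F(\th) = \norm{\frac{\L(\th)}{\norm{\L(\th)}} - \frac{r}{\norm{r}}}^2,
\]
which, assuming $\L(\th)\neq 0$ and $r \neq 0$, vanishes exactly when $\L(\th) = c\, r$ for some $c>0$.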

%Another extension is to find some model $\th \in \P$ such that it satisfies a certain constraint $\Omega$ (i.e., we aim to find $\theta\in \Omega\cap\P$). This problem can be easily converted to OPT-in-Pareto by designing a $F$ such that any $\th\in\Omega$ is a global minimizer of $F$. A special case that has been considered in previous work \citep{mahapatra2020multi} is a loss ratio constraint $\Omega=\{\th:r_{1}\ell_{1}(\th)=r_{2}\ell_{2}(\th)=...=r_{m}\ell_{m}(\th)\}$, given some preference vectors $r$. Formulating the problem into OPT-in-Pareto, we can choose the $F$ to be the non-uniformity score used in \citet{mahapatra2020multi} that measures the violation of loss ratio as criterion $F$.


We can further generalize 
OPT-in-Pareto %can be generalized into a multi-model learning system, in which 
to allow the criterion $F$ to depend on an ensemble of Pareto points $\{\th_1, ...,\th_N\}$ jointly, that is, %. It allows us to consider the interaction between models:  % during training by using a criterion objective that depends on all models:
\begin{eqnarray} \label{equ: main_problem_multi}
\min_{\th_{1},...,\th_{N}\in \P^*}F(\th_{1},...,\th_{N}).
%\min_{\th_{1},...,\th_{N}}F(\th_{1},...,\th_{N}),~~~~~s.t.~~~~~
%\th_{1},...,\th_{N}\in \P^*
\end{eqnarray}
For example, if $F(\th_1,\ldots, \th_N)$ measures the diversity among $\{\th_i\}_{i=1}^N$, then optimizing it yields a set of diverse points inside the Pareto set $\P^*$ that provides a good approximation of $\P^*$.
An example of such a diversity measure is 
%See also in Section~\ref{sec:energy} 
%Specifically, given a set of models $\{\th_1,...,\th_N\}$, we propose to consider a measure of diversity as follows 
%we propose to use the following energy distance to measure whether the models in $\hat{P}$ are well distributed
\begin{align} \label{eqn: energy}
F(\theta_1, \ldots, \theta_N ) & =E(\L(\cc_1), \ldots, \L(\cc_N)
),
\\
\nonumber
\text{with } E(\L_1, \ldots, \L_N)
& =\sum_{i\neq j}
\left\Vert \L_i-\L_j\right\Vert ^{-2},
\end{align}
where $E$ is known as an \emph{energy distance} in 
computational geometry, 
whose minimizers can be shown to be asymptotically 
uniformly distributed on the manifold as $N\to\infty$ 
\citep{hardin2004discretizing}. 
This formulation is particularly useful when the users' preferences are unknown at training time, and we want to return an ensemble of models that covers the different areas of the Pareto set well, so that users can pick a model that fits their needs regardless of their preferences. 
%There has been a line of recent works 
The problem of profiling the Pareto set has attracted 
 a line of recent work 
\citep[e.g.,][]{lin2019pareto,mahapatra2020multi,ma2020efficient,deist2021multi}, but these methods
rely on specific criteria or heuristics and do not address the general optimization problem of the form \eqref{equ: main_problem_multi}.  
%\qq{discus other Pareto Opt works as well?}
%are restricted to specific 
%This problem can be formulated into OPT-in-Pareto by defining $F$ to measure the diversity between $\{\theta_i\}$ so that we can learn a set of models that are well distributed on the manifold of the Pareto front.
%\red{i feel here if we want to discuss related works }
%Previous attempts\citep{lin2019pareto,mahapatra2020multi,ma2020efficient,deist2021multi} on approximating Pareto set requires prior knowledge of the Pareto front in order to handcraft some heuristic preference rules that are based on indirect measures of the approximation quality. In comparison, the energy distance in \eqref{eqn: energy} is essentially a loss that directly measures the approximation quality and hence introduces a prior-knowledge-free approach with guaranteed optimality. {\color{red} we give more analysis in Appendix xxx}.
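To see how the energy in \eqref{eqn: energy} promotes diversity, note that differentiating $E$ with respect to $\L_i$ gives
\[
\nabla_{\L_i} E(\L_1,\ldots,\L_N) = -4\sum_{j\neq i}\frac{\L_i-\L_j}{\left\Vert \L_i-\L_j\right\Vert^{4}},
\]
where the factor $4$ arises because each unordered pair appears twice in the sum over $i\neq j$. A descent step on $E$ therefore pushes each $\L_i$ away from the remaining points, with a repulsion that grows rapidly as two points approach each other.
As a toy example (ours, for illustration only), consider $N=3$ loss vectors on the linear front $\{(t, 1-t)\colon t\in[0,1]\}$, with two of them fixed at the endpoints $t=0$ and $t=1$ and the third placed at $t\in(0,1)$. Since $\left\Vert (t,1-t)-(s,1-s)\right\Vert^{2} = 2(t-s)^2$, the energy reduces to
\[
E(t) = t^{-2} + (1-t)^{-2} + 1,
\qquad
E'(t) = -2t^{-3} + 2(1-t)^{-3},
\]
which is strictly convex on $(0,1)$ with its unique stationary point at $t = 1/2$. The minimizer thus places the free point midway between the other two, i.e., the three points are evenly spaced along the front.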

%\textbf{Existing Works: 
\paragraph{Manifold Gradient Descent} 
%\textbf{Existing Algorithms for Opt-in-Pareto}
%By viewing $\P$ as a manifold on the original parameter space $\Theta$, a natural and 
One straightforward approach to 
OPT-in-Pareto is to deploy manifold gradient descent \citep{hillermeier2001generalized,bonnabel2013stochastic}, 
which performs steepest descent of $F(\cc)$ 
on the Riemannian manifold formed by the Pareto set $\P^*$. Initialized at $\th_0\in \P^*$, manifold gradient descent updates $\th_{\k}$ at the $\k$-th iteration along the negative of the projection of $\nabla F(\th_\k)$ onto the tangent space $\mathcal T(\th_\k)$ of $\P^*$ at $\th_\k$, with step size $\xi$, % i.e., at step $\k$, we update $\th_\k$ via
\[
\th_{\k+1}=\th_\k-\xi\text{Proj}_{\mathcal T(\th_\k)}(\nabla F(\th_\k)).
\]
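Because $\text{Proj}_{\mathcal T(\th_\k)}$ is an orthogonal projection, this update is a descent step to first order: assuming $F$ is twice continuously differentiable, a Taylor expansion gives
\[
F(\th_{\k+1}) = F(\th_\k) - \xi \norm{\text{Proj}_{\mathcal T(\th_\k)}(\nabla F(\th_\k))}^2 + O(\xi^2),
\]
so, for a sufficiently small step size $\xi>0$, $F$ decreases unless the projected gradient vanishes, i.e., unless $\nabla F(\th_\k)$ is orthogonal to the tangent space.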
By using the stationarity characterization in \eqref{equ: pareto stationary}, under proper regularity conditions, 
one can show that the tangent space $\mathcal T(\th_\k)$ equals the null space of the Hessian matrix $\dd^2_{\cc} \ell_{\omega_\k}(\cc_\k)$, where $\omega_\k = \argmin_{\omega\in\C^m}\norm{\dd_{\cc}\ell_{\omega}(\cc_\k)}$. However, the key issue with manifold gradient descent is the high cost of computing the null space of this Hessian matrix. 
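To make this cost concrete, let $H_\k$ denote this Hessian and let $d$ be the dimension of $\cc$ (both symbols are introduced here only for illustration). Since $H_\k$ is symmetric, it admits an eigendecomposition $H_\k = \sum_{i=1}^{d} \lambda_i u_i u_i^\top$ with orthonormal eigenvectors $u_i$, and, under the null-space characterization above, the projection can be written as
\[
\text{Proj}_{\mathcal T(\th_\k)} = \sum_{i:\,\lambda_i = 0} u_i u_i^\top = I - H_\k^{+} H_\k,
\]
where $H_\k^{+}$ is the Moore--Penrose pseudoinverse of $H_\k$. Forming $H_\k$ explicitly requires $O(d^2)$ memory and a full eigendecomposition requires $O(d^3)$ time, which is prohibitive when $d$ is in the millions.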
%is the computation of the tangent space $T(\th_\k)$, the calculation of which requires the Hessian matrix w.r.t. the losses of the tasks \citep{hillermeier2001generalized}. 
Although numerical techniques such as Krylov subspace iteration \citep{ma2020efficient} or the conjugate gradient method \citep{koh2017understanding} can be applied, 
the high computational cost (and the complicated implementation) still impedes its application to large-scale deep learning problems.
See Section~\ref{sec:intro} for a discussion of other related work. % on OPT-in-Pareto. 
%
%\red{Another solution is to view OPT-in-Pareto is to view it as a constrained optimization of minimizing $F(\cc)$, subject to $g(\cc) =0$. However, this again requires Hessian information, and can not differentiation whether we want to minimize or maximize $\L$.}\qq{I actually think this is a promising direction...}
%to reduce the computational cost, %computing Hessian 
%it is still quite expensive in deep learning.

%\red{Opt-in-Pareto can be viewed as a special type of bi-level optimization and has been studied in operation research. %In comparison to its development in the deep learning area, OPT-in-Pareto has been well studied in operation research. 
%However, to the best of our knowledge, most of the existing works  consider a restrictive model assumption such as linearity and the developed algorithms heavily rely on such property, making it hard to generalize to the non-linear, non-convex, and large scale problems in the deep learning application. Examples include \citet{ecker1994optimizing,jorge2005bilinear,thach2014problems,liu2018primal,sadeghi2021solving} (just to name a few). We refer readers to \citet{dempe2018bilevel} for more detailed literature review.} 

\iffalse 
\subsection{Instantiations of OPT-in-Pareto}
OPT-in-Pareto gives an abstraction of the practical scenario that people want to find special models in $\P$ based on certain criteria. Here we discuss two instantiations and refer readers to Appendix {\color{red}XX} for more cases.

\textbf{Singleton Preference}
A natural application of \eqref{equ: main_problem} is to quantify the preference of $\th$ based on loss-related criterion $F$. The choice of $F$ can be flexible. For example, given any template vector $v$ and we want to find $\th \in \P$ such that its performance (i.e., $\L$) is close to $v$, we may choose $F(\th) = ||\L(\th) - v||^2$.

Another extension is to find some model $\th \in \P$ such that it satisfies a certain constraint $\Omega$ (i.e., we aim to find $\theta\in \Omega\cap\P$). This problem can be easily converted to OPT-in-Pareto by designing a $F$ such that any $\th\in\Omega$ is a global minimizer of $F$. A special case that has been considered in previous work \citep{mahapatra2020multi} is a loss ratio constraint $\Omega=\{\th:r_{1}\ell_{1}(\th)=r_{2}\ell_{2}(\th)=...=r_{m}\ell_{m}(\th)\}$, given some preference vectors $r$. Formulating the problem into OPT-in-Pareto, we can choose the $F$ to be the non-uniformity score used in \citet{mahapatra2020multi} that measures the violation of loss ratio as criterion $F$.
% \begin{align} \label{equ: nuf}
% \min_{\th\in P}
% F_\text{NU}
% (\th), &&\text{where}&& \red{F_\text{NU}(\ensuremath{\th})}=\sum_{t=1}^{m}\hat{\ell}_{t}(\th)\log(\frac{\hat{\ell}_{t}(\th)}{1/m}),
% \end{align}

% \subsection{Pareto Subset Optimization}

% A natural application of \eqref{equ: main_problem} 
% is to find some model $\th \in P$ in the Pareto set while it satisfies certain constraint (e.g., $\theta\in \Omega$). This amounts to  optimizing on a Pareto subset $P \cup \Omega$. Pareto subset optimization can be easily converted to OPT-in-Pareto by choosing 

% Special cases of this has been considered in previous works. 
% For example, \citet{mahapatra2020multi} considered a case when we want to learn $\th\in P$ such that its performance on different tasks follows a user-specified ratio, i.e., $\Omega=\{\th:r_{1}\ell_{1}(\th)=r_{2}\ell_{2}(\th)=...=r_{m}\ell_{m}(\th)\}$, given some preference vectors $r$. The algorithm of  \citet{mahapatra2020multi}, called 
% exact Pareto optimization (EPO), 
% solves the problem using a series constraint optimization to learn the updating direction for each iteration in difference phase. When putting into our framework, this problem can be viewed as choosing a preference criterion $F$ as follows: 
% \qq{What does $\text{NU}$ stands for? -- we do not need to follow the notation of EPO paper.}
% \begin{align} \label{equ: nuf}
% \min_{\th\in P}%\text{NU}
% \red{F_\text{NU}}
% (\th), &&\text{where}&& \red{F_\text{NU}(\ensuremath{\th})}=\sum_{t=1}^{m}\hat{\ell}_{t}(\th)\log(\frac{\hat{\ell}_{t}(\th)}{1/m}),
% \end{align}
% \qq{but this is not a constrained optimization as the title promises.} 
% \qq{why do we emphasize it as "constraint"? Could we rename this section as optimizing a "Singleton Preference"?}
% where $\hat{\ell}_{t}$ is the weighted normalization
% $
% \hat{\ell}_{t}(\th)=\frac{r_{t}\ell_{t}(\th)}{\sum_{t'=1}^{m}r_{t'}\ell_{t'}(\th)}.
% $ We find that our simple updating algorithm proposed in Section \ref{sec: algo} is able to recover the functionality of EPO in practice.
% \qq{once again I feel we are getting too much into EPO}
% \qq{I suggest to give a few examples by ourself, and only mention EPO as a related work.}
% \qq{
% %So we can see 
% Title: Optimizing Singleton Preference on Pareto Set. \\
% %
% We first consider the case when the preference $F$ is defined on a single model.  
% There are many different ways to define $F.$ 
% %
% For example, we may have a reference point and we want to minimize the distance, [show equation], we can also have a prefered ratio $r$, and in this case we want to define $F$ in this way [show equation]; the constrained version of this is studied in xx, known as EPO.}
% %this, btw, is related to EPO, which instead formulate the problem as a constrained optimization.}

% While EPO \citep{mahapatra2020multi} is theoretically sound and empirically successful, the derivation of EPO  uses several special property of the preference constraint and thus can't be generalized to solve the problem with a general $\Omega$. In comparison, our algorithm can be applied to solve this generalized problem once we can design a differentiable $F$ such that any $\th\in\Omega$ is the global minimizer of $F$.


\textbf{Pareto Set Approximation with Energy Distance}\label{sec:energy}
If the users' preference is unknown during the training time, we may want to return a set of models that sufficiently cover the different areas of the Pareto set, so that the users can always pick up a model that fits their needs regardless of their preference. This problem can be formulated into OPT-in-Pareto by defining $F$ to measure the diversity between $\{\theta_i\}$ so that we can learn a set of models that are well distributed on the manifold of the Pareto front.
%We study the problem of Pareto stationary set approximation, i.e., approximating the Pareto stationary set with finite number of models. The key task for this approximation is to learn a set of models that are well distributed on the manifold of Pareto front.

%Inspired by the manifold discretization problem in computational
%geometry \citep{hardin2004discretizing}, we find that the well-distributedness of the models on Pareto front can be measured by the energy distance. Considering the multi-model version of OPT-in-Pareto (i.e., problem (\ref{equ: main_problem_multi})), suppose that we use $\hat{P} = \{\th_1,...,\th_N\}$ containing $N$ models to approximate $P$,
Specifically, given a set of models $\{\th_1,...,\th_N\}$, we propose to consider a measure of diversity as follows 
%we propose to use the following energy distance to measure whether the models in $\hat{P}$ are well distributed
\begin{eqnarray} \label{eqn: energy}
{F_{\text{ED}}(\theta_1, \ldots, \theta_N )}:={\sum}_{i\neq j}
%{\th_{1},\th_{2}\in\hat{P},\th_{1}\neq\th_{2}}
\left\Vert \L(\th_{i})-\L(\th_{j})\right\Vert ^{-2}.
\end{eqnarray}
This function is known as a type of energy distance in 
computational geometry \citep{hardin2004discretizing}. The losses $\{\L(\th_i)\}$ produced by the minimizer $\{\theta_i\}$ follow a uniform distribution on the Pareto front and thus give a good approximation. See \citet{hardin2004discretizing} for asymptotic results. 
Previous attempts \citep{lin2019pareto,mahapatra2020multi,ma2020efficient,deist2021multi} on approximating Pareto set requires prior knowledge of the Pareto front in order to handcraft some heuristic preference rules that are based on indirect measures of the approximation quality. In comparison, the energy distance in \eqref{eqn: energy} is essentially a loss that directly measures the approximation quality and hence introduces a prior-knowledge-free approach with guaranteed optimality. {\color{red} we give more analysis in Appendix xxx}.
\fi 

% Previous attempts such as \citet{lin2019pareto,mahapatra2020multi}\qq{no way, these are the only papers we have cited upto page 5} approximates the Pareto stationary set using a set of models with different task preference vectors, which requires a strong prior knowledge of the Pareto front in order to handcraft a good design of the preference. We defer the detailed literature review to Section \ref{sec: review}.\qq{delete this paragraph}

% Inspired by the manifold discretization problem in computational
% geometry \citep{hardin2004discretizing}, we find that the well-distributedness of the models on Pareto front can be measured by the energy distance. Considering the multi-model version of OPT-in-Pareto (i.e., problem (\ref{equ: main_problem_multi})), suppose that we use $\hat{P} = \{\th_1,...,\th_N\}$ containing $N$ models to approximate $P$,
% we propose to use the following energy distance to measure whether the models in $\hat{P}$ are well distributed \qq{do we need to introduce $\hat P$? better to follow the notation in Eq 3.}
% \begin{eqnarray} \label{eqn: energy}
% \text{ED}(\hat{P}):=\sum_{\th_{1},\th_{2}\in\hat{P},\th_{1}\neq\th_{2}}\left\Vert \L(\th_{1})-\L(\th_{2})\right\Vert ^{-2}.
% \end{eqnarray}
% With the use of energy distance function, the models in $\hat{P}$ automatically finds their optimal locations and thus gives good approximation of $P$ without the requirement of knowing any prior knowledge of the Pareto front. We refer reader to \citet{hardin2004discretizing} for some nice asymptotic result of the approximation ability with the use of energy distance.
