\section{Introduction}\label{sec:intro}
Although machine learning tasks are traditionally framed as optimizing a single objective, many modern applications, especially in areas such as multitask learning, require finding model parameters that minimize multiple objectives (or tasks) simultaneously. Because the different objective functions may conflict with each other, optimality in multi-objective optimization (MOO) is characterized by the Pareto set: the set of model parameters whose performance on all tasks cannot be jointly improved.
%\qq{MTL is only an application area of MOO.
 %are 
%Multitask learning (MTL) \citep{caruana1997multitask} aims to use one model to solve multiple related tasks simultaneously. In deep learning, by sharing layers, MTL is an effective way to reduce the overall computational cost and has the potential to improve each task. Different from standard single-task learning, in MTL, the multiple tasks might conflict with each other, i.e., training the model to perform well on one task might degenerate its performance on the others \citep{NEURIPS2018_432aca3a,yu2020gradient}. As a consequence of the task conflict, the optimal solution to MTL is characterized by the Pareto set: a set containing the models whose performance of all tasks cannot be improved jointly.
%\qq{MTL is only an application area of MOO. should not we start with MOO, an import application of which is MTL?}
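As a sketch of this notion in symbols (the notation $\theta$, $\theta'$, $\prec$ and $\mathcal{P}^*$ is assumed here for illustration, and the formal definition used later may differ in details, e.g., by being local rather than global), say that $\theta'$ dominates $\theta$, written $\theta' \prec \theta$, if
\[
\ell_i(\theta') \le \ell_i(\theta) ~~\mbox{for all}~ i \in \{1,\ldots,m\}
\quad\mbox{and}\quad
\ell_j(\theta') < \ell_j(\theta) ~~\mbox{for some}~ j \in \{1,\ldots,m\}.
\]
The Pareto set can then be written as
\[
\mathcal{P}^* = \{\theta \,:\, \mbox{there is no}~ \theta' ~\mbox{with}~ \theta' \prec \theta\}.
\]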

Focusing on the Pareto set allows us to filter out models that can be strictly improved. However, the Pareto set typically contains infinitely many parameters that represent different trade-offs among the objectives.
For $m$ objectives $\ell_1,\ldots, \ell_m$,
the Pareto set is often an $(m-1)$-dimensional manifold.
It is both intractable and unnecessary to give practical users the whole exact Pareto set. A more practical demand is to find some user-specified special parameters in the Pareto set, which can be framed as the following \emph{optimization in Pareto set (OPT-in-Pareto)} problem:

\emph{Finding one or a set of parameters inside the Pareto set of $\ell_1,\ldots, \ell_m$ that minimize a reference criterion $F$.}
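In symbols, and only as a sketch under notation assumed here for illustration (with $\theta$ denoting the model parameters and $\mathcal{P}^*$ the Pareto set of $\ell_1,\ldots,\ell_m$), the single-parameter version of this problem can be read as the constrained problem
\[
\min_{\theta} ~ F(\theta) \quad \mbox{subject to} \quad \theta \in \mathcal{P}^*,
\]
where the constraint set $\mathcal{P}^*$ is itself defined through the objectives $\ell_1,\ldots,\ell_m$ rather than given explicitly.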

Here the criterion function $F$ can be used to encode 
an \emph{informative} 
user-specific preference on the objectives $\ell_1,\ldots, \ell_m$, which allows us to provide the best models customized for different users. $F$ can also be a \emph{non-informative} measure
that encourages, for example, the diversity of a set of model parameters. In this case, optimizing $F$ in the Pareto set gives a set of diversified Pareto models that are representative of the whole Pareto set, from which different users can pick their favorite models at test time.

OPT-in-Pareto provides a highly generic and actionable framework for multi-objective learning and optimization. However, efficient algorithms for solving OPT-in-Pareto have lagged behind in deep learning, where the objective functions are non-convex and non-linear. Although it has not been formally studied, a straightforward approach is to apply manifold gradient descent on $F$ over the Riemannian manifold formed by the Pareto set \citep{hillermeier2001generalized, bonnabel2013stochastic}. However, this incurs a prohibitive computational cost due to the need for eigen-computation on the Hessian matrices of $\{\ell_i\}$. In the optimization and operations research literature, there is a body of work on OPT-in-Pareto that views it as a special bi-level optimization problem \citep{dempe2018bilevel}. However, these works often rely heavily on linearity and convexity assumptions and are not applicable to the non-linear and non-convex problems in deep learning; see, for example,
\citet{ecker1994optimizing,jorge2005bilinear,thach2014problems,liu2018primal,sadeghi2021solving} (to name just a few). In comparison, the OPT-in-Pareto problem seems to be much less known and under-explored in the deep learning literature. The exceptions are three works \citep{mahapatra2020multi,kamani2021pareto,chen2021weighted} that propose specialized algorithms for specific instantiations of the OPT-in-Pareto problem; we defer a more detailed review to Section~\ref{sec: review}.


%In comparison to its development in the deep learning area, OPT-in-Pareto has been well studied in operation research. However, to the best of our knowledge, most of the existing works consider a restrictive model assumption such as linearity and the developed algorithms heavily rely on such property, making it hard to generalize to the non-linear, non-convex, and large scale problems in the deep learning application. Examples include \citet{ecker1994optimizing,jorge2005bilinear,thach2014problems,liu2018primal,sadeghi2021solving} (just to name a few). We refer readers to \citet{dempe2018bilevel} for more detailed literature review.


In this work, we provide a practically efficient first-order algorithm for OPT-in-Pareto that uses only gradient information of the criterion $F$ and the objectives $\{\ell_i\}$. Our method, named \emph{Pareto navigation gradient descent} ({\PNG}), iteratively updates the parameters along a direction that carefully balances the descent on $F$ and $\{\ell_i\}$, such that it is guaranteed to move towards the Pareto set of $\{\ell_i\}$ when it is far away, and to optimize $F$ in a neighborhood of the Pareto set. Our method is simple, practically efficient, and has theoretical guarantees.




%In general, people seek a model in the Pareto set since any other model outside the Pareto set can be replaced by a model in the Pareto set with improved performance. On the other hand, the Pareto set contains a large (usually infinite) number of models that behave differently and in practice, 
%People might only be interested in finding one specific model in the Pareto set based on some criterion. For example, given several learned models on the Pareto set, people might be interested in exploring the Pareto set by finding one model that behaves most differently from those learned models. Another example is to find a model in Pareto set such that a certain trade-off between the performances of the tasks is achieved \citep{mahapatra2020multi,kamani2021pareto}. Those problems can be framed into a learning setting called \emph{Optimization in Pareto Set} (OPT-in-Pareto), in which we aim to find a model in the Pareto set such that a certain criterion function is minimized. 

%OPT-in-Pareto is an under-explored research area in deep learning. Most current training algorithms \citep{NEURIPS2018_432aca3a,chen2018gradnorm,kendall2018multi,yu2020gradient,NEURIPS2020_16002f7a,fifty2020measuring} of MTL in the deep learning area are only guaranteed to converge to \emph{some} model in the Pareto set \ref{sec: background} unable to find that \emph{specific} model satisfying the criterion. One exception is \citet{mahapatra2020multi,kamani2021pareto} that searches a model in the Pareto set that satisfies a loss ratio criterion, which is a special instantiation of OPT-in-Pareto. Their approach utilizes the special property of the loss ratio criterion and is not generalizable to solve a general OPT-in-Pareto problem.

% Previous attempts on searching a specific model on Pareto set is limited to the case that we want to find a model of which the performance of each task satisfies a given user preference \qq{sounds unclear, do you mean previous works only find "single" with a "special" metrics, such as ratio?..} \citep{lin2019pareto,mahapatra2020multi}. This problem can be easily abstracted into OPT-in-Pareto by choosing a non-uniformity score measuring the violation of the preference constraint as the criterion function in OPT-in-Pareto. However, on the other hand, OPT-in-Pareto is a much generalized problem compared with \citet{lin2019pareto,mahapatra2020multi} as any form of criterion objective is allowed.
% \qq{the scope of this paragraph reads small: it feels like it is just to differentiate with  \citet{lin2019pareto,mahapatra2020multi}, which feels similar as the paragraph implicitly tell.}

%Although has not been formally studied, OPT-in-Pareto can also be framed as a gradient-descent-on-manifold problem \citep{bonnabel2013stochastic} in which we constrain the parameter space within the Pareto set (viewed as a manifold on the original parameter space). However, applying a manifold gradient descent algorithm can be expensive and unscalable because we need to calculate the Hessian matrix to obtain the tangent space \citep{hillermeier2001generalized} in order to ensure the model is within the manifold during the optimization. In this paper, we propose the first algorithm named Pareto Navigated Gradient Descent (PNG) that is able to return a locally optimal solution to the OPT-in-Pareto problem only using the first-order gradient information. The local optimality is characterized by the Karush–Kuhn–Tucker condition \citep{gordon2012karush} of a local constraint optimization sub-problem.

%{%\color{red}
In empirical studies,
we demonstrate that our method
works efficiently for optimizing both user-specific
criteria and diversity measures. % for finding representative solutions that well cover the whole Pareto set. 
%distribute unform%for providing %can be used to %Our application layer contains three modules. We first verify that PNG is able to solve OPT-in-Pareto given general criterion objectives. We then generalize OPT-in-Pareto into a multi-model interactive system to solve the problem of Pareto set approximation. 
In particular, for finding representative Pareto solutions, 
we propose an energy distance criterion
%to measure the well-distributedness 
%which encourages 
whose minimizers are asymptotically uniformly distributed
on the Pareto set
\citep{hardin2004discretizing}, 
yielding a principled and efficient Pareto set approximation method that compares favorably with recent works such as
\citet{lin2019pareto,mahapatra2020multi}. %drawn from the computational geometry liter%of modelsand we show that with energy distance as criterion objectives, we are able to use {\PNG} to learn a set of diversified models that well 
We also apply {\PNG} to improve the performance of JiGen \citep{carlucci2019domain}, a multi-task learning approach for domain generalization, by using the adversarial feature discrepancy as the criterion objective. 



