\begin{figure*}[t]
\begin{centering}
\vspace{-0.1cm}
\hspace{-10mm}
\includegraphics[scale=0.35]{fig/recover/EPO_recover.pdf}\hspace{-5mm}
\includegraphics[scale=0.35]{fig/recover/NPOalpha_0.25thre_0.01_recover.pdf}\hspace{-5mm}
\includegraphics[scale=0.35]{fig/general/templatealpha_0.5thre_0.1_NPO.pdf}\hspace{-5mm}
\includegraphics[scale=0.35]{fig/general/complex cosalpha_0.5thre_0.1_NPO.pdf}\hspace{-10mm}
\par\end{centering}
\vspace{-0.4cm}
%\vspace{-1\baselineskip}
\begin{tabular}{cccc}
\hspace{.1\textwidth}
\small (a) \hspace{.19\textwidth} & 
\small (b) \hspace{.175\textwidth} & 
\small (c)  \hspace{.175\textwidth} & 
\small (d)  \hspace{.19\textwidth}
\end{tabular}
\vspace{-1.2\baselineskip}
\caption{(a)-(b): trajectories of EPO and PNG when finding Pareto models that satisfy different ratio constraints (shown in different colors) on the two objectives $\ell_1,\ell_2$; 
PNG achieves the same goal as EPO (along different trajectories) while being a more general approach. 
(c)-(d): trajectories of PNG when finding Pareto models that minimize the weighted distance and complex cosine criteria. The green dots indicate the converged models. PNG successfully locates the Pareto models that minimize the different criteria.}\label{fig: epo recover}
\vspace{-0.5cm}
\end{figure*}

\section{Empirical Results} 
We introduce three applications of OPT-in-Pareto with {\PNG}: singleton preference, Pareto approximation, and improving a multi-task based domain generalization method. We also study how the learning dynamics of {\PNG} change with the hyper-parameters ($\alpha_t$ and $\ep$); these results are included in Appendix \ref{appendix_sec: dynamics}. Additional results related to the experiments in Sections \ref{sec: subset application} and \ref{sec: exp approx} are deferred to the Appendix and are introduced in the corresponding sections. Code is available at \url{https://github.com/lushleaf/ParetoNaviGrad}.
\subsection{Finding Preferred Pareto Models} \label{sec: subset application}
We consider the synthetic example used in \citet{lin2019pareto, mahapatra2020multi}, which consists of two losses:
$\ell_{1}(\th)=1-\exp(-\left\Vert \th-\eta \right\Vert ^{2})$ and $\ell_{2}(\th)=1-\exp(-\left\Vert \th+\eta \right\Vert ^{2})$, where
$\eta = \dimcc^{-1/2}$ and 
$\dimcc=10$ is the dimension of the parameter $\th$. 
%$\ell_{1}(\th)=1-\exp(-\left\Vert \th-\dimcc^{-1/2}\right\Vert ^{2})$ and $\ell_{2}(\th)=1-\exp(-\left\Vert \th+\dimcc^{-1/2}\right\Vert ^{2})$, where $\dimcc=10$ is dimension of the parameter.
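To make the geometry of this example explicit, we add a short derivation; the all-ones vector $\mathbf{1}$ and the scalar $t$ are notation introduced here for illustration. Reading $\th\pm\eta$ coordinate-wise, the two losses are minimized at $\th=\eta\mathbf{1}$ and $\th=-\eta\mathbf{1}$, respectively, and $\left\Vert \eta\mathbf{1}\right\Vert ^{2}=\dimcc\cdot\dimcc^{-1}=1$. Since each loss is an increasing function of the distance to its own minimizer, the Pareto set is the segment joining the two minimizers, on which
\[
\th(t)=t\,\eta\mathbf{1},\qquad
\ell_{1}(\th(t))=1-\exp\left(-(1-t)^{2}\right),\qquad
\ell_{2}(\th(t))=1-\exp\left(-(1+t)^{2}\right),\qquad t\in[-1,1].
\]
Moving along the segment therefore trades one loss against the other monotonically, which makes it easy to check by hand where a given criterion should place the solution.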

\paragraph{Ratio-based Criterion}
We first show that {\PNG} can solve the search problem under the ratio constraint on the objectives in \citet{mahapatra2020multi}, i.e., finding a point $\th\in \P^*\cap \Omega$ with $\Omega=\{\th:r_{1}\ell_{1}(\th)=r_{2}\ell_{2}(\th)=\cdots=r_{m}\ell_{m}(\th)\}$, given some preference vector $r=[r_1,\ldots,r_m]$. We apply {\PNG} with the non-uniformity score defined in \citet{mahapatra2020multi} as the criterion, and compare with their algorithm, exact Pareto optimization (EPO).  
Figure \ref{fig: epo recover}(a)-(b) shows the trajectories of EPO and {\PNG} 
 when searching for models under different preference vectors $r$, starting from the same randomly initialized point. 
 %for searching models with different preference vector $r$, starting from the same random initialized point (an algorithm for solving this specific ratio-based preference OPT-in-Pareto problem \citep{mahapatra2020multi}). %
Both {\PNG} and EPO converge to the correct solutions, but along different trajectories. {\color{black} This suggests that PNG can find ratio-constrained Pareto models, as the methods of \citet{mahapatra2020multi,kamani2021pareto} do, while also being versatile enough to handle general criteria.} We refer readers to Appendix \ref{appendix_sec: ratio-based preference} for more results with different choices of hyper-parameters and for the experiment details.
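As a concrete instance on the synthetic example (a sketch; the scalar $t$ is notation introduced here for illustration), take $m=2$ and restrict attention to the segment $\th=t\,\eta\mathbf{1}$, $t\in[-1,1]$, joining the minimizers of the two losses, where $\mathbf{1}$ is the all-ones vector and $\ell_{1}=1-e^{-(1-t)^{2}}$, $\ell_{2}=1-e^{-(1+t)^{2}}$. The ratio constraint then reduces to the scalar equation
\[
r_{1}\left(1-e^{-(1-t)^{2}}\right)=r_{2}\left(1-e^{-(1+t)^{2}}\right).
\]
For $r_1,r_2>0$, as $t$ goes from $-1$ to $1$ the left-hand side decreases from $r_{1}(1-e^{-4})$ to $0$ while the right-hand side increases from $0$ to $r_{2}(1-e^{-4})$, so the equation has exactly one solution on the segment. For $r_1=r_2$ the solution is $t=0$, i.e., $\th=0$ with $\ell_1=\ell_2=1-e^{-1}\approx 0.632$.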

\paragraph{Other Criteria} 
We demonstrate that {\PNG} is able to find solutions for general choices of $F$. We consider the following designs of $F$:
1) weighted $\ell_2$ distance w.r.t. a reference vector $r\vv\in \RRplus^m$, that is, 
%in which we want a model in $\P^*$ such that its performance on tasks is close to a given reference vector in weighted $\ell_2$ distance, i.e.,
$F_{\text{wd}}(\th)=\sum_{i=1}^m (\ell_{i}(\th)-r_i)^{2}/r_i$; 
and 2) complex cosine, in which $F$ is a complicated function involving the cosines of the task objectives, i.e., $F_{\text{cs}}(\th)=-\cos\left(\pi(\ell_{1}(\th)-r_1)/2\right) +\left(\cos\left(\pi(\ell_{2}(\th)-r_2)\right) + 1\right)^2$. {\color{black} Here, minimizing the weighted $\ell_2$ distance amounts to finding a Pareto model whose losses are close to a target value $r$, which offers an alternative way to partition the Pareto set. The complex cosine criterion is designed to test whether PNG can handle a highly non-linear criterion function.} In both cases, we take 
%$r_=[\eta,1-\eta]$ with 
$r_1 \in \{0.2, 0.4, 0.6, 0.8\}$ and $r_2 = 1-r_1$.  
Figure \ref{fig: epo recover}(c)-(d) shows 
the trajectories of {\PNG}. As we can see, {\PNG} correctly finds the optimal solutions of OPT-in-Pareto. We also test {\PNG} on a more challenging ZDT2-variant used in \citet{ma2020efficient} and on a larger-scale MTL problem \citep{liu2019end}, for which we refer readers to Appendices \ref{appendix_sec: zdt} and \ref{appendix_sec: mtan}. 
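Both criteria depend on $\th$ only through the losses, so the gradient of $F$ with respect to $\th$ follows from the chain rule. The short derivation below is added for illustration and takes $\ell_{2}(\th)$ as the argument of the second cosine term:
\[
\nabla_{\th}F(\th)=\sum_{i=1}^{m}\frac{\partial F}{\partial\ell_{i}}\,\nabla_{\th}\ell_{i}(\th),
\qquad
\frac{\partial F_{\mathrm{wd}}}{\partial\ell_{i}}=\frac{2\left(\ell_{i}(\th)-r_{i}\right)}{r_{i}},
\]
\[
\frac{\partial F_{\mathrm{cs}}}{\partial\ell_{1}}=\frac{\pi}{2}\sin\left(\frac{\pi(\ell_{1}(\th)-r_{1})}{2}\right),
\qquad
\frac{\partial F_{\mathrm{cs}}}{\partial\ell_{2}}=-2\pi\left(\cos\left(\pi(\ell_{2}(\th)-r_{2})\right)+1\right)\sin\left(\pi(\ell_{2}(\th)-r_{2})\right).
\]
The partial derivatives of $F_{\mathrm{wd}}$ are linear in the losses, whereas those of $F_{\mathrm{cs}}$ oscillate and change sign, which is what makes the second criterion a harder test case.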

\begin{table*}[t]
\begin{centering}
\vspace{-0.3cm}
\scalebox{1.0}{
\begin{tabular}{c|c|cc|cc}
\toprule
\multirow{2}{*}{Data} & \multirow{2}{*}{Method} & \multicolumn{2}{c|}{Loss} & \multicolumn{2}{c}{Acc}\tabularnewline
 &  & HV$\uparrow$ ($10^{-2}$) & IGD+$\downarrow$ ($10^{-2}$) & HV$\uparrow$ ($10^{-2}$) & IGD+$\downarrow$ ($10^{-2}$)\tabularnewline
\hline 
\multirow{4}{*}{Multi-MNIST} & Linear & $7.48\pm0.11$ & $0.142\pm0.034$ & $9.27\pm0.024$ & $0.036\pm0.0084$\tabularnewline
 & MGD & $7.69\pm0.10$ & $0.051\pm0.011$ & $9.27\pm0.023$ & $0.008\pm0.0010$\tabularnewline
 & EPO & $\pmb{7.87\ensuremath{\pm}0.16}$ & $0.069\pm0.028$ & $9.17\pm0.032$ & $0.065\pm0.0181$\tabularnewline
 & {\PNG} & $\pmb{7.86\ensuremath{\pm}0.11}$ & $\pmb{0.042\ensuremath{\pm}0.012}$ & $\pmb{9.39\ensuremath{\pm}0.036}$ & $\pmb{0.006\ensuremath{\pm}0.0022}$\tabularnewline
\hline 
\multirow{4}{*}{Multi-Fashion} & Linear & $0.38\pm0.059$ & $0.127\pm0.013$ & $4.76\pm0.019$ & $0.064\pm0.012$\tabularnewline
 & MGD & $0.42\pm0.064$ & $0.046\pm0.016$ & $4.77\pm0.019$ & $\pmb{0.023\ensuremath{\pm}0.003}$\tabularnewline
 & EPO & $0.36\pm0.058$ & $0.308\pm0.109$ & $4.78\pm0.030$ & $0.211\pm0.020$\tabularnewline
 & {\PNG} & $\pmb{0.47\ensuremath{\pm}0.066}$ & $\pmb{0.016\ensuremath{\pm}0.002}$ & $\pmb{4.81\ensuremath{\pm}0.021}$ & $\pmb{0.023\ensuremath{\pm}0.003}$\tabularnewline
\hline 
\multirow{4}{*}{Fashion-MNIST} & Linear & $5.01\pm0.057$ & $0.167\pm0.054$ & $8.46\pm0.046$ & $0.110\pm0.035$\tabularnewline
 & MGD & $5.09\pm0.069$ & $0.060\pm0.029$ & $8.40\pm0.045$ & $\pmb{0.049\ensuremath{\pm}0.011}$\tabularnewline
 & EPO & $4.60\pm0.166$ & $0.233\pm0.054$ & $8.12\pm0.041$ & $0.385\pm0.077$\tabularnewline
 & {\PNG} & $\pmb{5.27\ensuremath{\pm}0.054}$ & $\pmb{0.048\ensuremath{\pm}0.027}$ & $\pmb{8.53\ensuremath{\pm}0.047}$ & $\pmb{0.046\ensuremath{\pm}0.022}$\tabularnewline
\bottomrule
\end{tabular}
}
\par\end{centering}
\caption{
Results of approximating the Pareto set 
with different methods on three MNIST benchmark datasets. The numbers in the table are the mean and standard deviation over 5 independent trials. 
Bolded values indicate the statistically significant best result, with p-value less than 0.05 based on a matched-pair t-test.} \label{tbl: mnist}
\end{table*}

\begin{table*}
\begin{centering}
\scalebox{1.0}{
\begin{tabular}{c|cccc|c}
\toprule
PACS & art paint & cartoon & sketches & photo & Avg\tabularnewline
\hline 
D-SAM & $0.7733$ & $0.7243$ & $0.7783$ & $0.9530$ & $0.8072$\tabularnewline
DeepAll & $0.7785$ & $0.7486$ & $0.6774$ & $0.9573$ & $0.7905$\tabularnewline
\hline 
JiGen & $0.8009\pm0.004$ & $0.7363\pm0.007$ & $0.7046\pm0.013$ & $\pmb{0.9629\ensuremath{\pm}0.002}$ & $0.8012\pm0.002$\tabularnewline
JiGen+adv & $0.7923\pm0.006$ & $0.7402\pm0.004$ & $0.7188\pm0.005$ & $0.9617\pm0.001$ & $0.8033\pm0.001$\tabularnewline
JiGen+PNG & $\pmb{0.8014\ensuremath{\pm}0.005}$ & $\pmb{0.7538\ensuremath{\pm}0.001}$ & $\pmb{0.7222\ensuremath{\pm}0.006}$ & $\pmb{0.9627\ensuremath{\pm}0.002}$ & $\pmb{0.8100\ensuremath{\pm}0.005}$\tabularnewline
\bottomrule 
\end{tabular}
}
\par\end{centering}
\caption{Comparison of different methods for domain generalization on PACS using ResNet-18. The values in the table are the test accuracies with their standard deviations. Bolded values indicate the best results, with p-value less than $0.1$ based on a matched-pair t-test.}\label{tbl: domain_small}
\vspace{-0.3cm}
\end{table*}

\subsection{Finding Diverse Pareto Models} \label{sec: exp approx}


\begin{figure}
\begin{centering}
\includegraphics[scale=0.13]{fig/parato_app.pdf}
\end{centering}
\vspace{-0.3cm}
\caption{Evolution of models from different initializations. The upper row starts with models at the boundary of the Pareto set. The lower row starts from clustered initializations.} \label{fig: engergy_toy}
\vspace{-0.5cm}
\end{figure}

\paragraph{Synthetic Examples}
We reuse the synthetic example introduced in Section \ref{sec: subset application}. We consider learning 5 models to approximate the Pareto front, starting from two types of extremely bad initializations. Specifically, in the upper row of Figure \ref{fig: engergy_toy}, we initialize the models using linear scalarization. Due to the concavity of the Pareto front, linear scalarization can only learn models at the two extreme ends of the Pareto front. The lower row uses MGD for initialization, and the models are clustered in a small region of the Pareto front. Unlike the algorithm proposed by \citet{lin2019pareto}, which relies on a good initialization, PNG with the proposed energy distance function pushes the models to be evenly distributed on the Pareto front without any prior information about the front, even from extremely bad starting points.

\paragraph{Multi-MNIST Benchmark} We consider the problem of finding diversified points from the Pareto set by minimizing the energy distance criterion in \eqref{eqn: energy}. 
We use the same setting as \citet{lin2019pareto,mahapatra2020multi}. We consider three benchmark datasets: (1) MultiMNIST, (2) MultiFashion, and (3) MultiFashion+MNIST. For each dataset, there are two tasks (classifying the top-left and bottom-right images). We use a multi-head LeNet and train $N=5$ models to approximate the Pareto set. For baselines, we compare with 
linear scalarization, MGD \citep{NEURIPS2018_432aca3a}, and EPO \citep{mahapatra2020multi}. % and the proposed {\PNG} with energy function in \eqref{eqn: energy}.
For the MGD baseline, we find that naively running it leads to poor performance because the learned models are not diversified. We therefore initialize MGD with 60 epochs of linear scalarization using equally distributed preference weights and then run MGD for the subsequent 40 epochs. We refer the reader to Appendix \ref{appendix_sec: pareto approximation exp} for more details of the experiments.

We measure how well the found models $\{\theta_1,\ldots,\theta_N\}$ approximate the Pareto set using two standard metrics: Inverted Generational Distance Plus (IGD+) \citep{ishibuchi2015modified} and hypervolume (HV) \citep{zitzler1999multiobjective}; see Appendix \ref{appendix_sec: pareto approximation metric} for their definitions. We run each method for 5 independent trials and report the mean and standard deviation in Table \ref{tbl: mnist}. We report the scores calculated from the loss (cross-entropy) and from the accuracy on the test set. The bolded values indicate the best result with p-value less than 0.05 (using a matched-pair t-test). In most cases, {\PNG} improves over the baselines by a large margin. We include ablation studies in Appendix \ref{appendix_sec: pareto approximation abl} and additional comparisons with the second-order approach proposed by \citet{ma2020efficient} in Appendix \ref{appendix_sec: compare second order}.
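
For quick reference, the commonly used forms of the two metrics are sketched below for objectives to be minimized. The notation is introduced here for illustration only: $A$ is the set of objective vectors of the found models, $Z$ is a reference set on the Pareto front, $z^{\mathrm{ref}}$ is a reference point, and $\Lambda$ is the Lebesgue measure. Appendix \ref{appendix_sec: pareto approximation metric} gives the exact definitions and reference choices used in the experiments.
\[
\mathrm{IGD}^{+}(A)=\frac{1}{|Z|}\sum_{z\in Z}\min_{a\in A}\Bigl(\sum_{i=1}^{m}\max(a_{i}-z_{i},0)^{2}\Bigr)^{1/2},
\qquad
\mathrm{HV}(A)=\Lambda\Bigl(\bigcup_{a\in A}\{q: a\preceq q\preceq z^{\mathrm{ref}}\}\Bigr).
\]
A smaller IGD+ means that every reference point has a nearby found model that is not worse in any objective by much, while a larger HV means that the found models dominate a larger region below the reference point.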

\subsection{Application to Multi-task based Domain Generalization Algorithm}
JiGen \citep{carlucci2019domain} learns a domain-generalizable model by learning two tasks with linear scalarization, which essentially searches for a model in the Pareto set and requires choosing the scalarization weight carefully. It is thus natural to study whether there is a better mechanism that dynamically adjusts the weights of the two losses so that we eventually learn a better model. Motivated by 
adversarial feature learning \citep{JMLR:v17:15-239}, we propose to improve JiGen such that the latent feature representations of the two tasks are well aligned. This can be framed as an OPT-in-Pareto problem in which the criterion is the discrepancy between the latent representations of the two tasks (implemented using an adversarial discrepancy module in the network). PNG is applied to solve the optimization. We evaluate the methods on PACS \citep{Li_2017_ICCV}, which covers 7 object categories and 4 domains (Photo, Art Paintings, Cartoon, and Sketches). The model is trained on three domains and tested on the remaining one. Our approach is denoted as JiGen+PNG. We also include JiGen+adv, which simply adds the adversarial loss as a regularizer, and two other baseline methods (D-SAM \citep{d2018domain} and DeepAll \citep{carlucci2019domain}). For the three JiGen-based approaches, we run 3 independent trials; for the other two baselines, we report the results from their original papers. Table \ref{tbl: domain_small} shows the results using ResNet-18, which demonstrate 
the improvement obtained by applying the OPT-in-Pareto framework. We also include the results using AlexNet in the Appendix. Please see Appendix \ref{appendix_sec: dg} for the additional results and more experiment details. 


