
\section{Practical Implementation} \label{sec: practice}
\paragraph{Hyper-parameters} 
Our algorithm introduces two hyper-parameters, $\{\alpha_t\}$ and $\ep$, on top of vanilla gradient descent. We use a constant sequence $\alpha_t=\alpha$ with $\alpha = 0.5$ unless otherwise specified. 
We set $\ep = \gamma \ep_0$,
where $\ep_0$ is an exponentially discounted average of
$\frac{1}{m}\sum_{\i=1}^{m}\left\Vert \nabla\ell_\i(\cc_t)\right\Vert^2$ over the trajectory, so that $\ep$ automatically scales with the magnitude of the gradients of the problem at hand. In the experiments of this paper, we fix $\gamma=0.1$ unless otherwise specified.
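
As a sketch, the running average can be written explicitly as follows, where the discount factor $\beta\in(0,1)$ and the time-indexed symbols $\ep_{0,t}$ and $\ep_t$ are notation assumed here for illustration rather than taken from the algorithm description:
\[
\ep_{0,t}=\beta\,\ep_{0,t-1}+(1-\beta)\,\frac{1}{m}\sum_{\i=1}^{m}\left\Vert \nabla\ell_\i(\cc_t)\right\Vert^2,\qquad \ep_t=\gamma\,\ep_{0,t}.
\]
For instance, with a hypothetical $\beta=0.9$, a previous average $\ep_{0,t-1}=2.0$, and a current mean squared gradient norm of $1.0$, we get $\ep_{0,t}=0.9\times2.0+0.1\times1.0=1.9$ and, with $\gamma=0.1$, $\ep_t=0.19$.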

\paragraph{Solving the Dual Problem}
Our method requires computing 
 $\{\lambda_{i,t}\}_{t=1}^m$ by solving the dual optimization problem in \eqref{equ: dual}, which can be done with any off-the-shelf convex quadratic programming tool. 
In this work, we use a simple 
projected gradient descent 
to approximately solve \eqref{equ: dual}.  
We initialize $\{\lambda_{i,t}\}_{t=1}^m$ to the zero vector and terminate when the difference between the last two iterates is smaller than a threshold or the algorithm reaches the maximum number of iterations (we use 100 in all experiments). 
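
For concreteness, a generic projected gradient iteration of this kind can be sketched as below. Here $g$ denotes the objective of \eqref{equ: dual}, $\Lambda$ its feasible set, $\Pi_{\Lambda}$ the Euclidean projection onto $\Lambda$, $\eta>0$ a step size, and $\tau>0$ the stopping threshold; all of these symbols are notation assumed for this sketch.
\[
\lambda^{(0)}=0,\qquad \lambda^{(k+1)}=\Pi_{\Lambda}\left(\lambda^{(k)}-\eta\nabla g(\lambda^{(k)})\right),\qquad \text{stop when } \left\Vert \lambda^{(k+1)}-\lambda^{(k)}\right\Vert \le\tau \text{ or } k+1=100.
\]
If, for example, $\Lambda$ were the nonnegative orthant, the projection would reduce to an elementwise $\max(\cdot,0)$.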

\section{Experiments}
\subsection{Finding Preferred Pareto Models}
\subsubsection{Ratio-based Criterion} \label{appendix_sec: ratio-based preference}
The non-uniformity score of \citet{mahapatra2020multi} that we 
use in Figure~\ref{fig: epo recover} is defined as 
%for the criterion function $F$ is defined as
\begin{align} \label{equ: nuf}
%\min_{\th\in \P}
%{F_\text{NU}}
%(\th), \ \ \text{where}\ \ 
{F_\text{NU}(\ensuremath{\th})}=\sum_{t=1}^{m}\hat{\ell}_{t}(\th)\log\left (\frac{\hat{\ell}_{t}(\th)}{1/m}\right ),~~~~~\ \ \hat{\ell}_{t}(\th)=\frac{r_{t}\ell_{t}(\th)}{\sum_{s\in[m]}r_{s}\ell_{s}(\th)}. 
\end{align}
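
As a quick sanity check of \eqref{equ: nuf}, note that for nonnegative weighted losses the normalized values $\hat{\ell}_{t}(\th)$ sum to one, so $F_\text{NU}$ is the KL divergence between $\hat{\ell}(\th)$ and the uniform distribution and can be rewritten as
\[
F_\text{NU}(\th)=\log m+\sum_{t=1}^{m}\hat{\ell}_{t}(\th)\log\hat{\ell}_{t}(\th)\ \ge\ 0,
\]
with equality exactly when all $r_{t}\ell_{t}(\th)$ are equal. As an illustrative example with made-up numbers, take $m=2$, $r_{1}\ell_{1}(\th)=0.3$ and $r_{2}\ell_{2}(\th)=0.1$. Then $\hat{\ell}(\th)=(0.75,0.25)$ and, using the natural logarithm,
\[
F_\text{NU}(\th)=0.75\log(1.5)+0.25\log(0.5)\approx0.304-0.173=0.131.
\]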

We keep all other experiment settings the same as in \citet{mahapatra2020multi} and use $\gamma=0.01$ and $\alpha=0.25$ for the experiment reported in the main text. We defer the ablation studies on the hyper-parameters $\alpha$ and $\gamma$ to Section \ref{appendix_sec: dynamics}.

\subsubsection{ZDT2-Variant} \label{appendix_sec: zdt}
We consider the ZDT2-Variant example used in \citet{ma2020efficient} with the same experiment setting, in which the Pareto set is a cylindrical surface, making the problem more challenging. We consider the same criteria as in the main text, namely weighted distance and complex cosine, with different choices of $r_1 \in \{0.2, 0.4, 0.6, 0.8\}$. We use the default hyper-parameter setup, choosing $\alpha=0.5$ and $\gamma=0.1$. For complex cosine, we use MGD updates for the first 150 iterations. Figure \ref{fig: zdt_general} shows the trajectories, demonstrating that PNG works well on the more challenging ZDT2-Variant tasks.

\begin{figure}
%\begin{wrapfigure}{r}{0.6\textwidth}
\begin{centering}
\includegraphics[scale=0.4]{fig/general/zdt2v_template_PNG.pdf}
\includegraphics[scale=0.4]{fig/general/zdt2v_complex cos_PNG.pdf}
\caption{Trajectories of solving OPT-in-Pareto with weighted distance and complex cosine as criteria using PNG. The green dots are the final converged models. PNG successfully locates the correct models in the Pareto set.} \label{fig: zdt_general}
\end{centering}
\end{figure}
%\end{wrapfigure}

\subsubsection{General Criteria: Three-task learning on the NYUv2 Dataset} \label{appendix_sec: mtan}
We show that PNG is able to handle large-scale multitask learning problems by deploying it on a three-task learning problem (segmentation, depth estimation, and surface normal prediction) on the NYUv2 dataset \citep{silberman2012indoor}. The main goal of this experiment is to show that (1) PNG is able to handle OPT-in-Pareto with a large-scale neural network, and (2) with a proper design of the criterion, PNG enables targeted fine-tuning that pushes the model in a desired direction. We follow the training protocol of \citet{liu2019end} and use the MTAN network architecture. We start from a model trained with equally weighted linear scalarization, and our goal is to further improve the model's performance on segmentation and surface normal estimation while allowing some sacrifice on depth estimation. This can be achieved by many different choices of criterion; in this experiment, we consider the following design: 
$
F(\cc)= (\ell_{\text{seg}}(\cc) \times \ell_{\text{surface}}(\cc)) / (0.001 + \ell_{\text{depth}}(\cc)).
$
Here $\ell_{\text{seg}}$, $\ell_{\text{surface}}$ and $\ell_{\text{depth}}$ are the loss functions for segmentation, surface normal prediction and depth estimation, respectively. The constant 0.001 in the denominator is for numerical stability. We point out that our criterion is a simple heuristic and might not be an optimal choice; the key purpose here is to verify the functionality of the proposed PNG. As suggested by the open-source repository of \citet{liu2019end}, we reproduce the result based on the provided configuration. To show that PNG is able to move the model along the Pareto front, we show the evolution of the criterion function and the norm of the MGD gradient during training in Figure \ref{fig: mtan_traj}. As we can see, PNG effectively decreases the value of the criterion function while the norm of the MGD gradient remains the same. This demonstrates that PNG is able to minimize the criterion by searching for the model within the Pareto set. Table \ref{tbl: mtan} compares the performance on the three tasks using standard training and PNG, showing that PNG improves the model's performance on the segmentation and surface normal prediction tasks while sacrificing a bit of the performance on depth estimation, as intended by the criterion.
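
To see how this criterion trades off the three tasks, one can differentiate it directly. The following is a worked derivation under the assumption that all three losses are positive and differentiable at $\cc$; it is an illustration of the criterion, not a description of the PNG update itself:
\[
\nabla F(\cc)=F(\cc)\left(\frac{\nabla\ell_{\text{seg}}(\cc)}{\ell_{\text{seg}}(\cc)}+\frac{\nabla\ell_{\text{surface}}(\cc)}{\ell_{\text{surface}}(\cc)}-\frac{\nabla\ell_{\text{depth}}(\cc)}{0.001+\ell_{\text{depth}}(\cc)}\right).
\]
A descent step on $F$ therefore weights the segmentation and surface normal gradients positively and the depth gradient negatively, with each task's weight inversely proportional to its current loss value.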

\begin{figure}
%\begin{figure}
\begin{centering}
\includegraphics[scale=0.4]{fig/general/mtan_traj.pdf}
\par\end{centering}
\caption{The evolution of the criterion $F$ and the norm of the MGD gradient when training with PNG on the NYUv2 dataset with the MTAN network. PNG effectively decreases the criterion while keeping the model within the Pareto set, since the norm of the MGD gradient remains unchanged.} \label{fig: mtan_traj}
\end{figure}


\subsection{Finding Diverse Pareto Models} \label{appendix_sec: pareto approximation}
\subsubsection{Experiment Details} \label{appendix_sec: pareto approximation exp}
We train the model for 100 epochs using the Adam optimizer with batch size 256 and learning rate 0.001. To encourage diversity of the models, following the setting in \citet{mahapatra2020multi}, we use equally distributed preference vectors for linear scalarization and EPO. Note that the stochasticity of mini-batch training improves the Pareto approximation for free if the intermediate checkpoints are also used to approximate $\P$. To fully exploit this advantage, for all the methods, we collect a checkpoint every epoch, starting from epoch 60, to approximate $\P$.
\subsubsection{Evaluation Metric Details} \label{appendix_sec: pareto approximation metric}
We now define the metrics used for evaluation. Given a set $\hat{\P} = \{\th_1,\ldots, \th_N\}$ that we use to approximate $\P$, its IGD+ score is defined as: 
\[
\text{IGD$_+$}(\hat{\P})=\int_{\P^*}
q(\th,\hat{\P})d\mu(\th),\ \ \ \  q(\th,\hat{\P})=\min_{\hat{\th}\in\hat{\P}}\left\Vert \left(\L(\hat{\th})-\L(\th)\right)_{+}\right\Vert,
\]
where $\mu$ is some base measure that measures the importance of
$\th\in \P$ and $(t)_+\defeq \max(t,0)$, applied to each element of a vector. Intuitively, for each $\th$, we find the
nearest $\hat{\th}\in\hat{\P}$ that approximates $\th$ best. Here
the $(\cdot)_{+}$ is applied because we only care about the tasks on which $\hat{\th}$
is worse than $\th$. In practice, a common choice of $\mu$ is
a uniform counting measure over uniformly sampled (or selected) models
from $\P$. In our experiments, since we cannot sample models from $\P$, we approximate $\P$ by combining $\hat{\P}$ from all the methods, {\color{black} i.e., $\P\approx\cup_{m\in\text{\{Linear,MGD,EPO,PNG\}}}\hat{\P}_{m}$, where $\hat{\P}_m$ is the approximation set produced by algorithm $m$.}

This approximation might not be accurate but is sufficient to compare the different methods.
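
As a small worked example of the IGD+ score, assume $\mu$ is the normalized counting measure over $K$ reference models $\th^{(1)},\ldots,\th^{(K)}$ (the normalization and the superscript notation are assumptions made for this illustration), so that
\[
\text{IGD$_+$}(\hat{\P})=\frac{1}{K}\sum_{k=1}^{K}\min_{\hat{\th}\in\hat{\P}}\left\Vert \left(\L(\hat{\th})-\L(\th^{(k)})\right)_{+}\right\Vert.
\]
Take $K=2$ with hypothetical loss vectors $\L(\th^{(1)})=(0.2,0.5)$ and $\L(\th^{(2)})=(0.5,0.2)$, and a single approximating model with $\L(\hat{\th})=(0.3,0.4)$. Then $\left(\L(\hat{\th})-\L(\th^{(1)})\right)_{+}=(0.1,0)$ and $\left(\L(\hat{\th})-\L(\th^{(2)})\right)_{+}=(0,0.2)$, which gives
\[
\text{IGD$_+$}(\hat{\P})=\frac{0.1+0.2}{2}=0.15.
\]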


The Hypervolume score of $\hat \P$, w.r.t. a reference point $\L^r\in \RRplus^m$,  is defined as 
\[
\text{HV}(\hat{\P}) = \mu\left(\left\{ \L=[\ell_{1},...,\ell_{m}]\in\mathbb{R}^{m}\mid\exists\th\in\hat{\P},\ \text{s.t.}\ \ell_{t}(\th)\le\ell_{t}\le\ell_{t}^{r}\ \forall t\in[m]\right\} \right),
\]
where $\mu$ is again some measure. We use $\L^r=[0.6, 0.6]$ for calculating the Hypervolume based on loss and set $\mu$ to be the Lebesgue measure. Here we choose 0.6 because we observe that the losses of the two tasks do not exceed 0.6, so 0.6 is roughly the worst case. When calculating the Hypervolume based on accuracy, we simply flip the sign.
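
As a worked example with made-up numbers, take $m=2$, $\L^r=[0.6,0.6]$, and suppose $\hat{\P}$ contains two models with loss vectors $(0.2,0.4)$ and $(0.4,0.2)$. The dominated region is the union of the rectangles $[0.2,0.6]\times[0.4,0.6]$ and $[0.4,0.6]\times[0.2,0.6]$, which overlap on $[0.4,0.6]\times[0.4,0.6]$, so under the Lebesgue measure
\[
\text{HV}(\hat{\P})=0.4\times0.2+0.2\times0.4-0.2\times0.2=0.12.
\]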

\begin{table}
\begin{centering}
\begin{tabular}{c|ccccccccc}
\toprule
\multirow{3}{*}{\makecell[c]{Algorithm}} & \multicolumn{2}{c}{Segmentation} & \multicolumn{2}{c}{Depth} & \multicolumn{5}{c}{Surface Normal}\tabularnewline
\cline{2-10}
 & \multicolumn{2}{c}{(Higher Better)} & \multicolumn{2}{c}{(Lower Better)} & \multicolumn{2}{c}{\makecell[c]{Angle Distance \\ (Lower Better)}} & \multicolumn{3}{c}{Within $t^{\circ}$}\tabularnewline
 & mIoU & Pix Acc & Abs Err & Rel Err & Mean & Median & 11.25 & 22.5 & 30\tabularnewline
\hline 
Standard & 27.09 & 56.36 & 0.6143 & 0.2618 & 31.46 & 27.37 & 19.51 & 41.71 & 54.61\tabularnewline
PNG & 28.23 & 56.66 & 0.6161 & 0.2632 & 31.06 & 26.50 & 21.06 & 43.41 & 55.93\tabularnewline
\bottomrule
\end{tabular}\caption{Comparing the multitask performance of standard training using linear scalarization with equally weighted losses and the targeted fine-tuning based on PNG.} \label{tbl: mtan}
\par\end{centering}
\end{table}

\subsubsection{Ablation Study} \label{appendix_sec: pareto approximation abl}
We conduct an ablation study to understand the effect of $\alpha$ and $\gamma$ using the Pareto approximation task on Multi-MNIST. We compare PNG with $\alpha=0.25, 0.5, 0.75$ and $\gamma=0.01, 0.1, 0.25$. Table \ref{table: ablation} summarizes the results. Overall, we observe that PNG is not sensitive to the choice of hyper-parameters.

\begin{table}
\begin{centering}
\begin{tabular}{c|c|cccc}
\toprule
\multicolumn{2}{c|}{} & \multicolumn{2}{c}{Loss} & \multicolumn{2}{c}{Acc}\tabularnewline
\multicolumn{1}{c}{} &  & Hv$\uparrow$ ($10^{-2}$) & IGD$\downarrow$ ($10^{-2}$) & Hv$\uparrow$ ($10^{-2}$) & IGD$\downarrow$ ($10^{-2}$)\tabularnewline
\hline 
\multirow{3}{*}{$\gamma=0.1$} & $\alpha=0.25$ & $7.89\pm0.11$ & $0.041\pm0.012$ & $9.39\pm0.038$ & $0.0056\pm0.002$\tabularnewline
 & $\alpha=0.5$ & $7.86\pm0.12$ & $0.043\pm0.012$ & $9.39\pm0.038$ & $0.0056\pm0.002$\tabularnewline
 & $\alpha=0.75$ & $7.84\pm0.11$ & $0.045\pm0.013$ & $9.38\pm0.037$ & $0.0057\pm0.002$\tabularnewline
\hline 
\multirow{3}{*}{$\alpha=0.5$} & $\gamma=0.01$ & $7.86\pm0.12$ & $0.042\pm0.012$ & $9.39\pm0.038$ & $0.0056\pm0.002$\tabularnewline
 & $\gamma=0.1$ & $7.86\pm0.12$ & $0.043\pm0.012$ & $9.39\pm0.038$ & $0.0056\pm0.002$\tabularnewline
 & $\gamma=0.25$ & $7.85\pm0.11$ & $0.042\pm0.012$ & $9.39\pm0.036$ & $0.0056\pm0.002$\tabularnewline
\bottomrule
\end{tabular}
\par\end{centering}
\caption{Ablation study on the Multi-MNIST dataset with different choices of $\alpha$ and $\gamma$.} \label{table: ablation}
\end{table}

\subsubsection{Comparing with the Second Order Approach} \label{appendix_sec: compare second order}
We compare our approach with the second order approach proposed by \citet{ma2020efficient}. In terms of the algorithm, \citet{ma2020efficient} is a local expansion approach. To apply \citet{ma2020efficient}, in the first stage, we need to start with several well distributed models (i.e., the ones obtained by linear scalarization with different preference weights), and \citet{ma2020efficient} is only applied in the second stage to explore the neighborhood of each model. The performance gain comes from the local neighbor search around each model (i.e., the second stage).

In comparison, PNG with energy distance is a global search approach. It improves how well the models are distributed in the first stage (i.e., it is a better approach than simply using linear scalarization with different weights), and thus the performance gain comes from the first stage. Notice that we can also apply \citet{ma2020efficient} on top of PNG with energy distance to add an extra local search that further improves the approximation.

For the run time comparison, we measure the wall-clock time of each step of updating the 5 models using PNG and the second order approach of \citet{ma2020efficient}. We compute the run time on the multi-MNIST dataset as the average over 100 steps. PNG takes 0.3s per step while \citet{ma2020efficient} takes 16.8s, so PNG is $56\times$ faster than the second order approach. We further argue that, based on time complexity, the gap will be even larger as the size of the network increases.


\subsection{Trajectory Visualization with Different Hyper-parameters} \label{appendix_sec: dynamics}
We visualize the PNG trajectory under different hyper-parameters. We reuse the synthetic example introduced in Section \ref{sec: subset application} to study the hyper-parameters $\alpha$ and $\gamma$. We fix $\alpha=0.25$ and vary $\gamma = 0.1, 0.05, 0.01, 0.001$; and fix $\gamma=0.01$ and vary $\alpha=0.1, 0.25, 0.5, 0.75$. Figure \ref{fig: epo recover ablation} plots the trajectories. As we can see, when $\gamma$ is properly chosen, PNG finds the correct models for all values of $\alpha$, although along different trajectories. The value of $\alpha$ determines how the algorithm balances the descent of the task losses against the descent of the criterion. On the other hand, when $\gamma$ is too large, the algorithm fails to find a model that is close to $\P^*$, which is expected.

\begin{figure}
\begin{centering}
\includegraphics[scale=0.32]{fig/ablation/NPOalpha_0.25thre_0.1_recover.pdf}\hspace{-0.6cm}
\includegraphics[scale=0.32]{fig/ablation/NPOalpha_0.25thre_0.05_recover.pdf}\hspace{-0.6cm}
\includegraphics[scale=0.32]{fig/ablation/NPOalpha_0.25thre_0.01_recover.pdf}\hspace{-0.6cm}
\includegraphics[scale=0.32]{fig/ablation/NPOalpha_0.25thre_0.001_recover.pdf}

\includegraphics[scale=0.32]{fig/ablation/NPOalpha_0.1thre_0.01_recover.pdf}\hspace{-0.6cm}
\includegraphics[scale=0.32]{fig/ablation/NPOalpha_0.25thre_0.01_recover.pdf}\hspace{-0.6cm}
\includegraphics[scale=0.32]{fig/ablation/NPOalpha_0.5thre_0.01_recover.pdf}\hspace{-0.6cm}
\includegraphics[scale=0.32]{fig/ablation/NPOalpha_0.75thre_0.01_recover.pdf}
\par\end{centering}
\caption{Ablation study on OPT-in-Pareto with different ratio constraints on the objectives. Upper row, from left to right: fixing $\alpha=0.25$, $\gamma=0.1, 0.05, 0.01, 0.001$; lower row, from left to right: fixing $\gamma=0.01$, $\alpha=0.1, 0.25, 0.5, 0.75$. Comparing the figures in the first row, we find that choosing too large a $\gamma$ makes the final converged model far away from the Pareto set, as expected. Comparing the figures in the second row, we find that changing $\alpha$ makes PNG give different priority to Pareto improvement or to descent on $F$. When $\alpha$ is larger (the right figures), PNG first moves the model to the Pareto set and starts to decrease $F$ after that.} \label{fig: epo recover ablation}
\end{figure}

\subsection{Improving Multitask Based Domain Generalization} \label{appendix_sec: dg}

We argue that many other deep learning problems also have the structure
of multitask learning when multiple losses are present, and thus optimization
techniques from multitask learning can also be applied to those domains. In this paper, we consider JiGen \citep{carlucci2019domain}.
JiGen learns
a model that generalizes to unseen domains by minimizing a standard
cross-entropy loss $\ell_{\text{class}}$ for classification and an
unsupervised loss $\ell_{\text{jig}}$ based on Jigsaw puzzles: 
\[
\ell(\th)=(1-\omega)\ell_{\text{class}}(\th)+\omega\ell_{\text{jig}}(\th).
\]
The ratio between the two losses, i.e., $\omega$, is important to the final performance of the model and requires a careful grid search. Notice that JiGen is essentially searching for a model on the Pareto front using linear scalarization. Instead of using a fixed linear scalarization to learn a model, a natural question is whether it is possible to design a mechanism that dynamically adjusts the ratio of the losses so that we learn a better model.

We give a case study here. Motivated by
adversarial feature learning \citep{JMLR:v17:15-239}, we propose to improve JiGen such that the latent feature representations of the two tasks are well aligned. Specifically, suppose that {\color{black}$\Phi_{\text{class}}(\th)=\{\phi_{\text{class}}(x_{i},\th)\}_{i=1}^{n}$ and $\Phi_{\text{jig}}(\th)=\{\phi_{\text{jig}}(x_{i},\th)\}_{i=1}^{n}$
are the distributions of the latent feature representations of the two tasks, where $x_i$ is the $i$-th training example.}
Let $F_{\text{PD}}$ be a probability metric that measures
the distance between two distributions. We consider the following
problem: 
{\color{black}
\[
\min_{\th\in \P^*}F_{\text{PD}}[\Phi_{\text{class}}(\th),\Phi_{\text{jig}}(\th)].
\]
}
With $F_{\text{PD}}$ as the criterion function, our algorithm automatically reweights the two tasks such that their latent spaces are well aligned.

\begin{table}[t]
\begin{centering}
\scalebox{0.92}{
\begin{tabular}{c|cccc|c}
\toprule
Method & Art paint & Cartoon & Sketches & Photo & Avg\tabularnewline
\hline 
\multicolumn{6}{c}{AlexNet}\tabularnewline
\hline 
TF & $0.6268$ & $0.6697$ & $0.5751$ & $0.8950$ & $0.6921$\tabularnewline
CIDDG & $0.6270$ & $0.6973$ & $0.6445$ & $0.7865$ & $0.6888$\tabularnewline
MLDG & $0.6623$ & $0.6688$ & $0.5896$ & $0.8800$ & $0.7001$\tabularnewline
D-SAM & $0.6387$ & $0.7070$ & $0.6466$ & $0.8555$ & $0.7120$\tabularnewline
DeepAll & $0.6668$ & $0.6941$ & $0.6002$ & $0.8998$ & $0.7152$\tabularnewline
\hline 
JiGen & $0.6855\pm0.004$ & $\pmb{0.6889\ensuremath{\pm}0.002}$ & $\pmb{0.6831\ensuremath{\pm}0.011}$ & $0.8946\pm0.008$ & $0.7380\pm0.002$\tabularnewline
JiGen + adv & $0.6857\pm0.004$ & $0.6837\pm0.003$ & $0.6753\pm0.008$ & $0.8980\pm0.001$ & $0.7357\pm0.003$\tabularnewline
JiGen + PNG & $\pmb{0.6914\ensuremath{\pm}0.005}$ & $\pmb{0.6903\ensuremath{\pm}0.002}$ & $\pmb{0.6855\ensuremath{\pm}0.007}$ & $\pmb{0.9044\ensuremath{\pm}0.003}$ & $\pmb{0.7429\ensuremath{\pm}0.002}$\tabularnewline
\hline 
\multicolumn{6}{c}{ResNet-18}\tabularnewline
\hline 
D-SAM & $0.7733$ & $0.7243$ & $0.7783$ & $0.9530$ & $0.8072$\tabularnewline
DeepAll & $0.7785$ & $0.7486$ & $0.6774$ & $0.9573$ & $0.7905$\tabularnewline
\hline 
JiGen & $0.8009\pm0.004$ & $0.7363\pm0.007$ & $0.7046\pm0.013$ & $\pmb{0.9629\ensuremath{\pm}0.002}$ & $0.8012\pm0.002$\tabularnewline
JiGen + adv & $0.7923\pm0.006$ & $0.7402\pm0.004$ & $0.7188\pm0.005$ & $0.9617\pm0.001$ & $0.8033\pm0.001$\tabularnewline
JiGen + PNG & $\pmb{0.8014\ensuremath{\pm}0.005}$ & $\pmb{0.7538\ensuremath{\pm}0.001}$ & $\pmb{0.7222\ensuremath{\pm}0.006}$ & $\pmb{0.9627\ensuremath{\pm}0.002}$ & $\pmb{0.8100\ensuremath{\pm}0.005}$\tabularnewline
\bottomrule
\end{tabular}
}
\par\end{centering}
\caption{Comparing different algorithms for domain generalization on the PACS dataset with two network architectures. The setting is the same as that of Table \ref{tbl: domain_small}.} \label{tbl: dg}
\end{table}

\textbf{Setup} We keep all the experiment settings the same as in \citet{carlucci2019domain}. We use AlexNet and ResNet-18 with multiple heads, pretrained on ImageNet, as the multitask network. We evaluate the methods on PACS \citep{Li_2017_ICCV}, which covers 7 object categories and 4 domains (Photo, Art Paintings, Cartoon and Sketches). As in \citet{carlucci2019domain}, we train our model using three domains as source datasets and the remaining one as the target. We implement $F_\text{PD}$, which measures the discrepancy between the feature spaces of the two tasks, using the idea of Domain Adversarial Neural Networks \citep{ganin2015unsupervised}: we add an extra prediction head on the shared feature space to predict whether the input is for the classification task or the Jigsaw task. {\color{black}Specifically, we add an extra linear layer on the shared latent feature representations that is trained to predict the task that the latent representation belongs to, i.e.,
\[
F_{\text{PD}}(\Phi_{\text{class}}(\th),\Phi_{\text{jig}}(\th))=\min_{w,b}\frac{1}{n}\sum_{i=1}^{n}\left[\log(\sigma(w^{\top}\phi_{\text{class}}(x_{i},\th)+b))+\log(1-\sigma(w^{\top}\phi_{\text{jig}}(x_{i},\th)+b))\right].
\]
Notice that the optimal weight and bias of the linear layer depend on the model parameter $\th$. During training, $w$, $b$ and $\th$ are jointly updated using stochastic gradient descent. We follow the default training protocol provided by the source code of \citet{carlucci2019domain}. 
}


\textbf{Baselines} Our main baselines are JiGen \citep{carlucci2019domain}; JiGen + adv, which adds an extra domain adversarial loss to JiGen; and our PNG with the domain adversarial loss as the criterion function. In order to run statistical tests for comparing the methods, we run all the main baselines with 3 random trials. We use the source code released by \citet{carlucci2019domain} to obtain the performance of JiGen. For JiGen + adv, we use an extra run to tune the weight of the domain adversarial loss. Besides the main baselines, we also include TF \citep{Li_2017_ICCV}, CIDDG \citep{li2018deep}, MLDG \citep{li2018learning}, D-SAM \citep{d2018domain} and DeepAll \citep{carlucci2019domain} as baselines, with the author-reported performance for reference.

\textbf{Result} The results are summarized in Table \ref{tbl: dg}, where bold values indicate the statistically significant best methods, with a p-value below 0.1 under a matched-pair t-test. Combining JiGen with PNG to dynamically reweight the tasks implicitly regularizes the latent space without adding an explicit regularizer, which might hurt the performance on the tasks, and thus improves the overall result.


