% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

%%%%%%%%%%

\usepackage{balance}
\usepackage{wrapfig}
\usepackage{soul}
\usepackage{pifont}
\usepackage{amsthm}
\usepackage{amsmath,amsfonts,amssymb}
\usepackage{mathtools}
\usepackage{multirow}
\usepackage{subcaption}
\usepackage{algorithm}
\usepackage{algorithmicx}
\usepackage[noend]{algpseudocode}
\usepackage{comment}
\usepackage{enumitem}
\usepackage{xspace}
\usepackage{epigraph}

\usepackage{xcolor}
\hypersetup{
    colorlinks=true,
    linkcolor=teal,
    citecolor=teal,
    urlcolor=teal
}

\newcommand*{\eg}{\textit{e.g.,}\@\xspace}
\newcommand*{\ie}{\textit{i.e.,}\@\xspace}
% \newcommand*{\st}{\textit{s.t.}\@\xspace}
\newcommand*{\vs}{\textit{vs.}\@\xspace}
\newcommand*{\wrt}{\textit{w.r.t.}\@\xspace}
\makeatletter
\newcommand*{\etc}{%
	\@ifnextchar{.}%
	{\textit{etc}}%
	{\textit{etc.}\@\xspace}%
}
% \renewcommand\@makefntext[1]{%
% \setlength\parindent{1em}%
% \noindent
% \mbox{\@thefnmark.~}{#1}}
\makeatother

\algtext*{EndWhile}
\algtext*{EndFor}
\algtext*{EndIf}
\algdef{SE}[DOWHILE]{Do}{doWhile}{\algorithmicdo}[1]{\algorithmicwhile\ #1}%

%%%%%%%%%%%%%% Math garbage
\newtheorem{theorem}{Theorem}
\newtheorem{definition}{Definition}
\newtheorem{corollary}{Corollary}[theorem]
\newtheorem{lemma}{Lemma}[theorem]
\newtheorem{proposition}{Proposition}
\newtheorem*{remark}{Remark}
\newtheorem{case}{Case}

%%%%%%%%%%%%%%%% Operator garbage
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\Tr}{Tr}

\makeatletter
\def\BState{\State\hskip-\ALG@thistlm}
\makeatother

\newcommand{\cmark}{{\color{blue}\ding{51}}}%
\newcommand{\xmark}{{\color{red}\ding{55}}}

\newcommand\numberthis{\addtocounter{equation}{1}\tag{\theequation}}

% ML 4/23/22
% http://mirrors.ibiblio.org/CTAN/macros/latex/contrib/changes/changes.english.pdf

%\usepackage[final]{changes}
% ---EXAMPLES---
% \added{text to add}
% \replaced{new text}{old text}
% \deleted{text to cut}.

% 2/19/22: todonotes don't play well with UAI package, overlap text, so switching
% to inline highlighting for now
% \setlength{\marginparwidth}{0.5cm}
\usepackage[textsize=scriptsize,backgroundcolor=green,textwidth=15mm]{todonotes}
\newcommand\sg[1]{\todo{{\bf SG}: #1}} % Soumya
\newcommand\ml[1]{\todo{{\bf ML}: #1}} % Matt
% \newcommand\sg[1]{\hl{{\bf SG}: #1}} % Soumya
% \newcommand\ml[1]{\hl{{\bf ML}: #1}} % Matt

% \usepackage{changes}

\newcommand\D{\mathcal{D}}
\newcommand\s{\mathcal{S}}

\newcommand{\etal}{et al.~}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%

\title{Learning a Neural Pareto Manifold Extractor with Constraints (Supplementary material)}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Soumyajit Gupta}
\author[2]{Gurpreet Singh}
\author[3]{Raghu Bollapragada}
\author[4]{Matthew Lease}
% Add affiliations after the authors
\affil[1]{%
    %smjtgupta@utexas.edu, 
    Department of Computer Science, University of Texas at Austin, USA
}
\affil[2]{%
    %gurpreet@xtractorai.com, 
    XtractorAI
}
\affil[3]{%
    %raghu.bollapragada@utexas.edu, 
    Operations Research and Industrial Engineering, University of Texas at Austin, USA
}
\affil[4]{%
    %ml@utexas.edu, 
    School of Information, University of Texas at Austin, USA 
}
% \affil[5]{%
%     University of Texas at Austin, USA
% }
  
\begin{document}
\maketitle

\section{SUHNPF as HyperNetwork} \label{app:hypernet}
\vspace{-1em}

\textbf{Fig. \ref{fig:framework}} shows an overview of SUHNPF as a hypernetwork tasked with optimizing the weights of the target neural classifier. The inputs to the neural classifier are data points $Y$, and its outputs are matched against labels $Z_{1,true},Z_{2,true}$ for two different tasks. The weights of the neural classifier are $\Theta$, and SUHNPF, as a hypernetwork, approximates the weak Pareto manifold $\tilde{M}(\theta^*)$ of optimal trade-offs over different values of $\alpha$ for the two MOO losses $\mathcal{L}_1,\mathcal{L}_2$.

\begin{figure}[h]
    \centering
    \vspace{-0.5em}
    \includegraphics[width=0.8\linewidth]{figs/SUHNPF.pdf}
    \vspace{-1em}
    \caption{\small Framework for extracting the Pareto optimal front $\tilde{M}(\Theta)$ of a given target model $C_{\Theta}$ (which could also be a non-neural model: Decision Tree, Logistic Regression, \textit{etc}.)}
    % \vspace{-1em}
    \label{fig:framework}
\end{figure}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vspace{-1em}
\section{SUHNPF \vs Other MTL methods} \label{app:comp}
\vspace{-1em}

\textbf{Point based solvers}. Most MTL methods, including 
MOOMTL, % \citep{sener2018multi}, 
PMTL, % \citep{lin2019pareto}, 
EPSE, % \citep{ma2020efficient}, 
and EPO %\citep{mahapatra2020multi} 
are point-based solvers. Being point-based, they return one solution per run and rely upon specialized local initialization, using cones, rays, or other domain partitioning strategies, to generate an even spread of Pareto points across the feasible set of saddle points. Thus, when asked for $P$ Pareto candidates, these solvers must run $P$ instances. If the user later demands $2P$ points, they must run $2P$ instances from scratch, without reusing the results of the previous run.

\textbf{Manifold based solvers}. A manifold-based solution strategy should separate Pareto \vs non-Pareto points without requiring any special initialization. It should also be able to extract $2P$ Pareto candidates after being trained to generate $P$ candidates, by interpolating from the learned boundary. This is highly advantageous over point-based schemes when deploying practical systems, where the expected user trade-off preference is not known \textit{a priori} and it is therefore useful to have the full approximated front. Notably, \citet{navon2021learning} and \citet{lin2021controllable} are the only prior manifold-based Pareto solvers we are aware of that also scale to optimizing large neural models. Another advantage is that both the SUHNPF and EPO (used by PHN in the backend) solvers have a user-specified error tolerance criterion built in, while other MTL solvers lack one and therefore run a specified number of iterations before declaring a candidate Pareto, without actually checking for optimality. 

\textbf{Full rank indicator \vs low rank regressor}. A manifold-based solver should also generalize to cases where the manifold is an implicit function, as opposed to the easier case of an explicit function. SUHNPF has the added advantage of extracting the weak Pareto manifold as a $k$-dimensional diffusive indicator function rather than as the $(k-1)$-dimensional manifold itself, so that the regressed manifold is guided not only by the weak Pareto points (indicator value 1) but also by the sub-optimal points (indicator value 0), for a more robust and accurate extraction. Thus it can approximate the manifold irrespective of whether the manifold is an explicit or implicit function. In comparison, PHN learns a $(k-1)$-dimensional regression manifold, given solution points obtained from EPO or LS. Therefore, PHN's default assumption is that the Pareto manifold is always an explicit function, \ie for $k$ objectives, the Pareto manifold is of dimension $k-1$.

% NOTE: necessary when ptmx or no mathfont class option is given
\providecommand{\upGamma}{\Gamma}
\providecommand{\uppi}{\pi}

\vspace{-1em}
\section{Discussion on Remark 1}\label{app:fjc}
\vspace{-1em}

\textbf{Remark}: \small \textit{If the $f_i$ are continuous and once differentiable, then in an unconstrained setting the set of weak Pareto optimal points is $x^*=\{x|det(L(x)^TL(x))=0\}$ for a non-square matrix $L(x)$, which is equivalent to $x^*=\{x|det(L(x))=0\}$ for a square matrix $L(x)$.}

% \vspace{-1em}
% \subsection{Illustration}
% \vspace{-1em}

We begin by considering an unconstrained MOO problem for ease of description. Consider the following two quadratic functions, each convex in both variables:
\begin{align*}
f_{1}(\mathbf{x}) = (x_{1}-1)^{2} + (x_{2}-1)^2\\
f_{2}(\mathbf{x}) = (x_{1}+1)^2 + (x_{2}+1)^2
\end{align*}

The task is to find the Pareto front between the objectives $f_{1}(\mathbf{x})$ and $f_{2}(\mathbf{x})$. For this problem the Pareto front is known \textit{a priori} as the straight line $x_{1} = x_{2}$ for $x_{1} \in [-1,1]$ in the variable domain. Let us now first plot $f_{1}$ \vs $f_{2}$ for visual assessment in \textbf{Fig. \ref{fig:parabola}}.
\begin{figure}[ht]
    \centering
    \vspace{-1em}
    \includegraphics[width=0.6\linewidth]{figs/parabola.png}
    \vspace{-1em}
    \caption{\small Functional Domain plot for two competing objectives.}
    \vspace{-1em}
    \label{fig:parabola}
\end{figure}

Note that independent of each other:
\vspace{-0.5em}
\begin{align*}
    \nabla f_{1}(\mathbf{x})& = \left[\frac{\partial f_{1}}{\partial x_{1}} \quad \frac{\partial f_{1}}{\partial x_{2}}\right]^{T} = \mathbf{0} \,\, \mathrm{at} \,\, (x_{1},x_{2}) = (1,1)\\
    \nabla f_{2}(\mathbf{x}) &= \left[\frac{\partial f_{2}}{\partial x_{1}} \quad \frac{\partial f_{2}}{\partial x_{2}}\right]^{T} = \mathbf{0} \,\, \mathrm{at} \,\, (x_{1},x_{2}) = (-1,-1)
\end{align*}

One can easily confirm that the gradient matrix $L$ cannot be identically zero for any value of $x \in \mathbb{R}^{2}$.
\vspace{-0.5em}
\begin{align*}
    L = [\nabla f_{1}(x) \quad \nabla f_{2}(x)]
\end{align*}
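For the two quadratics above, this matrix can be written out explicitly (a short check added for illustration, using only the definitions of $f_1$ and $f_2$):
\begin{align*}
    L = \begin{bmatrix} 2(x_{1}-1) & 2(x_{1}+1)\\ 2(x_{2}-1) & 2(x_{2}+1) \end{bmatrix}
\end{align*}
The first column vanishes only at $(1,1)$ and the second only at $(-1,-1)$, so the two columns are never zero simultaneously.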



% We already know that $L$ or the gradient matrix cannot be identically zeros as discussed before. 
To avoid a trivial solution the vector $[\alpha_{1} \quad \alpha_{2}]$ must also not be identically zero. This becomes clear if the scalarized function $S(\mathbf{x})=\alpha_1 f_1 + \alpha_2 f_2$ is defined, where $\alpha_1+\alpha_2=1$.
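To make this step explicit (a one-line check added for illustration), the gradient of the scalarized function is
\begin{align*}
    \nabla S(\mathbf{x}) = \alpha_{1} \nabla f_{1}(\mathbf{x}) + \alpha_{2} \nabla f_{2}(\mathbf{x}) = L\alpha,
\end{align*}
so the stationary points of $S$ are exactly the points where $L\alpha = \mathbf{0}$, and the normalization $\alpha_1+\alpha_2=1$ rules out $\alpha = \mathbf{0}$.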

The only remaining possibility is that $L\alpha$ approaches zero for some $\mathbf{x} = [x_{1} \quad x_{2}]$ as we iteratively update $x$. This gives us our termination/convergence criterion. 
% Now let us look at what does $L\alpha = 0$ imply to understand the role of $\alpha$.
\vspace{-0.5em}
\begin{align*}
    L\alpha &= [\nabla f_{1}(x) \quad \nabla f_{2}(x)] [\alpha_{1} \quad \alpha_{2}]^{T} \\
    &= \left[\begin{matrix} \frac{\partial f_{1}}{\partial x_{1}} & \frac{\partial f_{2}}{\partial x_{1}}\\ \frac{\partial f_{1}}{\partial x_{2}} & \frac{\partial f_{2}}{\partial x_{2}}\end{matrix}\right] [\alpha_{1} \quad \alpha_{2}]^{T} = [0 \quad 0]^{T}
\end{align*}
\vspace{-0.5em}

Consider any point $(x_{1},x_{2})$ in the feasible domain. Which $\alpha$ values achieve the above termination criterion? We now have two equations in two unknowns (the $\alpha_{i}$):
\vspace{-0.5em}
\begin{align*}
    \alpha_{1}\frac{\partial f_{1}}{\partial x_{1}} + \alpha_{2} \frac{\partial f_{2}}{\partial x_{1}} = 0 \\
    \alpha_{1}\frac{\partial f_{1}}{\partial x_{2}} + \alpha_{2} \frac{\partial f_{2}}{\partial x_{2}} = 0
\end{align*}
\vspace{-0.5em}
    
Eliminating $\alpha_{1}$ using the first equation and substituting in the second equation:
\begin{align*}
    \left[-\left(\frac{\partial f_{1}}{\partial x_{2}} \frac{\partial f_{2}}{\partial x_{1}}\right)/\left(\frac{\partial f_{1}}{\partial x_{1}} \right) + \frac{\partial f_{2}}{\partial x_{2}}\right] \alpha_{2} = 0
\end{align*}

For any $\alpha_{2} >0$, this implies: 
\begin{align*}
    &\left[-\left(\frac{\partial f_{1}}{\partial x_{2}} \frac{\partial f_{2}}{\partial x_{1}}\right)/\left(\frac{\partial f_{1}}{\partial x_{1}} \right) + \frac{\partial f_{2}}{\partial x_{2}}\right] = 0\\
    \Rightarrow &\left[ \frac{\partial f_{1}}{\partial x_{1}} \frac{\partial f_{2}}{\partial x_{2}} - \frac{\partial f_{1}}{\partial x_{2}} \frac{\partial f_{2}}{\partial x_{1}} \right] = 0
\end{align*}

Alternatively,
\begin{align*}
    det \left(\left[\begin{matrix} \frac{\partial f_{1}}{\partial x_{1}} & \frac{\partial f_{2}}{\partial x_{1}}\\ \frac{\partial f_{1}}{\partial x_{2}} & \frac{\partial f_{2}}{\partial x_{2}}\end{matrix}\right] \right) = det(L)= 0
\end{align*}

Note that for any matrix $A \neq \mathbf{0}$, $Ax=0$ has a non-trivial solution $x \neq 0$ if and only if $A$ has a non-trivial null space, \ie $A$ is column-rank deficient; if $A$ is square, this is equivalent to its determinant being zero. $\qedsymbol$
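
As a sanity check on the example above (a worked instance added for illustration, not part of the argument itself), substituting the partial derivatives of $f_1$ and $f_2$ gives
\begin{align*}
    det(L) &= 2(x_{1}-1)\cdot 2(x_{2}+1) - 2(x_{2}-1)\cdot 2(x_{1}+1)\\
    &= 8(x_{1} - x_{2}),
\end{align*}
which vanishes exactly on the line $x_1 = x_2$. On that line, $L\alpha = \mathbf{0}$ with $\alpha_1+\alpha_2=1$ reduces to $x_1 = \alpha_1 - \alpha_2$, so restricting to $\alpha_1,\alpha_2 \geq 0$ recovers the segment $x_1 \in [-1,1]$ stated earlier.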



The matrix $L$ defined in \textbf{Eq. 6 (main)} is given by:
\begin{align*}
    L = \begin{bmatrix}
    \nabla F & \nabla G \\
    \mathbf{0} & G
    \end{bmatrix}
\end{align*}
To achieve $det(L)=0$ requires that either:

\begin{enumerate}[leftmargin=*]
	\item $\nabla F(x)=0$: at least one objective function has reached its optimum (local/global minimum/maximum under a min/max setting); {\em and/or} 
	\item  $G(x)=0$: at least one constraint is satisfied.
\end{enumerate}
This criterion is only applicable to square systems. However, for practical problems, the system might become non-square, hence we need to satisfy $det(L^TL)=0$ following \textbf{Eq. 7 (main)}. One might think that this is a different optimization problem. However, satisfying $det(L^TL)=0$ mathematically provides the same justification, and we provide its derivation here.
\begin{align*}
    L^TL &= \begin{bmatrix}
    \nabla F^T & \mathbf{0} \\
    \nabla G^T & G^T
    \end{bmatrix} \begin{bmatrix}
    \nabla F & \nabla G \\
    \mathbf{0} & G
    \end{bmatrix}\\
    &= \begin{bmatrix}
    \nabla F^T \nabla F & \nabla F^T \nabla G \\
    \nabla G^T \nabla F & \nabla G^T \nabla G + G^TG
    \end{bmatrix} \numberthis \label{eq:efjc}
\end{align*}
We now examine Eq. \ref{eq:efjc} for the two cases prescribed above to see whether $det(L^TL)$ evaluates to zero. For Case 1, where $\nabla F=0$, Eq. \ref{eq:efjc} reduces to:
\begin{align*}
    L^TL &= \begin{bmatrix}
    \mathbf{0} & \mathbf{0} \nabla G \\
    \nabla G^T \mathbf{0} & \nabla G^T \nabla G + G^TG
    \end{bmatrix}
\end{align*}
which is low-rank since row 1 is zero.
For Case 2, where $G=0$, Eq. \ref{eq:efjc} reduces to:
\begin{align*}
    L^TL &= \begin{bmatrix}
    \nabla F^T \nabla F & \nabla F^T \nabla G \\
    \nabla G^T \nabla F & \nabla G^T \nabla G + 0
    \end{bmatrix}\\
    &= \nabla F^T \nabla G^T \begin{bmatrix}
    \nabla F & \nabla G \\
    \nabla F & \nabla G
    \end{bmatrix}
\end{align*}
which is low-rank again because row 1 and row 2 are equal. Hence satisfying $det(L)=0$ is equivalent to satisfying $det(L^TL)=0$. $\qedsymbol$
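
For a square matrix $L$, the same equivalence also follows from a standard determinant identity (stated here as a supplementary check; it is a general linear-algebra fact rather than something specific to the Fritz-John matrix):
\begin{align*}
    det(L^TL) = det(L^T)\,det(L) = det(L)^2,
\end{align*}
so $det(L^TL)=0$ if and only if $det(L)=0$. For a non-square $L$, the analogous statement is $\mathrm{rank}(L^TL)=\mathrm{rank}(L)$, so $det(L^TL)=0$ exactly when the columns of $L$ are linearly dependent.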

% \vspace{-1em}



\vspace{-1em}
\section{Experimental Setup Details}
\label{app:setup}
\vspace{-1em}

{\bf Experimental Setup}. We use an NVIDIA RTX 2060 Super 8GB GPU, an Intel Core i7-9700F 3.0GHz 8-core CPU, and 16GB of DDR4 memory for all experiments. Keras \citep{chollet2015} is used on a TensorFlow 2.0 backend with Python 3.7 to train the SUHNPF networks and evaluate the MTL solvers. For optimization, we use AdaMax \citep{kingma2014adam} with learning rate \textit{lr}=$0.001$.

{\bf SUHNPF Setup}. Each training step runs for $2$ epochs, with $50$ steps per epoch. Thus, if the network takes $I$ iterations to converge, the effective number of epochs taken by the network is $2I$. To compute the gradient of the Fritz-John matrix \wrt the input variables $x$, we use TensorFlow's {\tt GradientTape}\footnote{\scriptsize\url{https://www.tensorflow.org/api_docs/python/tf/GradientTape}}, which implicitly allows us to scale the computation of the gradient matrix $\nabla det$ to arbitrarily large dimensions of the variable $x$. To compute the gradient update on $\mathcal{P}1$, we use a learning rate of $\eta=0.01$.
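
For reference, the derivative of a determinant can also be written in closed form using Jacobi's formula (a standard identity, included only as a sketch; $A(x)$ is shorthand introduced here for the square matrix whose determinant is differentiated, \eg $L$ or $L^TL$):
\begin{align*}
    \frac{\partial}{\partial x_{j}}\, det(A(x)) = \Tr\left(\mathrm{adj}(A(x)) \, \frac{\partial A(x)}{\partial x_{j}}\right),
\end{align*}
where $\mathrm{adj}(\cdot)$ denotes the adjugate. The adjugate form remains valid when $A$ is singular, which is the regime of interest near the Pareto front, where $det(A) \to 0$.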

{\bf MTL Setup}. Source code for the LS, MOOMTL, PMTL, and EPO solvers is taken from EPO's repository\footnote{\scriptsize\url{https://github.com/dbmptr/EPOSearch}}, while the EPSE\footnote{\scriptsize\url{https://github.com/mit-gfx/ContinuousParetoMTL}} and PHN\footnote{\scriptsize\url{https://github.com/AvivNavon/pareto-hypernetworks}} codes are taken from their individual repositories. %We do not alter any solver code for MTL methods, only the functions definitions are updated for each benchmark case.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vspace{-1em}
\section{General Discussion} \label{app:discussion}
\vspace{-1em}


%\ml{show results before going negative}
\textbf{Handling Non-Convex forms}: The Pareto optimal solution set is a collection of saddle points \citep{van1994saddle,ehrgott2005saddle} of an MOO problem, wherein no objective can be further improved without penalizing at least one of the other objectives. This entails min-max optimization to minimize objectives (such as loss functions) while simultaneously maximizing trade-offs between them. Although prior works \citep{sener2018multi,lin2019pareto,mahapatra2020multi} have asserted that Karush-Kuhn-Tucker (KKT) conditions \citep{boyd2004convex} in this min-max setting ensure that MTL methods find (correct) Pareto optimal solutions, it is known that KKT conditions hold true only for convex cases. \citet{gobbi2015analytical} further show that KKT-based criteria can give Pareto solutions only under a fully convex setting of objectives and constraints. 


\textbf{Evaluation on Benchmarks}. Because the Pareto solution is often unknown on real MOO problems, OR works have advocated that any proposed Pareto solver should first be tested on synthetic MOO problems with known analytic solutions. This permits controlled experimentation that varies MOO problem difficulty (\eg non-convexity in variable and function domains, presence of constraints, \etc) in order to assess solver capabilities and measure true accuracy against a known front. Ideally, studies should evaluate against synthetic benchmark problems that vary in difficulty; there is sometimes ambiguity and confusion in referring to an MOO problem as non-convex without clarifying the specific non-convex aspects. Difficulty can also vary greatly depending on whether non-convexity occurs in the objectives, constraints, or the front itself. 

{\bf Termination of Solvers}. An iterative solver should define termination criteria based on an error tolerance being satisfied and/or inability to further improve. It is also important that a solver reports inability to converge (achieve the termination criteria/error tolerance) within the specified maximum iterations. While both HNPF (used by SUHNPF) and EPO (used by PHN) define such error tolerance criteria for termination, inspection of source code for MOOMTL \citep{sener2018multi}, PMTL \citep{lin2019pareto}, and EPSE \citep{ma2020efficient} iterative solvers (at the time of our submission) shows support only for running a fixed number of iterations, without other termination criteria.
See the following source code links to the solvers for MOOMTL\footnote{\scriptsize\url{https://github.com/dbmptr/EPOSearch/blob/master/toy\_experiments/solvers/moo\_mtl.py}}, PMTL\footnote{\scriptsize\url{https://github.com/dbmptr/EPOSearch/blob/master/toy\_experiments/solvers/pmtl.py}}, and EPSE\footnote{\scriptsize\url{https://github.com/mit-gfx/ContinuousParetoMTL/blob/master/pareto/optim/hvp\_solver.py}}.


\vspace{-1em}
\section{Convergence Plots of SUHNPF} \label{app:conv}
\vspace{-1em}

\begin{figure}[ht]
    \centering
    %\vspace{-1.0em}
     \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/illus-form2-iter0.png}
      \vspace{-2em}
      \caption{Iteration 0 (Start)}
    \end{subfigure}
    \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/illus-form2-iter1.png}
      \vspace{-2em}
      \caption{Iteration 1}
    \end{subfigure}
    \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/illus-form2-iter2.png}
      \vspace{-2em}
      \caption{Iteration 2}
    \end{subfigure}
    \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/illus-form2-iter3.png}
      \vspace{-2em}
      \caption{Iteration 3}
    \end{subfigure}
    \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/illus-form2-iter4.png}
      \vspace{-2em}
      \caption{Iteration 4}
    \end{subfigure}
    \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/illus-form2-iter5.png}
      \vspace{-2em}
      \caption{Iteration 5 (Converged)}
    \end{subfigure}
    \vspace{-1.0em}
    \caption{\small Case I: Variable domain. The gray line shows the true analytic solution ($0 \leq x_1 \leq 1, x_2=0$). SUHNPF Pareto candidates $\mathcal{P}1$ (red dots) converge in 5 iterations.}
    \label{fig:varconv}
    %\vspace{-1em}
\end{figure}

Convergence of Pareto candidates towards the weak Pareto front over iterations is shown for the variable (\textbf{Fig. \ref{fig:varconv}}) and functional (\textbf{Fig. \ref{fig:funcconv}}) domains for benchmark Case I considered in our work.

\begin{figure}[ht]
    \centering
    %\vspace{-0.5em}
    \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/iter0-func-form2.png}
      \vspace{-2em}
      \caption{Iteration 0 (Start)}
    \end{subfigure}
    \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/iter1-func-form2.png}
      \vspace{-2em}
      \caption{Iteration 1}
    \end{subfigure}
    \begin{subfigure}{0.45\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/iter2-func-form2.png}
      \vspace{-2em}
      \caption{Iteration 2}
    \end{subfigure}
    \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/iter3-func-form2.png}
      \vspace{-2em}
      \caption{Iteration 3}
    \end{subfigure}
    \begin{subfigure}{0.45\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/iter4-func-form2.png}
      \vspace{-2em}
      \caption{Iteration 4}
    \end{subfigure}
    \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/pareto-form2.png}
      \vspace{-2em}
      \caption{Iteration 5 (Converged)}
    \end{subfigure}
    \vspace{-1em}
    \caption{\small Case I: Functional domain mapping to \textbf{Fig. 1 (main)}. SUHNPF Pareto candidates $\mathcal{P}1$ (red dots) converge in 5 iterations.}
    \label{fig:funcconv}
    \vspace{-1.0em}
\end{figure}

\begin{figure*}[t]
    \centering
    \vspace{-1em}
     \begin{subfigure}{0.3\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/form1loss.png}
      \vspace{-2em}
      \caption{Case I}
    \end{subfigure}
    \begin{subfigure}{0.3\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/form2loss.png}
      \vspace{-2em}
      \caption{Case II}
    \end{subfigure}
    \begin{subfigure}{0.3\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/form3loss.png}
      \vspace{-2em}
      \caption{Case III}
    \end{subfigure}
    \vspace{-1em}
    \caption{\small Loss profile for cases. Note that since the error threshold $\epsilon$ was set to $10^{-4}$ for the benchmark cases, the algorithm terminates once the Binary Cross Entropy (blue) loss falls below the threshold (value at epoch 10 $\leq 10^{-4}$). We also show the Mean Squared Error (dashed red) between the Pareto candidate set $\mathcal{P}1$ and the analytical solution for each iteration. Because each iteration takes two epochs, this leads to the ``staircase'' MSE shown.}
    \label{fig:loss}
    \vspace{-1em}
\end{figure*}

\section{Additional Benchmarks} \label{app:benchmarks}
\vspace{-1em}

We consider two additional synthetic benchmark cases studied by \citet{navon2021learning}. We demonstrate that SUHNPF works well in these cases, since the considered functions are either convex or monotone within the feasible domain in both cases.

\textbf{Case A:}
\vspace{-0.5em}
\begin{align*}
    &f_1(x_1,x_2) = ((x_1-1)x_2^2 + 1)/3, \, f_2(x_1,x_2) = x_2\\
    &\text{s.t.} \,\, g_1,g_2: 0 \leq x_1, x_2 \leq 1 \numberthis \label{eq:phn1}
\end{align*}
\vspace{-2.0em}

\textbf{Case B:}
\vspace{-0.5em}
\begin{align*}
    &f_1(x_1,x_2) = x_1, \, f_2(x_1,x_2) = 1 - (x_1/(1+9x_2))^2 \\
    &\text{s.t.} \,\, g_1,g_2: 0 \leq x_1, x_2 \leq 1 \numberthis \label{eq:phn2}
\end{align*}
\vspace{-2.0em}

Please note that although PHN \citep{navon2021learning} states $f_2=x_1$ for Eq. \ref{eq:phn1}, we believe this is a typo \wrt the original work by \citet{evtushenko2013nonuniform}, where this case was proposed, as the Pareto front reported in their work is achieved only for $f_2=x_2$. We therefore proceed with this updated form.
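
The fronts of both cases can be derived by hand (a short sketch added for illustration, assuming both objectives are minimized over the box $0 \leq x_1, x_2 \leq 1$). In Case A, $f_1$ is non-decreasing in $x_1$ while $f_2$ does not depend on $x_1$, so the front is attained at $x_1=0$. In Case B, $f_2$ is non-decreasing in $x_2$ while $f_1$ does not depend on $x_2$, so the front is attained at $x_2=0$. Eliminating the remaining variable gives
\begin{align*}
    \text{Case A:} \quad & f_1 = (1 - f_2^2)/3, \quad f_2 \in [0,1]\\
    \text{Case B:} \quad & f_2 = 1 - f_1^2, \quad f_1 \in [0,1]
\end{align*}
By contrast, with $f_2=x_1$ both objectives of Case A would be minimized simultaneously at $(x_1,x_2)=(0,1)$, where $f_1=f_2=0$, and the front would collapse to a single point.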

\begin{figure}[ht]
    \centering
    \vspace{-0.5em}
    \begin{subfigure}{0.4\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/illus-phncaseA.png}
      \vspace{-2.0em}
      \caption{Case A}
    \end{subfigure}
    \begin{subfigure}{0.4\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/illus-phncaseB.png}
      \vspace{-2.0em}
      \caption{Case B}
    \end{subfigure}
    \vspace{-1.0em}
    \caption{\small Functional Domain of cases from PHN}
    \vspace{-1em}
    \label{fig:casesphn}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Loss Profiles} \label{app:loss}
\vspace{-1em}

\textbf{Fig. \ref{fig:loss}} shows the loss profiles for the benchmark cases I-III. SUHNPF converged in $5$ iterations, with each iteration running for $2$ epochs, using error tolerance $10^{-4}$ for both the outer gradient descent loop ($\epsilon_{outer}$) and the inner gradient descent loop ($\epsilon_{inner}$). Since the last layer of the SUHNPF network classifies points as being {\em weak} Pareto or not, the loss enforced is Binary Cross Entropy (blue line). We also report the Mean Squared Error (MSE, dashed red line) between the current iterate of the point set $\mathcal{P}1$ and the true analytical solution manifold. \textbf{Alg. 1 (main)} updates the Pareto candidate set in the outer descent loop. Since the inner descent loop, which measures the training loss, runs for $2$ epochs per update of $\mathcal{P}1$, the MSE is measured only once per iteration. This results in the staircase nature of the MSE loss.


\vspace{-1em}
\section{Runtime Complexity} \label{app:runtime}
\vspace{-1em}

Assume the following notation from the main text for clarity, \ie $k$: number of objectives; $m$: number of constraints; $n$: number of variables; $\mathcal{P}$: number of Pareto candidates; $\mathcal{I}$: number of iterations until convergence. In \textbf{Algorithm 1 (main)}, the outer while loop (lines 4-9) takes $\mathcal{I}$ iterations until convergence. Note that for any gradient descent based solver, the number of iterations $\mathcal{I}$ is not known \textit{a priori}. Per iteration, training the neural network with $\mathcal{P}$ points using the FJC in step 5 and the error computation in step 6 jointly cost $\mathcal{O}(\mathcal{P}(k+m)^2n)$. Step 7 can be broken down into two serial sub-computations: a) computing the gradient of the determinant, at cost $\mathcal{O}((k+m)^2n)$, and b) evaluating the determinant itself for the $\mathcal{P}$ candidates, at cost $\mathcal{O}(\mathcal{P}(k+m)^3)$. Step 8 takes $\mathcal{O}(\mathcal{P}(k+m))$ due to the update of $\mathcal{P}$ points using a $(k+m)$-dimensional vector. Hence the total compute cost per iteration is $\mathcal{O}(\mathcal{P}(k+m)^2n + (k+m)^2n + \mathcal{P}(k+m)^3 + \mathcal{P}(k+m))$, which simplifies to $\mathcal{O}(\mathcal{P}(k+m)^2n + \mathcal{P}(k+m)^3)$. In practical deep MTL, $n \gg k,m$ (\ie the variable dimension is much greater than the number of functions and constraints in any neural setting), so the complexity is dominated by the term $\mathcal{O}(\mathcal{P}(k+m)^2n)$, which scales linearly in the variable dimension $n$ and quadratically in the number of functions and constraints $k,m$. Thus, the overall compute cost for $\mathcal{I}$ iterations is $\mathcal{I} \times \mathcal{O}(\mathcal{P}(k+m)^2n)$. 
Note that this is similar to the runtime complexity reported for EPO \citep{mahapatra2020multi}, where their \textbf{point based solver} takes $\mathcal{O}(k^2n)$ per point per iteration, leading to a total compute cost of $\mathcal{I} \times \mathcal{O}(\mathcal{P}k^2n)$ for $\mathcal{P}$ points and $\mathcal{I}$ iterations under an unconstrained optimization setting. Since SUHNPF also supports constrained optimization, we have replaced `$k$' with `$k+m$'.


We report both the asymptotic and actual runtimes for EPO and SUHNPF in \textbf{Table \ref{tab:runtime}}. Note that for Case I, although the complexity of SUHNPF is twice that of EPO ($\mathcal{O}(800)$ \vs $\mathcal{O}(400)$), \ie both are asymptotically similar, in actual runtime SUHNPF is much faster than EPO (10s \vs 752s). SUHNPF moves candidates towards being Pareto optimal through the use of the FJC-guided discriminator in \textbf{Algorithm 1 (main)}, while EPO has to solve two separate primal and dual problems to find Pareto optimal points. This gives SUHNPF a lower runtime per iteration than EPO, hence it converges faster too. Furthermore, SUHNPF constructs the approximate Pareto manifold in addition to finding $50$ candidates on it. Achieving similar functionality with EPO would require first computing $50$ points via $50$ runs of EPO, then training a neural network to regress over those $50$ points. This is what PHN \citep{navon2021learning} does in its PHN-EPO configuration. Hence in \textbf{Table~4 (main)}, PHN has a higher runtime than EPO.

\begin{table}[bht]
	\centering
	\caption{\small Per-iteration time complexity and full runtime of Cases I-III for EPO \vs SUHNPF. We do not report runtime for EPO on Case II, because their solver reported NaNs, or complexity and runtime on Case III, because EPO does not support constrained optimization. As shown above, the asymptotic complexity of EPO is given by $\mathcal{O}(\mathcal{P}k^2n)$, whereas that of
	SUHNPF is given by $\mathcal{O}(\mathcal{P}(k+m)^2n + \mathcal{P}(k+m)^3)$.}
	\vspace{-1em}
 	\resizebox{\columnwidth}{!}{%
		\begin{tabular}{l|cccc|cc|cc}
			\toprule
			& \multicolumn{4}{c|}{Variables} & \multicolumn{2}{c|}{EPO} & \multicolumn{2}{c}{SUHNPF} \\
 			\bf Cases & $k$ & $m$ & $n$ & $\mathcal{P}$ & Complexity & Runtime (sec) & Complexity & Runtime (sec) \\
			\midrule
			I & 2 & 0 & 2 & 50 & $\mathcal{O}(400)$ & 752 & $\mathcal{O}(800)$ & 10 \\ 
			II & 2 & 0 & 30 & 50 & $\mathcal{O}(6000)$ & - & $\mathcal{O}(6400)$ & 20 \\ 
			III & 2 & 2 & 2 & 50 & - & - & $\mathcal{O}(4800)$ & 10 \\ \bottomrule      
	\end{tabular}}
	\label{tab:runtime}
\end{table}
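
As a worked check of the complexity columns in Table~\ref{tab:runtime}, the entries can be reproduced by substituting the listed values of $k$, $m$, $n$ and $\mathcal{P}$ into the two expressions from the caption. This is only a sketch of the arithmetic; constants hidden by the $\mathcal{O}(\cdot)$ notation are ignored, so the counts indicate relative scale rather than actual operation counts.
\begin{align*}
\text{EPO, I:}\quad & 50 \cdot 2^2 \cdot 2 = 400, \\
\text{EPO, II:}\quad & 50 \cdot 2^2 \cdot 30 = 6000, \\
\text{SUHNPF, I:}\quad & 50 \cdot 2^2 \cdot 2 + 50 \cdot 2^3 = 800, \\
\text{SUHNPF, II:}\quad & 50 \cdot 2^2 \cdot 30 + 50 \cdot 2^3 = 6400, \\
\text{SUHNPF, III:}\quad & 50 \cdot 4^2 \cdot 2 + 50 \cdot 4^3 = 4800.
\end{align*}
The cubic term $\mathcal{P}(k+m)^3$ accounts for the whole gap between the two methods in Cases I and II ($400$ in both), and it dominates in Case III, where $k+m=4$ ($3200$ of $4800$).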

While correctness and point density in finding the true Pareto optimal solutions should be our top priority in comparing methods, we also report the runtime of SUHNPF \vs other MTL approaches on the studied cases. We explicitly request that each method generate $50$ Pareto candidates within the feasible functional domain. \textbf{Table~4 (main)} reports the percentage of such candidates obtained and the overall execution time, averaged over $10$ runs each, given our experimental setup in Appendix \ref{app:setup}.


PHN uses either EPO or LS as its base solver, hence we report the total time, which includes (a) the runtime of the base solver and (b) the neural network runtime to learn the regression manifold. Cases I and III have a 2D variable domain, where SUHNPF takes $1$s per epoch, with $2$ epochs for training in Step 7 of \textbf{Alg. 1 (main)}. Both cases took $5$ epochs to converge, resulting in a total runtime of $10$s. Case II has a 30D variable domain, where SUHNPF takes $2$s per epoch, resulting in a total runtime of $20$s. While the runtimes of LS and MOOMTL are on a similar scale to that of SUHNPF, they fail to generate an even spread of points (\textbf{Fig. 3 (main) (a,b)}).

\vspace{-1em}
\section{Space Complexity Analysis} \label{app:space}
\vspace{-1em}

MTL methods solve problems in both the primal and the dual space, \ie the gradients of the objectives in the primal and the trade-offs $\alpha$ in the dual. SUHNPF, however, works only in the primal space, \wrt the gradients of the functions needed to construct the Fritz-John matrix, since the FJC ensure $\alpha$-free identification of stationary points. Thus, the additional dual optimization space is not required. To compare fairly \wrt MTL methods, we consider the general cost of both systems under an unconstrained setting, \ie only objectives and no additional constraints. Thus $k$ and $n$ denote the number of objectives and the dimension of the variable space, respectively.

\textbf{SUHNPF}. To find $P$ Pareto candidates, SUHNPF updates $P$ points, which requires memory of size $Pn$. The $\nabla F^T \nabla F$ and $\nabla \det$ matrices are of size $k^2$ and $nk$, respectively. The total memory cost is thus of order $O(n(P+k)+k^2)$.

\textbf{MTL}. To find $P$ candidates, MTL methods use $P$ cones or rays, requiring memory of size $Pn$. The gradient matrix of the objective function $\nabla F$ takes $nk$, constructing the simplex takes $k^2$, solving for the trade-off $\alpha$ takes $k^2$, and the iterative update requires an additional $nk$ memory. The total memory cost is thus of order $O(n(P+2k)+2k^2)$.
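
The two totals follow from summing the terms listed above. A short worked version is given below, with the Case I values ($k=2$, $n=2$, $P=50$) substituted purely as an illustration; the resulting numbers count stored entries under this accounting and are not measured memory footprints.
\begin{align*}
\text{SUHNPF:}\quad & Pn + k^2 + nk = n(P+k) + k^2 \\
& = 2 \cdot 52 + 4 = 108, \\
\text{MTL:}\quad & Pn + nk + k^2 + k^2 + nk \\
& = n(P+2k) + 2k^2 = 2 \cdot 54 + 8 = 116.
\end{align*}
The difference is $nk + k^2$, \ie the storage attributed to solving for $\alpha$ and to the iterative update. When $P \gg k$, both totals are dominated by the shared $Pn$ term.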

\vspace{-1em}
\section{Uniformity and Coverage} \label{app:coverage}
\vspace{-1em}

SUHNPF starts with a random set of $2\mathcal{P}$ candidates. $\mathcal{P}$ of these are then moved by our FJC-guided \textbf{Algorithm 1 (main)} towards being Pareto candidates within the tolerance bound. Since all starting points are generated uniformly at random within the feasible set, they tend to have an even initial spread. When the FJC-guided descent is run, each of these points is directed towards its nearest Pareto front via gradient descent. Empirically, we observe that this suffices to achieve a good spread of the $\mathcal{P}$ training points on the Pareto front, without any explicit optimization target to promote or enforce even spread.


MTL methods partition the space into equal-sized cones or preference rays, which leads to semi-uniformity \textit{iff} the obtained candidates lie within the specified cone or in the vicinity of the preference ray. EPO \citep{mahapatra2020multi} is the only method whose algorithm includes an explicit loss criterion to ensure even spread / uniformity of points. However, note that these cone and ray approaches implicitly assume a symmetric and uniform front, which is not known \textit{a priori} and is rarely seen in practice.

To measure spread, we distinguish two concepts:
\begin{description}
    \item[uniformity] evenness of spread of candidates along the front
    \item[coverage] the extent of the front spanned by extreme Pareto points
\end{description}

For {\em uniformity}, we report the average and maximum Euclidean distance between two neighboring Pareto candidates. Smaller values indicate greater density (average and worst case, respectively).

For \textit{coverage}, we report the $\ell_2$ distance between the two farthest Pareto candidates on the front. Larger values are better, indicating that a wider range of the front is spanned by the Pareto points found.
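
The two measures can be written out as follows. The notation is introduced here only for illustration and does not appear in the main paper: we assume $N$ feasible candidates with objective vectors $y_1,\dots,y_N$, indexed in their order along the front (for two objectives, \eg sorted by $f_1$), so that $y_i$ and $y_{i+1}$ are neighbors.
\begin{align*}
d_i &= \lVert y_{i+1} - y_i \rVert_2, \quad i = 1,\dots,N-1, \\
\text{Avg. Dist.} &= \frac{1}{N-1} \sum_{i=1}^{N-1} d_i, \\
\text{Max. Dist.} &= \max_{1 \le i \le N-1} d_i, \\
\text{Coverage} &= \max_{1 \le i < j \le N} \lVert y_i - y_j \rVert_2 .
\end{align*}
Under this reading, a method can score well on coverage while scoring poorly on uniformity if its candidates cluster near the two extremes of the front.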

Table \ref{tab:spread} reports results on benchmark Case I. Note that for fair evaluation, we only consider candidates that are produced within the feasible functional bounds of the problem. We observe that LS performs best in terms of coverage ($\ell_2$) for our run, while SUHNPF performs best on both uniformity measures.

\begin{table}[bht]
	\centering
	\caption{\small Evenness of spread of Pareto points found across methods for Case I, as measured by \textit{uniformity} and \textit{coverage}. }
	\vspace{-1em}
 	\resizebox{\columnwidth}{!}{%
		\begin{tabular}{c|cccccc|c}
			\toprule
			\bf Method & LS & \!\!\!\!\!{\footnotesize MOOMTL}\!\!\!\!\! & PMTL\!\! & \!\!EPO\!\! & \!\!EPSE\!\! & \!\!PHN\!\! & \bf \!\!\!{\footnotesize SUHNPF}\!\!\! \\ \midrule 
			Avg. Dist. & 0.087 & 0.089 & 0.030 & 0.035 & 0.059 & 0.031 & \textbf{0.029} \\      
			Max. Dist. & 0.261 & 0.235 & 0.117 & 0.122 & 0.231 & 0.085 & \textbf{0.078} \\   
			Coverage & \textbf{1.256} & 0.843 & 1.110 & 1.201 & 1.252 & 1.214 & 1.254 \\ \bottomrule
	\end{tabular}}
	\label{tab:spread}
\end{table}

\vspace{-1em}
\section{Why Pareto Front Learning?} \label{app:motivation}
\vspace{-1em}

The goal of PFL \citep{navon2021learning} (or any Pareto HyperNetwork) is to induce the full Pareto manifold from training in order to be able to show users the entire space of optimal trade-offs that are feasible. This empowers users to then choose any solution point they prefer on the manifold, \textit{a posteriori}. In contrast, prior point-based methods from operations research (OR) and multi-task learning (MTL) find individual Pareto points only. Lacking prior knowledge of the manifold, a user would have to formulate an abstract preference trade-off over objectives (\eg 25\% $f_1$, 75\% $f_2$), input that to a point solver, and then see what they get. If they don’t like the result, they would then have to iterate, running the point-solver repeatedly with different preferences until satisfied.


\textit{Motivation for building on HNPF.} We build on HNPF \citep{singh2021hybrid} for two key reasons: 1) its support for non-convexity in functional objectives, variable domain, and/or constraints, due to its use of the Fritz-John Conditions (FJC), and 2) its guarantee of Pareto solution correctness within the $\epsilon$ error tolerance parameter (1e-4 in our experiments). In addition, point-based solvers from OR are accurate and support non-convexity but are inefficient (\wrt the scaling of the variable dimension), while MTL solvers are efficient but have limited support for, and accuracy under, non-convexity. SUHNPF strives to deliver both accuracy and scalability.

\textit{Motivation for SUHNPF over prior work.} Prior MTL approaches to learning the Pareto front have sought to find an even spread of Pareto points across the front by using ray or cone-based methods to partition the space uniformly. These works, however, assume that the spread is uniform and symmetric in the functional space. Since the nature of the Pareto front is not known \textit{a priori}, such symmetry assumptions can lead to misleading results and expectations. For example, \textbf{Fig. 2 (main)} and \textbf{Fig. 5 (main)} show that the Pareto front in the functional domain is not symmetric around the 45-degree line (or implicitly assumed as $\alpha=0.5$). In contrast, adopting a hypernetwork approach allows learning the full Pareto front without any such assumptions regarding its shape. The prior PHN \citep{navon2021learning} hypernetwork, by comparison, fits a post hoc regression surface over a set of Pareto points found by point-based solvers. In addition to being post hoc, it also inherits all the above limitations of its point-based solvers (EPO or LS). SUHNPF instead explicitly learns a classification boundary between Pareto vs. non-Pareto points via FJC.

\begin{figure*}[th]
    \centering
    \vspace{-1em}
     \begin{subfigure}{0.3\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/ls01.pdf}
      \vspace{-2em}
      \caption{$\alpha=0.1$}
    \end{subfigure}
    \begin{subfigure}{0.3\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/ls05.pdf}
      \vspace{-2em}
      \caption{$\alpha=0.5$}
    \end{subfigure}
    \begin{subfigure}{0.3\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/ls09.pdf}
      \vspace{-2em}
      \caption{$\alpha=0.9$}
    \end{subfigure}
    \vspace{-1em}
    \caption{\small Contour plot of the scalarized objective $S(x_1,x_2)=\alpha f_1(x_1,x_2) + (1-\alpha)f_2(x_1,x_2)$ as a function of the variables $x_1,x_2$ for different values of $\alpha$. The heatmap shows the discretized contour levels. A red cross ({\color{red}$\times$}) marks the position of a minimum on the surface.}
    \label{fig:lscase1}
    \vspace{-1em}
\end{figure*}

\textit{Extraction of Trade-off value.} For any general Pareto HyperNetwork, once the entire Pareto front is approximated, one can simply select a point on the front and compute the value of trade-off $\alpha$ \textit{a posteriori}. The cost for post-computing the value would vary from one method to another. Similar to HNPF, we also take $\mathcal{O}(k)$ \ie linear runtime in objectives for $\alpha$ extraction.

\textit{Need for differentiable functions.} We require the objectives and constraints to be at least once differentiable, in accordance with the Fritz-John Conditions in \textbf{Section 4 (main)}: our framework uses gradients of the objectives to check for optimality, so these gradients must exist and be computable. The same requirement holds for all MTL methods and, in general, for any method that relies on gradient descent. As noted earlier, this further motivates continuing development of differentiable measures \citep{swezey2021pirank}.

\textit{Convex Utility.} Although the final objective is a weighted linear combination of the two objectives, $\alpha f_1 + (1-\alpha) f_2$, note that $f_1$ and $f_2$ are losses operating on a neural network, hence the loss surfaces of both $f_1$ and $f_2$ are non-convex in nature. A convex combination of two convex functions is guaranteed to be convex, but no such guarantee holds for a convex combination of two non-convex functions \citep{boyd2004convex}.
%As such, we are unable to verify if methods like optimistic linear support, or recursive simplex subdivision, would be able to converge to good local optima under a neural setting. 
Many practical classification and recommendation problems have been shown to be non-convex in nature \citep{hsieh2015pu}, hence their linear (convex) combination can still lead to non-convex fronts. Practical examples include Low Rank Matrix Recovery and Robust Linear Regression as in \citep{jain2017non}.
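
For completeness, the standard argument behind the convex case is sketched below. The points $x,y$ and the interpolation weight $\theta \in [0,1]$ are notation introduced here for illustration only. Let $S = \alpha f_1 + (1-\alpha) f_2$ with $\alpha \in [0,1]$ and write $z = \theta x + (1-\theta) y$. If $f_1$ and $f_2$ are convex, then
\begin{align*}
S(z) &= \alpha f_1(z) + (1-\alpha) f_2(z) \\
&\le \alpha \big[\theta f_1(x) + (1-\theta) f_1(y)\big] \\
&\quad + (1-\alpha) \big[\theta f_2(x) + (1-\theta) f_2(y)\big] \\
&= \theta S(x) + (1-\theta) S(y).
\end{align*}
The inequality uses the convexity of each $f_i$ together with $\alpha \ge 0$ and $1-\alpha \ge 0$. When either $f_i$ is non-convex, this step is no longer available, so convexity of $S$ cannot be concluded.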


\textit{Need for benchmarking on non-convex setting.} %Although the method proposed in [Vamplew et al., 2009] constructs the convex hull of points in order to aim for a better solution, under practical situations such convex hull strategies might yield solutions that are not actually feasible. 
Refer to \textbf{Fig. 2 (main)} for Case I, where the non-convex region of the front is the boundary of the feasible set of solutions. Taking convex combinations of the endpoints $(1,0)$ and $(0,1)$ constructs a $135^{\circ}$ line, and one can always find points on that line that have lower values of both $f_1$ and $f_2$ than points in the non-convex region of the front, \ie that lie strictly below it. However, given the feasible domain of the problem, no feasible solution can lie below the non-convex portion of the front in that region.
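
A small worked check makes this concrete. It is added here as an illustration and uses only the Case I definitions $f_1 = x_1$ and $f_2 = 1 + x_2^2 - x_1 - 0.1\sin(3\pi x_1)$. For a fixed value of $f_1 = x_1$, the smallest attainable $f_2$ occurs at $x_2 = 0$, so the lower boundary of the feasible set in objective space is
\begin{align*}
f_2 &= 1 - f_1 - 0.1 \sin(3\pi f_1).
\end{align*}
At $f_1 = 0.5$ this gives
\begin{align*}
f_2 &= 1 - 0.5 - 0.1\sin(1.5\pi) = 0.5 + 0.1 = 0.6,
\end{align*}
whereas the line through the endpoints, $f_1 + f_2 = 1$, gives $f_2 = 0.5$ at the same $f_1$. The point $(0.5, 0.5)$ therefore lies below the front and is not attainable by any feasible $(x_1, x_2)$.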


Although various Pareto solvers exist in different research communities,
they do not scale well to practical problems, especially optimizing the
weights of large neural networks. MTL approaches were motivated by the
need for scalable Pareto solvers to tackle such problems.


\vspace{-1em}
\section{Analysis of Linear Scalarization} \label{app:ls}
\vspace{-1em}

We refer to Case I here for analysis on Linear Scalarization (LS).  
\resizebox{\linewidth}{!}{
\begin{minipage}{\linewidth}
\begin{align*}
    &f_1(x_1,x_2) = x_1 , \, f_2(x_1,x_2) = 1 + x_2^2 - x_1 - 0.1\sin(3 \pi x_1)\\
    &\text{s.t.} \quad g_1: 0 \leq x_1 \leq 1, g_2: -2 \leq x_2 \leq 2
\end{align*}
\end{minipage}
}

Pareto optimal points here correspond to stationary points of the scalarized objective $S(x_1,x_2)=\alpha f_1 + (1-\alpha) f_2$ for different trade-offs $\alpha \in [0,1]$.
Although some prior studies have asserted that LS cannot handle any non-convexity, \textbf{Fig. 3 (main) (a)} shows that LS finds Pareto points in the non-convex portions of the front.

To explore this case further, Fig. \ref{fig:lscase1} plots the contour surface of the objective $S(x_1,x_2)$ as a function of its variables $x_1,x_2$, for three different values of $\alpha \in \{0.1, 0.5, 0.9\}$ (similar functional plots can be shown for any value of $\alpha \in [0,1]$). Across plots, we observe that the functional landscape contains one or more interior minima (marked by a red cross {\color{red}$\times$}) for the smaller trade-off values shown ($\alpha = 0.1$ and $\alpha = 0.5$). Any gradient descent algorithm would settle on one of these minima, depending on the choice of initialization and step size. On the other hand, for $\alpha=0$ or $\alpha =1$ (optimizing only one objective, not shown), and for large trade-off values such as $\alpha = 0.9$ (\ie the linear function $f_1=x_1$ dominating), there are no interior optima. In these cases, any gradient descent algorithm would settle on the boundary of the feasible set. We thus observe for Case I that $\forall \alpha \in [0,1]$, LS will always find a feasible Pareto candidate.
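
The three plotted panels can be cross-checked analytically. The short derivation below is a sketch added for illustration, using only the Case I definitions; it is not taken from the LS implementation. Differentiating $S = \alpha f_1 + (1-\alpha) f_2$ gives
\begin{align*}
\frac{\partial S}{\partial x_2} &= 2(1-\alpha)\, x_2, \\
\frac{\partial S}{\partial x_1} &= (2\alpha - 1) - 0.3\pi (1-\alpha) \cos(3\pi x_1).
\end{align*}
For $\alpha < 1$ the first expression vanishes only at $x_2 = 0$, and the second vanishes where
\begin{align*}
\cos(3\pi x_1) &= \frac{2\alpha - 1}{0.3\pi(1-\alpha)}.
\end{align*}
The right-hand side is approximately $-0.94$ for $\alpha = 0.1$ and exactly $0$ for $\alpha = 0.5$, so interior stationary points exist in both panels. For $\alpha = 0.9$ it is approximately $8.49$, which no cosine value can match; there $\partial S / \partial x_1 > 0$ throughout the domain, and the minimizer sits on the boundary at $x_1 = 0$.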
% \ml{need to mention case III if you will plot it too.}\sg{just case 1, case 2,3 no1 complained about LS}

Note that we adopt the LS implementation from PMTL's public source code\footnote{\scriptsize\url{https://github.com/Xi-L/ParetoMTL}}. It is fairly straightforward to plug the Case I functions into their code and observe that LS indeed produces Pareto candidates in the non-convex region of the front.

Note that if Case I had contained local or global maxima, LS, which solves the minimization problem $\min\, S(x_1,x_2)$, would naturally not find these maxima.


\balance
\bibliography{References}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{document}
