% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

%%%%%%%%%%

\usepackage{balance}
\usepackage{wrapfig}
\usepackage{soul}
\usepackage{pifont}
\usepackage{amsthm}
\usepackage{amsmath,amsfonts,amssymb}
\usepackage{mathtools}
\usepackage{multirow}
\usepackage{subcaption}
\usepackage{algorithm}
\usepackage{algorithmicx}
\usepackage[noend]{algpseudocode}
\usepackage{comment}
\usepackage{enumitem}
\usepackage{xspace}
\usepackage{epigraph}

\usepackage{xcolor}
\hypersetup{
    colorlinks=true,
    linkcolor=teal,
    citecolor=teal,
    urlcolor=teal
}

\newcommand*{\eg}{\textit{e.g.,}\@\xspace}
\newcommand*{\ie}{\textit{i.e.,}\@\xspace}
% \newcommand*{\st}{\textit{s.t.}\@\xspace}
\newcommand*{\vs}{\textit{vs.}\@\xspace}
\newcommand*{\wrt}{\textit{w.r.t.}\@\xspace}
\makeatletter
\newcommand*{\etc}{%
	\@ifnextchar{.}%
	{\textit{etc}}%
	{\textit{etc.}\@\xspace}%
}
% \renewcommand\@makefntext[1]{%
% \setlength\parindent{1em}%
% \noindent
% \mbox{\@thefnmark.~}{#1}}
\makeatother

\algtext*{EndWhile}
\algtext*{EndFor}
\algtext*{EndIf}
\algdef{SE}[DOWHILE]{Do}{doWhile}{\algorithmicdo}[1]{\algorithmicwhile\ #1}%

%%%%%%%%%%%%%% Math garbage
\newtheorem{theorem}{Theorem}
\newtheorem{definition}{Definition}
\newtheorem{corollary}{Corollary}[theorem]
\newtheorem{lemma}{Lemma}[theorem]
\newtheorem{proposition}{Proposition}
\newtheorem*{remark}{Remark}
\newtheorem{case}{Case}

%%%%%%%%%%%%%%%% Operator garbage
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\Tr}{Tr}

\makeatletter
\def\BState{\State\hskip-\ALG@thistlm}
\makeatother

\newcommand{\cmark}{{\color{blue}\ding{51}}}%
\newcommand{\xmark}{{\color{red}\ding{55}}}

\newcommand\numberthis{\addtocounter{equation}{1}\tag{\theequation}}

% ML 4/23/22
% http://mirrors.ibiblio.org/CTAN/macros/latex/contrib/changes/changes.english.pdf

%\usepackage[final]{changes}
% ---EXAMPLES---
% \added{text to add}
% \replaced{new text}{old text}
% \deleted{text to cut}.

% 2/19/22: todonotes don't play well with UAI package, overlap text, so switching
% to inline highlighlting for now
% \setlength{\marginparwidth}{0.5cm}
\usepackage[textsize=scriptsize,backgroundcolor=green,textwidth=15mm]{todonotes}
\newcommand\sg[1]{\todo{{\bf SG}: #1}} % Soumya
\newcommand\ml[1]{\todo{{\bf ML}: #1}} % Matt
% \newcommand\sg[1]{\hl{{\bf SG}: #1}} % Soumya
% \newcommand\ml[1]{\hl{{\bf ML}: #1}} % Matt

% \usepackage{changes}

\newcommand\D{\mathcal{D}}
\newcommand\s{\mathcal{S}}

\newcommand{\etal}{et al.~}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%

\title{Learning a Neural Pareto Manifold Extractor with Constraints}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Soumyajit Gupta}
\author[2]{Gurpreet Singh}
\author[3]{Raghu Bollapragada}
\author[4]{Matthew Lease}
% Add affiliations after the authors
\affil[1]{%
    %smjtgupta@utexas.edu, 
    Department of Computer Science, University of Texas at Austin, USA
}
\affil[2]{%
    %gurpreet@xtractorai.com, 
    XtractorAI
}
\affil[3]{%
    %raghu.bollapragada@utexas.edu, 
    Operations Research and Industrial Engineering, University of Texas at Austin, USA
}
\affil[4]{%
    %ml@utexas.edu, 
    School of Information, University of Texas at Austin, USA 
}
% \affil[5]{%
%     University of Texas at Austin, USA
% }
  
\begin{document}
\maketitle

\begin{abstract}
\vspace{-2em}
Multi-objective optimization (MOO) problems require balancing competing objectives, often under constraints. The {\em Pareto optimal} solution set defines all possible optimal trade-offs over such objectives. In this work, we present a novel method for {\em Pareto-front learning}: inducing the full Pareto manifold at train-time so users can pick any desired optimal trade-off point at run-time. Our key insight is to exploit Fritz-John Conditions for a novel guided {\em double gradient descent} strategy. Evaluation on synthetic benchmark problems allows us to vary MOO problem difficulty in a controlled fashion and measure accuracy \vs known analytic solutions. We further test scalability and generalization in learning optimal neural model parameterizations for Multi-Task Learning (MTL) on image classification. Results show consistent improvement in accuracy and efficiency over prior MTL methods as well as techniques from operations research.
%We bridge previously disparate lines of work from operations research and more recent MTL work, showing . 

\end{abstract}

\vspace{-1em}
\section{Introduction}
\vspace{-1em}

Multi-Objective Optimization (MOO) problems require balancing multiple objectives, which often compete with one another, under further constraints \citep{van1994saddle,ehrgott2005saddle}. The {\em Pareto optimal} solution set \citep{pareto1906manuale} comprises all saddle points \citep{ehrgott2005saddle} such that no objective can be further improved without penalizing at least one other objective. 

As operational systems today increasingly seek to balance competing objectives, research on Pareto optimal learning has quickly grown across tasks such as fair classification \citep{balashankar2019fair,martinez2020minimax}, diversified ranking \citep{liu2019skyrec,sacharidis2019top}, and recommendation \citep{xiao2017fairness,azadjalal2017trust}. 
% See surveys by \citet{caton2020fairness,crawshaw2020multi,marler2004survey}.\sg{Last sentence needed here or in Lit Review?}
%Because practical MOO problems not only contain competing objectives, but often additional domain defined constraints, 
Many practical classification and recommendation problems have been shown to be non-convex \citep{hsieh2015pu}. A general Pareto solver should thus support optimization for both non-convex objectives and constraints. 

Because MOO problems typically lack a single global optimum, one must choose among optimal solutions by selecting a trade-off over competing objectives. Ideally this choice could be deferred to run-time, so that each user could choose whichever trade-off they prefer. Unfortunately, prior Pareto solvers have typically required training a separate model to find the Pareto solution point for each desired trade-off.

To address this, recent work has proposed {\em Pareto front learning} (PFL): inducing the full Pareto manifold in training so that users can quickly select any desired optimal trade-off point at run-time \citep{navon2021learning,lin2021controllable,singh2021hybrid}. These works learn a neural model manifold to map any desired trade-off over objectives to a corresponding Pareto point. See \textbf{Appendix L} for additional motivation for PFL. As with other supervised learning, inducing 
an accurate prediction model requires high-quality training data, \ie the Pareto points used for training should be accurate. 

In this work, we devise an efficient Pareto search procedure for \citet{singh2021hybrid}'s HNPF model, so that we may benefit from its correctness guarantees in identifying true Pareto points for PFL training. While HNPF supports non-convex MOO with constraints and bounded error, it suffers from a lack of scalability with increasing variable space. Our innovation is a novel,  % that exploits Fritz-John Conditions (FJC) \citep{levi2006application,gobbi2015analytical}. 
%Specifically, our {\em Scalable Unidirectional HNPF} (SUHNPF) model 
guided {\em double gradient descent} strategy, updating the candidate point set in the outer descent loop and the manifold estimators in the inner descent loop. We refer to the resulting model as {\em Scalable Unidirectional HNPF} (SUHNPF). 

Our evaluation spans both synthetic benchmarks and multi-task learning (MTL) problems. Benchmark problems allow us to conduct controlled experiments varying MOO problem complexity (\eg the presence of constraints and/or convexity in variable or function domains). Analytic solutions to benchmark problems enable us to measure the true accuracy of model predictions, something which is often difficult or impossible on real-world problems. Additional evaluation on a set of MTL problems in image classification enables us to further test scalability and generalization in learning Pareto optimal weights for high-dimensional neural models.

Results across synthetic benchmarks and MTL problems show clear, consistent advantages of SUHNPF in terms of capability (handling non-convexity and constraints), denser coverage and higher accuracy in recovering the true Pareto front, and greater efficiency (time and space). Beyond empirical findings, our conceptual framing and review of prior work also serve to further bridge complementary lines of prior work in MTL and operations research. For reproducibility, we share our source code and data\footnote{\scriptsize\url{https://github.com/smjtgupta/SUHNPF}}.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vspace{-1em}
\section{Definitions}
\vspace{-1em}

We adopt Pareto definitions from \cite{marler2004survey}. A general MOO problem can be formulated as follows:
\vspace{-0.5em}
\begin{align*}
    &\underset{}{\textrm{optimize}} \quad F(x) = (f_1(x),\ldots,f_k(x)) \numberthis \label{eq:multi}\\
    &\text{s.t.} \quad x \in S = \{ x \in \mathbb{R}^n | G(x)=(g_1(x),\ldots,g_m(x)) \leq 0 \}
\end{align*}
\vspace{-2em}

with $n$ variables $(x_1,\ldots,x_n)$, $k$ objectives $(f_1,\ldots,f_k)$, and $m$ constraints $(g_1,\ldots,g_m)$. Here, $S$ is the feasible set, \ie the set of input values $x$ that satisfy the constraints $G(x)$. For an MOO problem optimizing $F(x)$ subject to $G(x)$, the solution is usually a manifold as opposed to a single global optimum; therefore, one must find the set of all points that satisfy the chosen definition for an optimum. 
% Pareto solutions occur when the functions achieves optimality with any specified constraints.

\vspace{-0.5em}
\textbf{Strong Pareto Optimal:} A point $\tilde{x}^* \in S$ is {\em strong} Pareto optimal if no point in the feasible set exists that improves an objective without detriment to at least one other objective.
% A point $\tilde{x}^*$ is {\em strong} Pareto optimal if there does not exist another $x_j \in X$, \textit{s.t.} $f_p(x_j) \leq f_p(x^*)$ for all functions $f_p$ and $f_l(x_j) < f_l(x^*)$ for at least one function $f_l$:
\vspace{-0.5em}
\begin{align*}
    &\nexists x_j \in S: f_p(x_j) \leq f_p(\tilde{x}^*), \quad \textrm{for} \quad p=1,2,\ldots,k,\\
    &\textrm{and} \quad f_l(x_j) < f_l(\tilde{x}^*) \quad \textrm{for at least one} \quad l \numberthis \label{eq:pareto}
\end{align*}
\vspace{-2.0em}
% In other words, no point exists that improves an objective without detriment to at least one other objective.

\textbf{Weak Pareto Optimal:} A point $\tilde{x}^* \in S$ is {\em weak} Pareto optimal if no other point exists in the feasible set that improves all of the objectives simultaneously. This is a weaker requirement than strong Pareto optimality: for a weak Pareto point, other feasible points might exist that improve at least one objective without detriment to another.
\vspace{-0.5em}
\begin{align}
    \nexists x_j: f_p(x_j) < f_p(\tilde{x}^*), \quad \textrm{for} \quad p=1,2,\ldots,k
\end{align}
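
\textbf{Example.} As an illustrative single-variable problem (constructed here for exposition only; it is not one of the benchmarks), take $n=1$, $k=2$, and $S=[0,2]$ with
\begin{align*}
    f_1(x) = x, \qquad f_2(x) = \max(1-x,\,0),
\end{align*}
both to be minimized. On $[0,1]$, decreasing $f_1$ increases $f_2$ and vice versa, so every $x \in [0,1]$ is strong Pareto optimal. For $x \in (1,2]$, $f_2(x)=0$ cannot be improved further, so no feasible point improves both objectives at once and $x$ is weak Pareto optimal. However, $x_j=1$ attains $f_1(1) < f_1(x)$ with $f_2(1)=f_2(x)$, so such an $x$ is not strong Pareto optimal. The weak Pareto set is thus $[0,2]$, which strictly contains the strong Pareto set $[0,1]$.
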
\vspace{-2.5em}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vspace{-0.75em}
\section{Related Work} \label{sec:related}
\vspace{-1.0em}

% Pareto optimality has been a topic of interest in engineering, economics and recently gained significant attention in fair classification, diverse ranking and recommendation. Readers are referred to surveys by \citet{marler2004survey,caton2020fairness,crawshaw2020multi}. \ml{space cut?}

\textbf{Linear Scalarization (LS)}. A variety of work has adopted LS to find Pareto points \citep{xiao2017fairness,lin2019pareto,milojkovic2019multi}. 
% For example, the Weighted Sum Method (WSM) \cite{cohon2004multiobjective} 
LS converts an MOO problem into a single-objective optimization (SOO) problem using a convex combination of objective functions and constraints. However, because Karush-Kuhn-Tucker (KKT) conditions are known to hold true only for convex cases \citep{boyd2004convex}, LS solutions are guaranteed to be Pareto optimal only under a fully convex setting of objectives and constraints, as shown in \cite{gobbi2015analytical}.
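
As a minimal sketch of LS in its common weighted-sum form (the weights $w_i$ are notation introduced here for illustration only, and the constraints are kept in the feasible set $S$ rather than folded into the sum), the $k$ objectives of Eq. \eqref{eq:multi} are collapsed into a single objective:
\begin{align*}
    \min_{x \in S} \; \sum_{i=1}^{k} w_i f_i(x), \qquad w_i \geq 0, \quad \sum_{i=1}^{k} w_i = 1.
\end{align*}
Each fixed choice of weights defines one single-objective problem, so each solve returns one candidate trade-off point.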

\textbf{Operations Research (OR)}. A variety of OR methods support MOO problems with non-convex objectives and constraints, guaranteeing correctness within a user-specified error tolerance. Correctness has also been further verified by evaluation on synthetic MOO benchmark problems with known, analytic solutions. However, a key limitation of these methods is lack of scalability: they suffer from significant computational and run-time limitations as the variable dimension increases. Hence, they cannot be applied to optimizing neural model parameters for MOO problems.

\begin{table}[bht]
    \centering
    \vspace{-0.5em}
    \caption{\small SUHNPF \vs existing Operations Research (OR) and Multi-Task Learning (MTL) methods. OR methods account for both objectives and constraints, produce Pareto points only, and are known to find true Pareto points for non-convex MOO problems. % with known solutions. 
    However, these methods do not scale to high-dimensional neural MOO problems. In contrast, MTL methods scale well but typically do not support constraints and can struggle with non-convexity.} %, as benchmarking results show in Section \ref{sec:benchmark}.}
    \vspace{-1.0em}
    \resizebox{\columnwidth}{!}{%
    \begin{tabular}{ll|ccc}
    \toprule
        \bf Type & \bf Method & \bf Finds Only  & \bf Handles & \bf  Scalable \\ 
        & & \bf \!\!Pareto points\!\! & \bf  \!\!Constraints\!\! & \bf \!\!Neural MOO\!\!\!\! \\ \midrule
        %
        % \multicolumn{2}{c|}{Linear Scalarization: WSM\! [\citeyear{cohon2004multiobjective}]} & \cmark & \xmark & \\ \midrule
        %
    \multirow{4}{40pt}{Operations Research (OR)}        
        & NBI [\citeyear{das1998normal}] & \cmark & \cmark & \xmark \\
        & mCHIM [\citeyear{ghane2015new}] & \cmark & \cmark & \xmark \\
        & PK [\citeyear{pirouz2016computational}] & \cmark & \cmark & \xmark \\
        & HNPF [\citeyear{singh2021hybrid}] & \cmark & \cmark & \xmark \\ \midrule
    \multirow{5}{40pt}{Multi-Task Learning (MTL)} 
        & MOOMTL [\citeyear{sener2018multi}]\!\!\! & \xmark & \xmark & \cmark \\
        & PMTL [\citeyear{lin2019pareto}] & \xmark & \xmark & \cmark \\
        & EPO [\citeyear{mahapatra2020multi}] & \xmark & \xmark & \cmark \\
        & EPSE [\citeyear{ma2020efficient}] & \xmark & \xmark & \cmark \\
        & PHN [\citeyear{navon2021learning}] & \xmark & \xmark & \cmark \\ \midrule
        Ours & \bf SUHNPF & \cmark & \cmark & \cmark \\ \bottomrule        
    \end{tabular}}
    \label{tab:comp}
    \vspace{-1em}
\end{table}

Examples include enhanced scalarization approaches such as NBI \citep{das1998normal}, mCHIM \citep{ghane2015new}, and PK \citep{pirouz2016computational}. NBI produces an evenly distributed set of Pareto points given an evenly distributed set of weights,  
% Furthermore, NBI produces Pareto points in the non-convex parts of the Pareto curve while being independent of the relative scales of the objective functions. 
using the concept of Convex Hull of Individual Minima (CHIM) to break down the boundary/hull into evenly spaced segments before tracing the {\em weak} Pareto points. mCHIM improves upon NBI via a quasi-normal procedure to update the aforementioned CHIM set iteratively, to obtain a strong Pareto set. PK uses a local $\epsilon$-scalarization based strategy that searches for the Pareto front using controllable step-lengths in a restricted search region, thereby accounting for non-convexity. 

\textbf{Multi-Task Learning (MTL)}. Recent MTL works % \citep{sener2018multi,lin2019pareto,mahapatra2020multi,ma2020efficient,navon2021learning} 
have devised Pareto solvers for estimating high-dimensional neural models. % without prior knowledge of the parameter space. %However, these methods typically do not support non-convex problems or constraints (Section \ref{sec:related}), nor report their accuracy on benchmark problems with known analytic solutions for the Pareto front. 
%
%Unlike OR methods, MTL methods are highly scalable. % in estimating high-dimensional neural models.
%with little prior knowledge over the range and precision of the parameter space. 
%
MOOMTL \citep{sener2018multi} effectively scales via a multi-gradient descent approach, but does not guarantee an even spread of solution points found along the Pareto front. PMTL \citep{lin2019pareto} addresses this spread issue by dividing the functional domain into equally spaced cones, but this increases computational complexity as the number of cones increases. EPO \citep{mahapatra2020multi} extends preference rays along specified weights to find Pareto points evenly spread in the vicinity of the rays. EPSE \citep{ma2020efficient} uses a combination of the Hessian of the functions and a Krylov subspace to find Pareto solutions. 

MTL methods rely upon KKT %Karush-Kuhn-Tucker (KKT) 
conditions %\citep{boyd2004convex} 
to check for optimality, which assumes convexity (see earlier LS discussion). While these methods seek an even distribution of Pareto points by dividing the functional space into evenly spaced cones or preference rays, % (corresponding to different trade-offs over objectives), 
our results on a non-convex benchmark problem clearly show an uneven point spread (\textbf{Section~\ref{sec:case1}}). %This illustrates limitations of this approach in handling non-convexity. 
%
%MTL methods such as MOOMTL \citep{sener2018multi}, PMTL \citep{lin2019pareto}, EPSE \citep{ma2020efficient}, and EPO \citep{mahapatra2020multi} 
Moreover, most MTL methods %such as MOOMTL, PMTL, EPO, and EPSE 
are {\em point-based solvers}, meaning they must be run $P$ times to find $P$ points. %produce $P$ Pareto candidates. 
This is too expensive for adjusting trade-off preferences at run-time. %Moreover, these methods require \hl{specialized local initialization} (using cones, rays, or other domain partitioning strategies) to find an even spread of Pareto points across the feasible set of saddle points. 

{\bf Pareto front learning}. PFL methods \citep{navon2021learning,lin2021controllable,singh2021hybrid} induce the full Pareto manifold at train-time so that users can quickly select any desired optimal trade-off point at run-time. For example, a manifold model trained on $P$ Pareto points might then quickly produce any number of additional Pareto points via interpolation. Of course, high-quality training data is necessary to learn an accurate, supervised prediction model. The method used to acquire the Pareto points for model training, and the resulting accuracy of those points, are thus crucial to prediction accuracy.


\citet{navon2021learning}'s PHN considers two ways to acquire Pareto training points: LS and EPO [\citeyear{mahapatra2020multi}]. \citet{lin2021controllable} use their PMTL [\citeyear{lin2019pareto}] method to identify Pareto points for training. \citet{singh2021hybrid}'s HNPF uses the Fritz-John conditions (FJC) \citep{marucsciac1982fritz} to identify Pareto points.

Like other OR methods, HNPF provides a theoretical guarantee of Pareto front accuracy within a user-specified error tolerance. In evaluation on canonical OR benchmark problems, HNPF was shown to recover known Pareto fronts across various non-convex MOO problems while also being more efficient in finding Pareto points than NBI [\citeyear{das1998normal}], mCHIM [\citeyear{ghane2015new}], and PK [\citeyear{pirouz2016computational}]. However, like other OR methods, HNPF cannot scale to learn optimal high-dimensional neural model weights for MOO problems. % like other OR methods, HNPF's search of the variable domain cannot scale to learning neural models for MOO problems. 

\citet{ha2016hypernetworks}'s hypernetworks proposed training one neural model to generate effective weights for a second, target model. \citet{navon2021learning} and \citet{lin2021controllable} apply this approach to learn a manifold mapping MOO solutions to different target model weights, enabling the target model to achieve the desired Pareto trade-off for the MOO problem. 
%Specifically, \citeauthor{navon2021learning}'s Pareto HyperNetwork (PHN) approach builds on \citet{ha2016hypernetworks}'s  hypernetworks: training one neural model to generate effective weights for a second, target model. 
However, HNPF cannot be similarly applied to MTL problems due to its lack of scalability. % in acquiring Pareto points for model training. %We address this deficit in this work. %Details of HNPF are further discussed in Section \ref{sec:hnpf}.

\vspace{-1em}
\section{Preliminaries} \label{sec:fjc}
\vspace{-1em}

{\bf Fritz John Conditions (FJC)}. Let the objective and constraint functions in Eq. \eqref{eq:multi} be differentiable once at a decision vector $x^* \in S$. The Fritz-John \citep{levi2006application} necessary condition for $x^*$ to be {\em weak} Pareto optimal is that there exist vectors $0 \leq \lambda \in \mathbb{R}^k$ and $0 \leq \mu \in \mathbb{R}^m$, with $(\lambda, \mu) \neq (0,0)$ (not identically zero), \textit{s.t.} the following holds:
\vspace{-2em}
\begin{align*}
    \sum_{i=1}^k \lambda_i \nabla f_i(x^*) + \sum_{j=1}^m \mu_j \nabla g_j(x^*) = 0 \numberthis \label{eq:fjcond} \\
    \mu_jg_j(x^*) = 0, \forall j=1,\ldots,m
\end{align*}
\vspace{-2em}

\citet{gobbi2015analytical} present an $L$ matrix form of FJC:
\vspace{-0.5em}
\begin{align*}
&L = \begin{bmatrix}
\nabla F & \nabla G \\
\mathbf{0} & G
\end{bmatrix} \quad [(n+m) \times (k+m)] \label{eq:fjmat} \numberthis \\
&\nabla F_{n \times k} = [\nabla f_1, \ldots, \nabla f_k]\\
&\nabla G_{n \times m} = [\nabla g_1, \ldots, \nabla g_m]\\
&G_{m \times m}=\mathrm{diag}(g_1,\ldots,g_m) 
\end{align*}
\vspace{-2.5em}

comprising the gradients of the functions and constraints. The matrix equivalent of FJC for $x^*$ to be Pareto optimal is the existence of $\delta = (\lambda, \mu) \in \mathbb{R}^{k+m}$, with $\mathbf{\delta}$ not identically zero, satisfying Eq. \eqref{eq:fjcond} such that:
\vspace{-0.5em}
\begin{align}
    L \cdot \delta = 0 \quad \text{s.t.} \quad L=L(x^*),\mathbf{\delta} \geq 0, \mathbf{\delta} \neq 0 \label{eq:fjmatrix}
\end{align}
\vspace{-0.5em}
Therefore, a non-trivial solution of Eq. \eqref{eq:fjmatrix} exists only if:
\begin{align}
    det(L^TL)=0 \label{eq:paropt}
\end{align}
\vspace{-2.5em}

\begin{remark}
    \small If $f_i$s and $g_j$s are continuous and differentiable once, then the set of weak Pareto optimal points is $x^*=\{x|det(L(x)^TL(x))=0\}$, $\delta \geq 0$ for a non-square matrix $L(x)$, and is equivalent to $x^*=\{x|det(L(x))=0\}$, $\delta \geq 0$, for a square matrix $L(x)$. \textup{See \textbf{Appendix C} for a proof of the above for the unconstrained setting only.}
    \end{remark}
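
\textbf{Example.} To illustrate the role of $\delta \geq 0$, consider a small unconstrained problem constructed here for exposition only (it is not one of the benchmarks): $n=2$, $k=2$, $m=0$, with $f_1(x)=x_1^2+x_2^2$ and $f_2(x)=(x_1-1)^2+x_2^2$. Then $L=\nabla F$ is square and
\begin{align*}
    L(x) = \begin{bmatrix} 2x_1 & 2(x_1-1) \\ 2x_2 & 2x_2 \end{bmatrix}, \qquad
    det(L(x)) = 4x_1x_2 - 4(x_1-1)x_2 = 4x_2,
\end{align*}
so $det(L(x))=0$ holds on the whole line $x_2=0$. On that line, $L \cdot \delta = 0$ reduces to $\lambda_1 x_1 + \lambda_2 (x_1-1) = 0$, that is, $x_1 = \lambda_2/(\lambda_1+\lambda_2)$. Requiring $\lambda \geq 0$ and $\lambda \neq 0$ then leaves only the segment $0 \leq x_1 \leq 1$, $x_2=0$ joining the two individual minima.
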
\vspace{-1.0em}

\textbf{Hybrid Neural Pareto Front (HNPF)}. % \label{sec:hnpf}
% \subsection{Hybrid Neural Pareto Front (HNPF)} \label{sec:hnpf}
% \vspace{-1.0em}
Like other Pareto front learning (PFL) methods, HNPF \citep{singh2021hybrid} learns a neural Pareto manifold from training data. With HNPF, Pareto points are acquired from training data via Fritz-John conditions. In particular, once a given data point from the input variable domain is mapped to the output function domain (via objective functions), FJC
%\citep{marucsciac1982fritz,gobbi2015analytical}
are tested to determine Pareto optimality for that point. %whether or not the output point lies on the Pareto front.  

HNPF's neural network first identifies {\em weak Pareto} points via feed-forward layers to smoothly approximate the \textit{weak} Pareto optimal solution manifold $M(X^*)$ as $\tilde{M}(\tilde{X},\Phi)$. %, shown in \textbf{Fig. \ref{fig:arch}}. 
The last layer of the network has two neurons with \textit{softmax} activation for binary classification of Pareto \vs non-Pareto points, distinguishing sub-optimal points from the {\em weak} Pareto points. % in the feasible set $S$. Note that t
The network loss is representation driven, since the Fritz John discriminator (Eq. \eqref{eq:paropt}), described by the objective functions and constraints, explicitly classifies each input data point $X_i$ as being {\em weak} Pareto or not.  
%

%
After identifying weak Pareto points, HNPF uses an efficient Pareto filter to find the subset of \textit{non-dominated} points. 
% ML 2/23/22: this is a contribution of HNPF, not our paper, so omitting for space
%Given $P$ points in the weak Pareto set, HNPF's filter reduces the filtering cost from $\mathcal{O}(P!)$ to $\mathcal{O}(Pkh)$, with $h$ as the functional domain precision parameter. 

% Given objective functions and constraints, HNPF  first samples data points from the variable domain to test for optimality. 
HNPF's scalability bottleneck lies in how it samples variable domain points to test for Pareto optimality in model training. If there are any direct constraints on variable values, this naturally restricts the feasible domain for sampling. However, lacking any prior distribution on where to find Pareto optima, HNPF performs uniform random sampling in the variable domain to ensure broad coverage for locating optima. For small benchmark problems with known variable domains, this suffices. However, it is infeasible to apply this to find optimal model parameters for a neural MOO model.

\vspace{-1em}
\section{Scalable Unidirectional HNPF}
\vspace{-1.0em}

To address HNPF's scalability bottleneck, we introduce SUHNPF, a scalable variant of HNPF for finding weak Pareto points with an arbitrary density and distribution of initial data points. This is achieved via a scalable unidirectional FJC-guided double-gradient descent algorithm that encompasses HNPF's neural manifold estimator. Given continuously differentiable loss functions, SUHNPF's guided double gradient descent strategy efficiently searches the variable domain to find Pareto optimal points in the function domain. This enables SUHNPF to learn an $\epsilon$-bounded approximation $\tilde{M}(\Theta^*)$ to the weak Pareto optimal manifold. 

%The filtering stage of the existing HNPF framework, which filters out any weak points that are dominated, remains unchanged.

\vspace{-1.0em}
\subsection{FJC-Guided Double Gradient Descent}
\vspace{-1.0em}

Constructing a classification manifold of Pareto \vs non-Pareto points requires a set of feasible points to represent both classes. Since the Pareto manifold is unknown \textit{a priori}, feasible points are drawn from a random distribution (lacking an informed prior) to initialize both classes. 
%Alternatively, one can initialize $\mathcal{P}1$ via some informed prior. 
We then refine the points in the Pareto class $\mathcal{P}1$ while holding the non-Pareto points $\mathcal{P}0$ constant.

% We now want to learn to distinguish the {\em foreground} distribution for the actual Pareto class ($\mathcal{P}1$) \vs this non-Pareto ($\mathcal{P}0$) {\em background} distribution.

% \hl {For the practical setting of estimating high-dimensional neural model parameters, we must further estimate a feasible set for sampling each model weight. To do so, we individually optimize each objective function $f_{1:k}$, identifying the model weights achieving each optimum. Across the $k$ objectives optimized, we then use the minimum and maximum values observed for each model weight as its feasible domain for sampling.}\sg{Needed here or better pushed to neural moo section??} 

We assume an equal-sized sample set of $P$ points for each class, which helps to address class imbalance for harsh cases. For benchmark problems where the feasible set over the variable domain is known, we randomly sample points over this feasible domain to initialize $\mathcal{P}1$ and $\mathcal{P}0$. Given these input points $x$, held constant for $\mathcal{P}0$ and used as initial seed values for $\mathcal{P}1$, \textbf{Alg. \ref{alg:fjc}} specifies our FJC-guided double-gradient descent algorithm. The algorithm iteratively %make $\mathcal{P}0$ represent the non-Pareto points and make 
updates $\mathcal{P}1$ towards the Pareto manifold via FJC-guided descent. The training dataset $D$ is the union $\mathcal{P}0 \cup \mathcal{P}1$. The algorithm iterates over Steps 5-9 until the error ($err$) falls below the user-specified error tolerance ($\epsilon_{outer}$).
\vspace{-0.5em}
\begin{equation}
    err = \sum_{p \in \mathcal{P}1} \left (det(L(p)^TL(p)) \right )^2 \label{eq:detloss}
\end{equation}
\vspace{-1.5em}

%start with a random sample of $2P$ points randomly sampled from the variable domain. We split points into two equal-sized sets of size $P$, representing seed candidates for Pareto ($\mathcal{P}1$) and non-Pareto ($\mathcal{P}0$) classes.
%: $\mathcal{P}0$ and $\mathcal{P}1$, each of size $P$, representing the seed candidates for Pareto and non-Pareto classes. 
%The set ($\mathcal{P}0$) is held constant, assuming randomly selected 

\begin{algorithm}[htb]
\small
	\caption{FJC-guided descent of variable domain}
	\begin{algorithmic}[1]
	\BState \textbf{Input}: Data $D = \mathcal{P}0 \cup \mathcal{P}1$ \Comment{Training Data}
	\BState \textbf{Input}: Functions $F$ and Constraints $G$
	\BState \textbf{Input}: Error tolerance $\epsilon_{outer}$, $\epsilon_{inner}$
%	\State Split $2P$ points evenly into $\mathcal{P}0$ and $\mathcal{P}1$ randomly
%	\State $D = \mathcal{P}0 \cup \mathcal{P}1$ \Comment{Form Training Data}
	\While {$err > \epsilon_{outer}$} \Comment{Run until convergence}
	    \State Train network using $D$ as data for $e$ epochs
	    \State Compute current error $err$ %\Comment{current error}
	    \State Compute $\nabla_{p} det = \frac{\partial det(L^TL)}{\partial p}$, $\forall p \in \mathcal{P}1$
	    \State $\mathcal{P}1 \leftarrow \mathcal{P}1 - \eta \nabla det$ \Comment{Update points in $\mathcal{P}1$}
	    \State $D = \mathcal{P}0 \cup \mathcal{P}1$ \Comment{Update Training Data}
	\EndWhile
	\BState \textbf{Output}: Weak Pareto manifold $\tilde{M}$ 
	\end{algorithmic} \label{alg:fjc}
\end{algorithm}
% \vspace{-1em}

Eq. \ref{eq:detloss} in Alg.~\ref{alg:fjc} ensures that all of the points in the Pareto set ($p \in \mathcal{P}1$) are optimal once we converge to the desired error tolerance $\epsilon_{outer}$. Hence, Step 7 computes the gradient of the scalar $det(L^TL)$ \wrt the variables at each point $p \in \mathcal{P}1$, yielding an approximation of the $\nabla det$ matrix. %Alg. \ref{alg:fjc} iteratively updates the points $\mathcal{P}1$ until they reach the Pareto manifold within the desired tolerance. 
The training data $D$ is then updated with the new values of $\mathcal{P}1$. 
% Note that $\mathcal{P}0$ remains unchanged throughout the process. 
The output is an approximation $\tilde{M}$ of the true weak Pareto manifold $M$ on the discrete dataset $D \subset X$. Note that in Step 8 we do not allow any point of $\mathcal{P}1$ to leave the feasible set $\mathcal{S}$: if a step would cross the boundary of the feasible set, the point is instead placed on the boundary.
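
As a sketch of this boundary handling, assume for illustration that the feasible set is a box, $\mathcal{S} = \prod_i [l_i, u_i]$, as with the bound constraints of Case I. The bounds $l_i, u_i$ and the coordinate-wise clipping below are illustrative assumptions, not part of Alg.~\ref{alg:fjc} itself. The update of Step 8 can then be written per coordinate as
\begin{align*}
    p_i \leftarrow \min\big\{u_i,\; \max\{l_i,\; p_i - \eta\,[\nabla_p det]_i\}\big\},
\end{align*}
which leaves interior steps unchanged and places any coordinate that would exit $\mathcal{S}$ on the corresponding face. Clipping is only one simple choice; for a general $\mathcal{S}$ the step would instead be truncated where it meets the boundary.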

Alg.~\ref{alg:fjc} includes two separate gradient descent steps. The outer descent loop (Steps 4--9) updates the candidate point set $\mathcal{P}1$ using the error $err$, measured through the squared loss in Eq.~\ref{eq:detloss}. The inner descent (Step 5) updates the parameters ($\Phi$) of the neural net so that $\tilde{M}(X,\Phi)$ closely approximates the Pareto manifold $M(X)$. This is done using the Binary Cross Entropy (BCE) loss on ($det(L(X)^TL(X)),\tilde{M}(X)$), and converges only when $BCE \leq \epsilon_{inner}$. The double-gradient update is \textit{unidirectional}: the outer loop influences the inner loop, but not vice versa.
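
For concreteness, the inner loss can be written in the standard binary cross-entropy form. The labeling below is an assumption made for illustration: we take $y(x)=1$ if $|det(L(x)^TL(x))| \leq \epsilon_{inner}$ and $y(x)=0$ otherwise, with the network output $\tilde{M}(x,\Phi) \in (0,1)$:
\begin{align*}
    BCE(\Phi) = -\frac{1}{|D|}\sum_{x \in D}\big[\, & y(x)\log \tilde{M}(x,\Phi) \\
    & + (1-y(x))\log\big(1-\tilde{M}(x,\Phi)\big)\big].
\end{align*}
Under this reading, only $\Phi$ is updated in Step 5, while the points in $D$ (and hence the labels) change only in the outer loop.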

%{\color{blue} 
{\bf Complexity Analysis}. The time complexity of Alg.~\ref{alg:fjc} is $\mathcal{O}(P(k+m)^2n + P(k+m)^3)$, where $P$ is the number of candidate points. In practical deep MTL settings, $n \gg k,m$ (\ie the variable dimension is much greater than the number of functions and constraints), so the complexity is dominated by the term $\mathcal{O}(P(k+m)^2n)$: scaling is linear in the variable dimension $n$ and quadratic in the number of functions and constraints $k+m$. The space complexity is $\mathcal{O}(n(k+m+P)+(k+m)^2)$. SUHNPF achieves better memory and run-time efficiency since it does not rely upon solving the primal and dual problems used in MTL methods; see \textbf{Appendix I} and \textbf{Appendix J} for detailed analysis.
%}
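
As a rough illustration of which term dominates, consider hypothetical sizes (chosen for the example only, not taken from our experiments): $50$ candidate points, $k=2$ objectives, $m=0$ constraints, and $n=10^6$ variables. Then
\begin{align*}
    50\,(k+m)^2 n &= 50 \cdot 4 \cdot 10^6 = 2 \times 10^8, \\
    50\,(k+m)^3 &= 50 \cdot 8 = 400,
\end{align*}
so the first term is larger by a factor of $n/(k+m) = 5 \times 10^5$. The two terms become comparable only when $n$ is of the same order as $k+m$.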

\vspace{-1em}
\section{Benchmarking} \label{sec:benchmark}
\vspace{-1em}
% \subsection{Motivation and Context}
% \vspace{-1em}

{\bf Motivation.} Lack of analytical solutions to real MOO problems makes it difficult to measure the true accuracy of any Pareto solver. Consequently, we follow the OR literature in advocating that the correctness of any proposed Pareto solver should first be tested on constructed benchmark problems with known analytic solutions. This is also consistent with broader ML community practice of first evaluating proposed methods across a range of simulated, controlled conditions to verify correctness, often yielding valuable insights into model behavior prior to evaluation on real data.

We consider three such benchmark problems (Cases I--III). These problems are non-convex in either the functional or variable domain, or due to constraints (\textbf{Table \ref{tab:cases}}). Note that {\em whether or not the Pareto front itself is non-convex is not always the best indicator of benchmark difficulty}. For example, even though both objectives are non-convex in Case II, the Pareto front is still convex. As we shall see, PHN \citep{navon2021learning} fails on Case II despite performing well on two benchmark problems with non-convex fronts in its own study. In general, non-convexity can greatly challenge MTL approaches relying on KKT conditions in testing solutions for optimality (see \textbf{Appendix E}).

\begin{table}[bht]
    \centering
    \vspace{-0.5em}
    \caption{\small Characterization of benchmark cases, including convexity (C) \vs non-convexity (NC) in variable and function domains.}
%    Analysis of the three cases chosen and performance evaluation of OR and MTL methods \vs SUHNPF.}
    \vspace{-1em}
    \resizebox{\columnwidth}{!}{%
    \begin{tabular}{r|rccc|ccc}
    \toprule
        \bf \!\!Case\!\! & \!\!{\bf Dim}\!\! & \begin{tabular}[c]{@{}c@{}}\bf Variable\\ \bf Domain\end{tabular} & \begin{tabular}[c]{@{}c@{}}\bf Function\\\bf Domain\end{tabular} & \begin{tabular}[c]{@{}c@{}}\bf Includes\\ \bf \!\!Constraints\!\!\end{tabular} & \begin{tabular}[c]{@{}c@{}}\bf OR\\\bf Methods\end{tabular} & \begin{tabular}[c]{@{}c@{}}\bf MTL\\\bf Methods\end{tabular} & \bf SUHNPF \\ \midrule
        I & 2 & Linear & NC & No & Sparse, Slow & \!\!\!Sparse, Fast\!\!\! & Dense, Fast \\
        II & 30 & NC & C & No & Sparse, Slow & Fail & Dense, Fast \\
        III & 2 & NC & NC & Yes & Sparse, Slow & Fail & Dense, Fast \\ \bottomrule        
    \end{tabular}}
    \label{tab:cases}
%    \vspace{-1em}
\end{table}

{\bf Experimental Setup}. For each of Cases I--III, each method is tasked with finding $P=50$ Pareto points. OR methods search until any $P$ Pareto points are found. MTL methods divide the functional search quadrant into cones/rays, seeking one Pareto point per split. Manifold-based methods (PHN, HNPF, and SUHNPF) search for $P$ Pareto points in order to learn the manifold. Ideally, each method should identify an even spread (\ie broad coverage) of points across the true Pareto front (shown in gray in each figure) in order to faithfully approximate it. We report the runtime taken by each solver to find the points.

% We report the number of candidates evaluated for optimality during the iterative sequence for each solver.

SUHNPF starts with $P$ random candidates that are progressively refined via its guided, double gradient descent strategy. Following HNPF \citep{singh2021hybrid}, we adopt the same error tolerance $10^{-4}$ for both $\epsilon_{outer}$ and $\epsilon_{inner}$. Any point $x$ that satisfies $|det(L(x)^TL(x))| \leq \epsilon_{inner}$ is thus classified as being Pareto (exact zero is often impossible given finite machine precision). Source code for the LS, MOOMTL, PMTL, and EPO solvers is taken from EPO's repository, while EPSE and PHN are each run from their own source code (see \textbf{Appendix D}). Based on \cite{navon2021learning}'s findings, we evaluate the more accurate PHN variant, PHN-EPO, which we refer to simply as PHN. 

Due to key differences between OR \vs MTL methods, results for each group are presented separately. First, OR methods not only support the full range of non-convex conditions across Cases I-III, but provide error tolerance parameters to guarantee correctness (and our experiments confirm this). Consequently, we report only the efficiency of OR  methods in Table \ref{tab:evalsOR}. In contrast, MTL methods produced variable accuracy on Case I and failed entirely on Cases II and III (as shall be discussed). Consequently, Table \ref{tab:evalsMTL} reports accuracy and efficiency of MTL methods for Case I only. 
%As noted earlier, OR methods cannot scale to MTL problems, and such scalability is key hallmark strength of the MTL methods.

\textbf{Appendix D} discusses experimental setup, \textbf{Appendix G} has two other benchmarks, and \textbf{Appendix H} has loss profiles.

\vspace{-1.0em}
\subsection{Case I: \cite{ghane2015new}} \label{sec:case1}
\vspace{-2.0em}

\resizebox{\linewidth}{!}{
\begin{minipage}{\linewidth}
\begin{align*}
    &f_1(x_1,x_2) = x_1 , \, f_2(x_1,x_2) = 1 + x_2^2 - x_1 - 0.1\sin(3 \pi x_1)\\
    &\text{s.t.} \quad g_1: 0 \leq x_1 \leq 1, g_2: -2 \leq x_2 \leq 2
\end{align*}
\end{minipage}
}
\vspace{-0.5em}

The analytical Pareto solution to this joint minimization problem is $M: 0 \leq x_1 \leq 1,x_2=0$. In \textbf{Fig. \ref{fig:illus-var}} we observe SUHNPF's randomly generated point set $\mathcal{P}1$ (red dots) converging towards the true manifold $M$ as a discrete approximation $\tilde{M}$. Point set $\mathcal{P}0$ (blue dots) is held constant and serves as a representative of the (background) non-Pareto class. Iteration 5 is the last because the error falls below the user-specified $\epsilon_{outer}$. The final weak Pareto set comprises the $|\mathcal{P}1| = P$ refined points plus any $\mathcal{P}0$ point that happens to fall within the $\epsilon_{outer}$ threshold. Hence Alg. \ref{alg:fjc} ensures 100\% Pareto point density in $\mathcal{P}1$, a vast improvement over HNPF \citep{singh2021hybrid}, where only $\approx$ 2\% density was achieved. % for this problem.
%$10k$ training data points were needed to produce a density of $\sim \mathbf{2}\%$.
\textbf{Fig. \ref{fig:illus-func}} shows functional domain convergence. %The Pareto manifold is linear in the variable domain, while being non-convex in the functional domain. 
SUHNPF achieves an even spread of points in the non-convex portion of the front.
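
The analytical solution can be checked by hand. As a sketch, consider interior points where the bound constraints $g_1,g_2$ are inactive, and assume that the optimality test there reduces to the Jacobian of the two objectives (a simplification made for this illustration):
\begin{align*}
    \nabla f_1 &= (1,\,0)^T,\\
    \nabla f_2 &= \big(-1-0.3\pi\cos(3\pi x_1),\; 2x_2\big)^T,\\
    \det[\nabla f_1 \;\, \nabla f_2] &= 2x_2 .
\end{align*}
The gradients are linearly dependent only when $x_2=0$. On that line, $0.3\pi \approx 0.94 < 1$ gives $\partial f_2/\partial x_1 < 0$ for every $x_1$, while $\partial f_1/\partial x_1 = 1 > 0$, so the objectives trade off along the whole segment. Any point with $x_2 \neq 0$ is dominated by the point with the same $x_1$ and $x_2=0$, which recovers $M$.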

\begin{figure}[ht]
    \centering
    \vspace{-1.0em}
     \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/illus-form2-iter0.png}
      \vspace{-2em}
      \caption{Iteration 0 (Start)}
    \end{subfigure}
    % \begin{subfigure}{0.49\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/illus-form2-iter1.png}
    %   \vspace{-2em}
    %   \caption{Iteration 1}
    % \end{subfigure}
    % \begin{subfigure}{0.49\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/illus-form2-iter2.png}
    %   \vspace{-2em}
    %   \caption{Iteration 1}
    % \end{subfigure}
    % \begin{subfigure}{0.49\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/illus-form2-iter3.png}
    %   \vspace{-2em}
    %   \caption{Iteration 3}
    % \end{subfigure}
    % \begin{subfigure}{0.49\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/illus-form2-iter4.png}
    %   \vspace{-2em}
    %   \caption{Iteration 4}
    % \end{subfigure}
    \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/illus-form2-iter5.png}
      \vspace{-2em}
      \caption{Iteration 5 (Converged)}
    \end{subfigure}
    \vspace{-1em}
    \caption{\small Case I: Variable domain. The gray line shows the true analytic solution ($0 \leq x_1 \leq 1$, $x_2=0$). SUHNPF Pareto candidates $\mathcal{P}1$ (red dots) converge in 5 iterations. Non-Pareto candidates $\mathcal{P}0$ (blue dots) are held constant throughout the iterative sequence.}
%    Convergence of the FJC guided algorithm for Case I variable domain. The analytical solution (grey line) for this problem is $0 \leq x_1 \leq 1, x_2=0$, which matches the set of Pareto candidates (red dots).}
    \label{fig:illus-var}
    \vspace{-1em}
\end{figure}

\begin{figure}[ht]
    \centering
    % \vspace{-0.5em}
    \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/iter0-func-form2.png}
      \vspace{-2em}
      \caption{Iteration 0 (Start)}
    \end{subfigure}
    % \begin{subfigure}{0.49\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/iter1-func-form2.png}
    %   \vspace{-2em}
    %   \caption{Iteration 1}
    % \end{subfigure}
    % \begin{subfigure}{0.45\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/iter2-func-form2.png}
    %   \vspace{-2em}
    %   \caption{Iteration 2}
    % \end{subfigure}
    % \begin{subfigure}{0.49\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/iter3-func-form2.png}
    %   \vspace{-2em}
    %   \caption{Iteration 3}
    % \end{subfigure}
    % \begin{subfigure}{0.45\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/iter4-func-form2.png}
    %   \vspace{-2em}
    %   \caption{Iteration 4}
    % \end{subfigure}
    \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/pareto-form2.png}
      \vspace{-2em}
      \caption{Iteration 5 (Converged)}
    \end{subfigure}
    \vspace{-1em}
    \caption{\small Case I: Functional domain corresponding to Figure \ref{fig:illus-var}. SUHNPF Pareto candidates $\mathcal{P}1$ (red dots) converge in 5 iterations.}
    \label{fig:illus-func}
    \vspace{-1em}
\end{figure}

% \ml{clarify this paragraph} 
% Since we have a discrete approximation of the Pareto manifold $\tilde{M}$, we can plot the functional domain in \textbf{Fig. \ref{fig:illus-func}}. 
% Note that all points are distinct since our solution is a manifold based strategy, unlike other point based solution where some explicit additional criteria needs to be enforced for ensuring distinct points. 
\textbf{Fig. \ref{fig:case1-mtl}} presents results for Linear Scalarization (LS) and several MTL methods: MOOMTL, PMTL, EPO, EPSE, and PHN. Refer to \textbf{Appendix F} for iterative convergence plots for Case I, and \textbf{Appendix K} for evaluation measures on \textit{uniformity} and \textit{coverage} for the compared methods. LS successfully produces a number of points in the non-convex portions of the front, despite prior studies %\cite{sener2018multi,lin2019pareto,mahapatra2020multi,navon2021learning} 
often asserting that LS cannot handle any non-convexity. Refer to \textbf{Appendix M} for analysis and justification.
%LS finds points in the non-convex portion of the front, contradicting the claim made in MTL work. Since PHN relies on the EPO solver, any mistakes in EPO is reflected in the manifold of PHN.

To check for optimality, MTL methods rely upon KKT conditions that implicitly assume convexity (see Section \ref{sec:related}). The non-convex nature of $f_2$ is thus challenging for these KKT-based methods. For example, some methods seek an even distribution of Pareto points by breaking up the functional space into evenly spaced cones or preference rays for trade-off values $\alpha$. However, the uneven point spread seen on this non-convex benchmark illustrates limitations of the cone-based approach in handling non-convexity. We also clearly see non-Pareto points produced by some methods. 
%these methods are challenged to achieve an even spread for this non-convex case. 
%Hence these MTL solvers fail to converge (error tolerance).
% for different trade-off values $\alpha$}\ml{clarify}. 

\begin{figure}[ht]
    \centering
    %\vspace{-1em}
     \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/ls.png}
      \vspace{-2em}
      \caption{Linear Scalarization (LS)}
    \end{subfigure}
    \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/moomtl.png}
      \vspace{-2em}
      \caption{MOOMTL}
    \end{subfigure}
    \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/pmtl.png}
      \vspace{-2em}
      \caption{PMTL}
    \end{subfigure}
    \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/epo.png}
      \vspace{-2em}
      \caption{EPO}
    \end{subfigure}
    \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/epse.png}
      \vspace{-2em}
      \caption{EPSE}
    \end{subfigure}
    \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/phn.png}      
      \vspace{-2em}
      \caption{PHN}
    \end{subfigure}
    %\vspace{-1em}
    \caption{\small Case I: function domain for LS and MTL methods. No method produces all $50$ of the requested Pareto points. PMTL, EPO and PHN also find non-Pareto points (circled in blue). Methods vary greatly in their coverage of points spanning the true front.}
    \label{fig:case1-mtl}
    %\vspace{-1em}
\end{figure}

\vspace{-1.5em}
\subsection{Case II: \cite{zhang2008multiobjective}}
\vspace{-2em}

\resizebox{\linewidth}{!}{
\begin{minipage}{\linewidth}
\begin{align*}
    &f_1(x) = x_1 + \frac{2}{|J_1|}\sum_{j \in J_1}y_j^2 \hspace{1em},\hspace{1em} f_2(x) = 1 - \sqrt{x_1} + \frac{2}{|J_2|}\sum_{j \in J_2}y_j^2 \\
    &\text{s.t.} \quad g_1,\ldots,g_{30}: 0 \leq x_1 \leq 1, -1 \leq x_j \leq 1, j=2,\ldots,n\\
    & J_1=\{j|j \, \textrm{is odd},2 \leq j \leq n\},J_2=\{j|j \, \textrm{is even},2 \leq j \leq n\}\\
    & y_j = \left\{\begin{matrix}
            x_j - [0.3x_1^2 \cos(24\pi x_1 + \frac{4j\pi}{n}) + 0.6x_1] \cos(6\pi x_1 + \frac{j\pi}{n}) \quad j \in J_1   \\ 
            x_j - [0.3x_1^2 \cos(24\pi x_1 + \frac{4j\pi}{n}) + 0.6x_1] \sin(6\pi x_1 + \frac{j\pi}{n}) \quad j \in J_2
\end{matrix}\right.
\end{align*}
\end{minipage}
}
\vspace{-0.5em}

This joint minimization case operates in an $n=30$-dimensional variable space. \textbf{Fig. \ref{fig:illus-case3}} shows the true Pareto front and SUHNPF convergence in the variable domain. Note the non-convexity in the variable domain, where $x_1$ varies uniformly over $[0,1]$, while $x_2,\ldots,x_{30}$ are sinusoidal in nature, guided by $x_1$. Thus, the Pareto manifold has a spiral trajectory along $x_2,\ldots,x_{30}$ with evolution along $x_1$. 
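
The structure of the solution follows directly from the problem definition. The sums of $y_j^2$ in $f_1$ and $f_2$ are non-negative and vanish exactly when $y_j=0$ for all $j$, which is feasible because the bracketed amplitude is at most $0.9$ in magnitude, so the corresponding $x_j$ stay within $[-1,1]$. On this set the objectives reduce to
\begin{align*}
    f_1 = x_1, \quad f_2 = 1-\sqrt{x_1} \quad \Rightarrow \quad f_2 = 1-\sqrt{f_1}, \;\; 0 \leq f_1 \leq 1,
\end{align*}
and any point with some $y_j \neq 0$ is dominated by the point with the same $x_1$ and all $y_j=0$. The front $f_2 = 1-\sqrt{f_1}$ is therefore convex in the functional domain, even though the set $y_j=0$ is an oscillating curve in the variable domain.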

Despite the Pareto front being convex, the objectives are non-convex. For MTL methods, the {\small \texttt{min\_norm\_solver}} \citep{sener2018multi}, which is integral to all MTL solvers, simply fails. Consequently, no MTL results are reported. % for MTL methods for this case.

For SUHNPF, following random initialization (iteration 0, Fig.~\ref{fig:illus-case3}(a)), we observe that the candidate set $\mathcal{P}1$ propagates toward increasing values of $x_1$ and approximates the expected Pareto manifold at iteration 5 (Fig.~\ref{fig:illus-case3}(b)). 
% ML space - doesn't add anything
%\textbf{Fig. \ref{fig:illus-form3}} shows the corresponding functional domain.

\begin{figure}[ht]
    \centering
    \vspace{-0.5em}
    \begin{subfigure}{0.4\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/form5-iter0.png}
      \vspace{-1.5em}
      \caption{Iteration 0 (Start)}
    \end{subfigure}
    % \begin{subfigure}{0.49\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/form5-iter1.png}
    %   \vspace{-1.5em}
    %   \caption{Iteration 1}
    % \end{subfigure}
    % \begin{subfigure}{0.49\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/form5-iter2.png}
    %   \vspace{-2em}
    %   \caption{Iteration 1}
    % \end{subfigure}
    % \begin{subfigure}{0.49\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/form5-iter3.png}
    %   \vspace{-1.5em}
    %   \caption{Iteration 3}
    % \end{subfigure}
    % \begin{subfigure}{0.49\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/form5-iter4.png}
    %   \vspace{-2em}
    %   \caption{Iteration 4}
    % \end{subfigure}
    \begin{subfigure}{0.4\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/form5-iter5.png}
      \vspace{-1.5em}
      \caption{Iteration 5 (Converged)}
    \end{subfigure}
    \vspace{-1em}
    \caption{\small Case II: variable domain (SUHNPF). 
%    Convergence for Case II variable domain. The analytical solution for this problem matches the set of Pareto candidates (red dots), which is guided by a linear trajectory in $x_1$ and sinusoids in $x_2,x_3$ respectively. 
We restrict the plots to three dimensions ($x_1$, $x_2$, and $x_3$) for visualization.}
    \label{fig:illus-case3}
    \vspace{-1.5em}
\end{figure}

\begin{figure}[ht]
    \centering
    \vspace{-1em}
    \begin{subfigure}{0.4\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/iter0-form3-func.png}
      \vspace{-2em}
      \caption{Iteration 0 (Start)}
    \end{subfigure}
    % \begin{subfigure}{0.49\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/iter1-form3-func.png}
    %   \vspace{-2em}
    %   \caption{Iteration 1}
    % \end{subfigure}
    % \begin{subfigure}{0.45\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/iter2-func-form2.png}
    %   \vspace{-2em}
    %   \caption{Iteration 2}
    % \end{subfigure}
    % \begin{subfigure}{0.49\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/iter3-form3-func.png}
    %   \vspace{-2em}
    %   \caption{Iteration 3}
    % \end{subfigure}
    % \begin{subfigure}{0.45\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/iter4-func-form2.png}
    %   \vspace{-2em}
    %   \caption{Iteration 4}
    % \end{subfigure}
    \begin{subfigure}{0.4\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/iter5-form3-func.png}
      \vspace{-2em}
      \caption{Iteration 5 (Converged)}
    \end{subfigure}
    \vspace{-1.0em}
    \caption{\small Case II: functional domain (SUHNPF).}
    \label{fig:illus-form3}
    \vspace{-1.5em}
\end{figure}

\vspace{-2em}
\subsection{Case III: \cite{tanaka1995ga}}
\vspace{-2.5em}

\resizebox{\linewidth}{!}{
\begin{minipage}{\linewidth}
\begin{align*}
    &f_1(x_1,x_2) = x_1, \, f_2(x_1,x_2) = x_2\\
    &\text{s.t.} \quad g_1(x_1,x_2)= (x_1-0.5)^2 + (x_2-0.5)^2 \leq 0.5\\
    & g_2(x_1,x_2)= x_1^2 + x_2^2 - 1 - 0.1 \cos (16 \arctan ({x_1}/{x_2})) \geq 0\\
    &g_3,g_4:0 \leq x_1, x_2 \leq \pi
\end{align*}
\end{minipage}
}
\vspace{-0.5em}

For this joint minimization problem, the Pareto front is determined by the two constraints $g_1$ and $g_2$, while the linear functions $f_1$ and $f_2$ do not contribute to the Pareto optimal solution. \textbf{Fig. \ref{fig:illus-case2}} shows the convergence of SUHNPF Pareto candidates toward the known solution manifold. 

Because MTL approaches do not support constraints, they are not capable of solving this benchmark problem. However, note that if we were to remove constraints $g_1$ and $g_2$, $f_1$ and $f_2$ would then become independent of each other (and so not compete). The front then collapses to the point $(0,0)$, corresponding to the minimum of both functions. For this unconstrained problem, MTL methods would be expected to find this correct Pareto optimal solution point.

\begin{figure}[ht]
    \centering
    \vspace{-0.5em}
     \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/form3-iter0.png}
      \vspace{-2em}
      \caption{Iteration 0 (Start)}
    \end{subfigure}
    % \begin{subfigure}{0.49\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/form3-iter1.png}
    %   \vspace{-2em}
    %   \caption{Iteration 1}
    % \end{subfigure}
    % \begin{subfigure}{0.49\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/form3-iter2.png}
    %   \vspace{-2em}
    %   \caption{Iteration 1}
    % \end{subfigure}
    % \begin{subfigure}{0.49\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/form3-iter3.png}
    %   \vspace{-2em}
    %   \caption{Iteration 3}
    % \end{subfigure}
    % \begin{subfigure}{0.49\linewidth}
    %   \centering
    %   \includegraphics[width=\linewidth]{figs/form3-iter4.png}
    %   \vspace{-2em}
    %   \caption{Iteration 4}
    % \end{subfigure}
    \begin{subfigure}{0.49\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/form3-iter5.png}
      \vspace{-2em}
      \caption{Iteration 5 (Converged)}
    \end{subfigure}
    \vspace{-1.0em}
    \caption{\small Case III: variable domain. The analytical solution for this problem is driven by constraints $g_1,g_2$. SUHNPF Pareto candidates $\mathcal{P}1$ (red dots) converge to the true front.}
    \label{fig:illus-case2}
    \vspace{-0.5em}
\end{figure}

Case III highlights the need for any manifold-based extractor to support both explicit and implicit forms of the Pareto front. Cases I and II have an explicit form of the front in the functional and variable domains. However, Case III has an implicit Pareto front (Fig. \ref{fig:illus-case2}) owing to constraints $g_1,g_2$, which impose an implicit relation between $x_1,x_2$ and therefore between $f_1,f_2$. SUHNPF's ability to construct a full rank diffusive indicator function of Pareto \vs non-Pareto points enables it to approximate the true manifold.

\vspace{-1em}
\subsection{SUHNPF \vs OR and MTL Methods}
\vspace{-1em}

\textbf{Table \ref{tab:evalsOR}} reports the runtime of OR methods \vs SUHNPF to find $P=50$ Pareto points for Cases I-III. Because OR methods and SUHNPF all return $P$ true Pareto points, we compare methods on efficiency only.

Cases I and III have a 2D ($n=2$) variable domain, where SUHNPF takes $1$s per epoch, with $2$ training epochs per outer iteration (Step 5 of Alg. \ref{alg:fjc}). Both cases took $5$ iterations to converge, resulting in a total run-time of $5 \times 2 \times 1\text{s} = 10$s. Case II has a 30D ($n=30$) variable domain, where SUHNPF takes $2$s per epoch, resulting in a total run-time of $5 \times 2 \times 2\text{s} = 20$s.

Note that HNPF \citep{singh2021hybrid} was shown to scale better with variable dimension $n$ in comparison to prior OR methods (\eg see HNPF's Figure 9). Table \ref{tab:evalsOR} appears consistent with this: HNPF takes longer than OR baselines for variable dimension $n=2$ (Cases I and III) but is much faster with variable dimension $n=30$ (Case II).

\begin{table}[bht]
	\centering
	\vspace{-0.5em}
	\caption{\small Runtime (secs) for SUHNPF \vs OR methods.}
	\vspace{-1em}
 	\resizebox{\columnwidth}{!}{%
		\begin{tabular}{l|c|rrrr|r}
			\toprule
			\bf Case & $n$ & NBI & mCHIM & PK & HNPF & \bf SUHNPF \\\midrule
			Case I & 2 & 14 & 13 & 13 & 45 & 10 \\ 
			Case II & 30 & 243,344 & 67,610 & 46,808 & 3,960 & 20 \\ 
			Case III & 2 & 36 & 41 & 37 & 75 & 10 \\ \bottomrule      
	\end{tabular}}
	\vspace{-1em}
	\label{tab:evalsOR}
\end{table}

{\bf Table \ref{tab:evalsMTL}} reports the run-time, coverage, and accuracy of SUHNPF \vs MTL methods for Case~I. For Case II, the {\small \texttt{min\_norm\_solver}} \citep{sener2018multi} used by MTL methods fails, and Case III's constraints are not supported by MTL methods. Note that for fair evaluation, we only consider candidates that are produced within the feasible functional bounds for the problem. Additional run-time evaluation and discussion can be found in \textbf{Appendix I}. 

\begin{table}[bht]
	\centering
	\vspace{-0.75em}
	\caption{\small SUHNPF \vs MTL methods on Case I in finding $P=50$ Pareto points. We report the \% of feasible points each method finds and their avg/max error \vs the true front. Our error measure considers feasible points only; infeasible points are not penalized.}
	%SUHNPF's maximum error is bounded by the user-specified error tolerance parameter $\epsilon=10^{-4}$.}
	\vspace{-1em}
	\resizebox{\columnwidth}{!}{%
		\begin{tabular}{c|cccccc|c}
			\toprule
			\bf Method & LS & \!\!\!\!\!{\footnotesize MOOMTL}\!\!\!\!\! & PMTL\!\! & \!\!EPO\!\! & \!\!EPSE\!\! & \!\!PHN\!\! & \bf \!\!\!{\footnotesize SUHNPF}\!\!\! \\ \midrule 
			%Points Found & 27/50 & 16/50 & 35/50 & 34/50 & 15/50 & 40/50 & 50/50 \\
% 			Evaluations & 5K & 5K & 5K & 5K & 5K & 5K & 4,219 \\      
			Run-time (secs) & 18.1 & 19.2 & 527 & 752 & 641 & 853 & 10.0 \\   
			Points Found\!\! & 54\% & 32\% & 70\% & 68\% & 30\% & 80\% & 100\% \\
			%Evals & 5k & 5k & 5k & 5k & 5k & 5k & 5k \\
			\!\!Avg Err ($10^{-4}$)\!\! & 0.53 & 0.45 & 4.15 & 8.73 & 0.61 & 3.04 & 0.52 \\
			%\!\!Max Err ($10^{-4}$)\!\! & 1.12 & 0.98 & 126.1 & 105.8 & 0.94 & 73.8 & 0.82 \\
			\!\!Max Err ($10^{-4}$)\!\! & 1.12 & 0.98 & 126 & 106 & 0.94 & 73.8 & 0.82 \\ 
			\bottomrule   
%			Runtime (secs) & 18.1 & 19.2 & 527.4 & 752.4 & 641.4 & 852.6 & 10.0 \\\bottomrule      
	\end{tabular}}
	\label{tab:evalsMTL}
	\vspace{-1em}
\end{table}

Regarding Case I coverage and accuracy, SUHNPF returns all $50$ Pareto points; no MTL method does. For all points that are found, we measure their error \vs the true Pareto front. SUHNPF achieves the lowest maximum error, bounded by the $10^{-4}$ error tolerance parameter set in our experiments, while its average error is on par with the most accurate baselines (LS, MOOMTL, and EPSE). Specifically, the outer loop of Alg.~\ref{alg:fjc} does not converge until all the points are within the prescribed error tolerance. In contrast, PMTL, EPO, and PHN yield maximum errors roughly two orders of magnitude larger. Note also that our error metric generously scores only the points found by each method, with no penalty for missing points. Visually, SUHNPF (Fig. \ref{fig:illus-func}) clearly provides better coverage of the Pareto front via a denser, more even spread of points \vs those found by MTL methods (Fig.~\ref{fig:case1-mtl}).
%\vspace{-0.5em}

Because MTL approaches assume convexity of objective functions to generate points with uniformity on the Pareto front, and Case I includes non-convex objectives, the MTL solvers fail to find points in certain regions (see Fig. \ref{fig:case1-mtl}). While EPO's solver has convergence criteria, it still produces points that did not converge (circled in blue). This stems from EPO's assumption on KKT conditions to achieve optimality, which fails on Case I's non-convex form of $f_2$. Correspondingly PHN(-EPO), which uses EPO as its base solver, also fails to converge on certain points. In contrast, 
%SUHNPF returns all $50$ Pareto candidates (Fig. \ref{fig:illus-func}), since the cardinality of set $\mathcal{P}1$ was set to $P=50$, despite the non-convex nature of the cases. S
SUHNPF relies on the FJC to test optimality, which fully supports non-convexity in functions and constraints. %, enabling it to perform faithfully for the non-convex cases.
%\vspace{-0.5em}

Regarding Case I efficiency, SUHNPF is also fastest: nearly twice as fast as LS and MOOMTL, more than 50x faster than PMTL and EPSE, 75x faster than EPO, and 85x faster than PHN. (Because PHN-EPO calls EPO, it is necessarily slower than EPO). As \cite{navon2021learning} note, LS is much faster than EPO, so one could expect PHN-LS to be faster than PHN-EPO and slower than LS.

\vspace{-1em}
\section{SUHNPF as a HyperNetwork}
\vspace{-1em}
% \subsection{Method}
% \vspace{-1em}

Hypernetworks \citep{ha2016hypernetworks} train one neural model to generate effective weights for a second, target model. \citet{navon2021learning} and \citet{lin2021controllable} learn a neural manifold mapping MOO solutions to different target model weights, enabling the target model to achieve the desired Pareto trade-off for the MOO problem.
% \vspace{-0.5em}

Assume the target task maps from input $Y$ to output $Z$. We seek to minimize objective functions $f_1$ and $f_2$ having loss functions $\mathcal{L}_1$ and $\mathcal{L}_2$. Given correct output $Z^*$, we score $Z$ under each loss function $\mathcal{L}_i(Z,Z^*)$. A target model for this task $C_{\Theta}: Y \rightarrow Z$ with parameters $\Theta$ yields losses $\mathcal{L}_i(C_{\Theta}(Y),Z^*)$ for $i \in \{1,2\}$. The MOO problem is to find Pareto optimal $\Theta^*$ for the $f_1=\mathcal{L}_1$ \vs $f_2=\mathcal{L}_2$ trade-off. %The input domain is $\Theta \in \mathbb{R}^d$. The loss function Loss$_{Pareto}(\mathcal{L}_1,\mathcal{L}_2)$ on S
%\vspace{-0.25em}
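
Written out with the shorthand $\mathcal{L}_i(\Theta) = \mathcal{L}_i(C_{\Theta}(Y),Z^*)$, and using the standard definition of weak Pareto optimality, the problem reads
\begin{align*}
    \min_{\Theta} \; \big(\mathcal{L}_1(\Theta),\, \mathcal{L}_2(\Theta)\big),
\end{align*}
where $\Theta^*$ is weakly Pareto optimal if no $\Theta$ satisfies $\mathcal{L}_i(\Theta) < \mathcal{L}_i(\Theta^*)$ for both $i=1$ and $i=2$.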

The objectives $\mathcal{L}_{1}(\Theta), \mathcal{L}_{2}(\Theta)$ for SUHNPF are continuously differentiable functions of $\Theta$. This enables SUHNPF's guided double gradient descent strategy to efficiently search the space of target model parameters $\Theta$, mapping each to resulting loss values $(\mathcal{L}_1,\mathcal{L}_2)$. Training data resulting from this search allows SUHNPF to learn an $\epsilon$-bounded approximation $\tilde{M}(\Theta^*)$ to the weak Pareto optimal manifold. 
%\vspace{-0.5em}

As in prior Pareto Front Learning (PFL) work  \citep{navon2021learning,lin2021controllable}, this enables rapid model personalization at run-time based on user preferences.  The neural MOO %classifier loss
Loss$_{classifier}$ is a weighted linear combination of the user-prescribed objectives ($\mathcal{L}_1,\mathcal{L}_2$). The classifier loss hyper-parameter $\alpha$ (trade-off value) is computed as a post-processing step corresponding to Pareto optimal classifier weights $\Theta^*$ for rapid traversal of arbitrary $(\alpha, \Theta^*)$ solutions. See \textbf{Appendix A} for additional details of the setup of SUHNPF as a hypernetwork to optimize a target model. 
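
As a sketch of how such a trade-off value can be read off from a solution, suppose (an assumption made here for illustration; \textbf{Appendix A} gives the actual setup) that $\Theta^*$ is a stationary point of $\alpha\mathcal{L}_1 + (1-\alpha)\mathcal{L}_2$ and that the two gradients are not both zero. Then
\begin{align*}
    \alpha \nabla\mathcal{L}_1(\Theta^*) + (1-\alpha)\nabla\mathcal{L}_2(\Theta^*) &= 0 \\
    \Rightarrow \quad \alpha &= \frac{\lVert \nabla\mathcal{L}_2(\Theta^*) \rVert}{\lVert \nabla\mathcal{L}_1(\Theta^*) \rVert + \lVert \nabla\mathcal{L}_2(\Theta^*) \rVert},
\end{align*}
since the two gradients must be anti-parallel and taking norms gives $\alpha \lVert \nabla\mathcal{L}_1 \rVert = (1-\alpha) \lVert \nabla\mathcal{L}_2 \rVert$. A solution where $\mathcal{L}_1$ is nearly flat thus corresponds to $\alpha$ close to $1$.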

\begin{figure*}[ht]
    \centering
    \vspace{-1.5em}
    \begin{subfigure}{0.25\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/multimnist.pdf}
      \vspace{-2.0em}
      \caption{\scriptsize MultiMNIST}
    \end{subfigure}
    \quad
    \begin{subfigure}{0.25\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/multifashion.pdf}
     \vspace{-2.0em}
      \caption{\scriptsize MultiFashion}
    \end{subfigure}
    \quad
    \begin{subfigure}{0.25\linewidth}
      \centering
      \includegraphics[width=\linewidth]{figs/multifashionmnist.pdf}
      \vspace{-2.0em}
      \caption{\scriptsize MultiFashion+MNIST}
    \end{subfigure}
    \vspace{-1.0em}
    \caption{\small Cross-entropy loss on the test split for all three MTL datasets for SUHNPF \vs PHN. The 11 points shown for each method correspond (from left to right) to varying trade-off preferences in minimizing the combined linear loss over objectives: $\alpha f_1 + (1-\alpha) f_2$ \,for\, $\alpha \in \{1, \,0.9, \ldots,\, 0\}$. The gray dashed lines show the best loss achieved by LeNet when classifying a single image for each given task.
    %Although SUHNPF and PHN both learn Pareto manifolds over training data, the true Pareto front on the test data may naturally differ from that found on training data. Solid blue and red lines connecting test data Pareto points in Fig. \ref{fig:mnist_err} are included only for visualization purposes and should not be mistaken for the actual  manifold learned on training data. 
    }
    \vspace{-1.0em}
    \label{fig:mnist_err}
\end{figure*}

\vspace{-1em}
\subsection{Evaluation on Multi-Task Learning}
\vspace{-1em}

We evaluate on the same MTL image classification problems as in \citet{navon2021learning}. Given two underlying source datasets, MNIST \citep{lecun1998gradient} and Fashion-MNIST \citep{xiao2017fashion}, \citet{navon2021learning} report on three MTL tasks: MultiMNIST \citep{sabour2017dynamic}, Multi-Fashion, and Multi-Fashion + MNIST. In each case, two images are sampled from the source datasets and overlaid, one at the top-left corner and one at the bottom-right, with each also shifted by up to 4 pixels in each direction. The two competing tasks are to correctly classify each of the original images: Top-Left (Task 1 or $f_1$) and Bottom-Right (Task 2 or $f_2$). We use $120$K training and $20$K testing examples and directly apply existing single-task models, allocating $10\%$ of each training set for constructing validation sets, as in \citet{lin2019pareto}. \citet{navon2021learning} found that PHN-EPO (henceforth PHN) was more accurate than the other methods they compared, so we use PHN as our baseline.
%

We adopt the LeNet architecture \citep{lecun1998gradient} as the target model to learn. Following prior MTL work \citep{sener2018multi}, we treat all layers other than the last as the shared representation function and use two fully-connected layers as task-specific functions. We use cross-entropy loss with softmax activation for both task-specific loss functions. Because cross-entropy loss functions are differentiable, we can use them directly as training objectives.
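For concreteness, one way to write these losses, under notation assumed here only for illustration (shared representation $g$, task-specific heads $h_1, h_2$ producing logits $s^{(i)} \in \mathbb{R}^{K}$ over $K$ classes, and correct class index $z_i^*$ for task $i$), is
\begin{align*}
s^{(i)} &= h_i\bigl(g(Y)\bigr),\\
p^{(i)}_k &= \frac{\exp\bigl(s^{(i)}_k\bigr)}{\sum_{j=1}^{K} \exp\bigl(s^{(i)}_j\bigr)},\\
\mathcal{L}_i &= -\log p^{(i)}_{z_i^*}.
\end{align*}
The gradient with respect to the logits then takes the standard closed form $\partial \mathcal{L}_i / \partial s^{(i)}_k = p^{(i)}_k - \mathbf{1}[k = z_i^*]$, which is defined for all logit values and illustrates why no surrogate objective is needed.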

   
{\bf Results.} We see SUHNPF \vs PHN results on dataset test splits in \textbf{Fig. \ref{fig:mnist_err}}. Because SUHNPF defines a strict $\epsilon$-bound on error, we can assert its correctness on this basis alone. Visual inspection also shows that PHN returns dominated points (\eg top of MultiMNIST plot), whereas a Pareto front by definition includes only non-dominated points.  Nonetheless, we cannot directly measure error \vs a known Pareto front because real MOO problems, unlike synthetic benchmark problems, lack a simple analytical solution. Of course, we can still compare relative performance of methods. We see that {\em SUHNPF achieves strictly lower loss than PHN across all user trade-off settings of $\alpha$ on all three datasets}. 
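For reference, the dominance relation used in this visual check can be stated in its standard form for two minimization objectives (the symbols $a$ and $b$ are introduced here only for illustration): a point $a=(a_1,a_2)$ in loss space dominates $b=(b_1,b_2)$ when
\begin{equation*}
a_i \le b_i \;\; \text{for all } i \in \{1,2\} \quad \text{and} \quad a_j < b_j \;\; \text{for some } j \in \{1,2\}.
\end{equation*}
For example, $(0.3,\,0.4)$ dominates $(0.3,\,0.5)$, whereas $(0.3,\,0.5)$ and $(0.4,\,0.4)$ are mutually non-dominated. These numbers are purely illustrative and are not taken from our experiments.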

Since the minimum loss is $\min(f_1)=\min(f_2)=0$ for both objectives, the ideal point \citep{marler2004survey} for joint minimization is $(0,0)$. A simple error measure for each point found is thus its $L_2$ distance from $(0,0)$: $\sqrt{f_1^2 + f_2^2}$. \textbf{Table~\ref{tab:mtlalphas}} reports this distance for each Pareto point found at each $\alpha$ (across methods and datasets). We also report the average over the 11 settings of $\alpha$. Overall, Table \ref{tab:mtlalphas} quantifies what Fig. \ref{fig:mnist_err} depicts visually: SUHNPF lies closer to the ideal point than PHN at every setting of $\alpha$ on MultiMNIST and MultiFashion, at all but two settings on MultiFashion+MNIST ($\alpha=0.1$ and $0.2$, where PHN is closer by less than $.01$), and on average on all three datasets. 
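As a worked check of the Avg column in Table~\ref{tab:mtlalphas}, assuming it is the arithmetic mean of the 11 per-$\alpha$ distances $d_\alpha = \sqrt{f_1^2+f_2^2}$ (the symbols $d_\alpha$ and $\bar{d}$ are introduced here only for illustration), the PHN row for MultiFashion+MNIST gives
\begin{align*}
\bar{d} &= \tfrac{1}{11} \sum_{\alpha} d_\alpha \\
&= \tfrac{1}{11}\bigl(.690 + .613 + .581 + .569 + .571 + .579 \\
&\qquad\;\; + .598 + .631 + .682 + .752 + .797\bigr) \\
&= \tfrac{7.063}{11} \approx .642,
\end{align*}
which matches the reported value.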

\begin{table}[ht]
    \centering
    \vspace{-0.5em}
    \caption{\small SUHNPF \vs PHN on MTL tasks, measured by distance of each Pareto point found \vs the ideal loss point $(f_1,f_2)=(0,0)$.}
    \vspace{-1em}
    \resizebox{1.02\columnwidth}{!}{%
    \begin{tabular}{c|ccccccccccc|c} \toprule
        & \multicolumn{11}{c|}{Trade-off values $\alpha$} & \\ 
        \!\!Method\!\! & 0.0 & 0.1 & 0.2 & 0.3 & 0.4 & 0.5 & 0.6 & 0.7 & 0.8 & 0.9 & 1.0 & {\bf Avg} \\ \midrule
        \multicolumn{13}{c}{MultiMNIST}\\ 
        PHN & .621 & .585 & .539 & .504 & .486 & .478 & .483 & .494 & .508 & .521 & .527 & {\bf .522} \\
        \!\!\!\!\!{\footnotesize SUHNPF}\!\!\!\!\! & .500 & .478 & .464 & .448 & .441 & .434 & .441 & .443 & .452 & .457 & .465 & {\bf .456} \\ \midrule
        \multicolumn{13}{c}{MultiFashion}\\ 
        PHN & .877 & .872 & .853 & .813 & .784 & .773 & .779 & .797 & .816 & .826 & .829 & {\bf .819} \\
        \!\!\!\!\!{\footnotesize SUHNPF}\!\!\!\!\! & .862 & .819 & .792 & .773 & .757 & .746 & .754 & .758 & .767 & .793 & .810 & {\bf .784} \\ \midrule
        \multicolumn{13}{c}{MultiFashion+MNIST}\\ 
        PHN & .690 & .613 & .581 & .569 & .571 & .579 & .598 & .631 & .682 & .752 & .797 & {\bf .642} \\
        \!\!\!\!\!{\footnotesize SUHNPF}\!\!\!\!\! & .667 & .617 & .586 & .552 & .547 & .543 & .549 & .553 & .583 & .629 & .695 & {\bf .593} \\ \bottomrule
    \end{tabular}}
    \vspace{-1em}
    \label{tab:mtlalphas}
\end{table}




\vspace{-1.0em}
\section{Understanding SUHNPF \vs PHN} \label{sec:comp}
\vspace{-1.0em}
While both SUHNPF and PHN %\cite{navon2021learning} 
are manifold-based, % (\textbf{Fig. \ref{fig:hnpfvsphn}}), 
they differ in the type of manifold being learned. SUHNPF explicitly maintains point sets $\mathcal{P}0$ and $\mathcal{P}1$ to learn the classification boundary between Pareto \vs non-Pareto points as per the FJC. PHN instead fits a regression surface over the set of points returned by LS or EPO. Since neither LS nor EPO is guaranteed to operate under non-convex settings (Section \ref{sec:related}), PHN inherits these drawbacks by relying on them. \textbf{Table \ref{tab:hnpfvsphn}} highlights the key differences. The distinction between a diffusive full-rank indicator \vs a low-rank regressor is further discussed in \textbf{Appendix B}. 

\begin{table}[ht]
    \centering
    %\scriptsize
    %\vspace{-0.5em}
    \caption{\small SUHNPF \vs PHN for Pareto front learning.}
    \vspace{-1.0em}
    \resizebox{\columnwidth}{!}{%
    \begin{tabular}{l|c|c}
    \toprule
        \bf Criteria & \textbf{SUHNPF} & \textbf{PHN} \\
        \midrule
        Handles non-convexity\!\! & \cmark & \xmark \\
        Supports constraints & \cmark & \xmark \\
        Manifold extractor & \cmark & \cmark \\
        Nature of manifold & Diffusive full-rank indicator & Low-rank regressor\!\! \\
        Optimality criteria & Fritz-John Conditions  & EPO solver \\
         \bottomrule
    \end{tabular}}
    \label{tab:hnpfvsphn}
    \vspace{-1em}
\end{table}

\vspace{-1em}
\section{Conclusion}
\vspace{-1em}

Multi-objective optimization problems require balancing competing objectives, often under constraints. In this work, we described a novel method for {\em Pareto-front learning} (inducing the full Pareto manifold at train-time so users can pick any desired optimal trade-off point at run-time). Our SUHNPF Pareto solver is robust against non-convexity, with error bounded by a user-specified tolerance. Our key innovation over the prior HNPF \citep{singh2021hybrid} is to exploit Fritz-John Conditions for a novel guided {\em double gradient descent} strategy. The scaling property imparts significant improvement in memory and run-time \vs prior OR and Multi-Task Learning (MTL) approaches. Results across synthetic benchmarks and MTL problems in image classification show clear, consistent advantages of SUHNPF in capability (handling non-convexity and constraints), denser coverage and higher accuracy in recovering the true Pareto front, and efficiency (time and space). Beyond empirical results, our conceptual framing and review of prior work further bridge disparate lines of OR and MTL research.

Both SUHNPF and MTL methods assume differentiable evaluation metrics as training loss so optima can be found through gradient descent. However, loss can be %there is increasing interest in other evaluation measures in which the loss is a 
a non-differentiable, probabilistic measure, such as in fairness-related tasks \citep{sacharidis2019top,valdivia2021fair}. This creates a risk of metric divergence between the training loss and the evaluation measure of interest \citep{abou2012note}. Continuing development of differentiable measures can help to address this \citep{swezey2021pirank}.


{\bf Acknowledgments}. We thank the reviewers for their valuable feedback. This research was supported in part by Wipro and by Good Systems\footnote{\scriptsize\url{http://goodsystems.utexas.edu/}}, a UT Austin Grand Challenge to develop responsible AI technologies. The statements made herein are solely the opinions of the authors. % and not the views of the sponsoring agencies.
\clearpage

\balance
\bibliography{References}

\end{document}
