%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
% version; also before submission to
% see how the non-anonymous paper
% would look like

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
 % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
%\usepackage[american]{babel}
 \usepackage[british]{babel}

\usepackage[table]{xcolor}
\usepackage{bm}
\usepackage{amsfonts,amsthm, amssymb}
\usepackage{mathrsfs}  
\usepackage{array}
\usepackage{mathptmx}
\usepackage{xcolor, soul}
\usepackage{graphicx}
\usepackage{multirow}
 
\usepackage{bbm}
\usepackage{siunitx}
\usepackage{color}        
\usepackage{enumitem}
\setitemize{noitemsep,topsep=0pt,parsep=0pt,partopsep=0pt}
\usepackage[belowskip=-4pt,aboveskip=0pt]{caption}


%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

% for cross referencing the main text
% PLEASE ONLY USE xr IN THE SUPPLEMENTARY MATERIAL. 
% In the main paper, hard code any cross-reference to the supplementary material. 
\usepackage{xr} 
\externaldocument{benavoli_39}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
\newcommand{\Qset}{\mathcal{Q}}
\newcommand{\bx}{{\bf x}}



\setlength{\intextsep}{10pt plus 2pt minus 2pt}
\usepackage{hhline}
\usepackage{float} 






\newtheorem{example}{Example} 
\newtheorem{theorem}{Theorem}  
% \newtheorem{proposition}[theorem]{Lemma} 
\newtheorem{proposition}[theorem]{Proposition} 
\newtheorem{remark}[theorem]{Remark}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{conjecture}[theorem]{Conjecture}
\newtheorem{axiom}[theorem]{Axiom}



\title{Learning Choice Functions with Gaussian Processes \\(Supplementary Material)}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<alessio.benavoli@tcd.ie>?Subject=Your UAI 2023 paper}{Alessio~Benavoli}{}}
\author[2]{Dario~Azzimonti}
\author[2]{Dario~Piga}

% Add affiliations after the authors
\affil[1]{%
    School of Computer Science and Statistics\\
    Trinity College Dublin, Ireland
}
\affil[2]{%
    Dalle Molle Institute for Artificial Intelligence (IDSIA)\\
    USI/SUPSI\\
    Lugano, Switzerland
}
  
  \begin{document}
  
\onecolumn %% Turn this off if single column is desired for the supplement
\maketitle




 \section{Vectorisation of the likelihood}
 \label{app:like}
 The product in the second row in main-text-\eqref{eq:likelihoodexpanse0} is a probabilistic relaxation of main-text-\eqref{eq:likcondpareto2}. Since it always involves comparisons between pairs of objects, it can be vectorised as follows:
   \begin{equation}
  \label{eq:likelihoodexpanseapp0}
 \begin{aligned}
      &\prod_{k=1}^m\prod_{\{{\bf o},{\bf v}\} \in C_\sharp(A_k)}\left( 1-\prod_{i=1}^d \Phi\left(\frac{u_i({\bf o})-u_i({\bf v})}{\sigma}\right)-\prod_{i=1}^d \Phi\left(\frac{u_i({\bf v})-u_i({\bf o})}{\sigma}\right)\right)
   = \prod_{{\bf a}_i \in \mathcal{A}}\left( 1- \Phi_d\left(\frac{{\bf a}_i{\bf u}(X)}{\sigma}\right)-\Phi_d\left(\frac{-{\bf a}_i{\bf u}(X)}{\sigma}\right)\right),\\
   \end{aligned}
 \end{equation}
 where $\mathcal{A}$ contains one vector ${\bf a}_i \in \mathbb{R}^{1 \times t}$ for each pair of distinct objects ${\bf x}_p,{\bf x}_q \in C(A_k)$, $k=1,\dots,m$. Each ${\bf a}_i$ is a zero vector except for its $p$-th and $q$-th elements, which are equal to $1$ and $-1$, respectively, and $\Phi_d$ is the CDF of the $d$-dimensional standard multivariate Gaussian distribution.
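
 As a small illustration (the sizes are chosen here for the example only), take $t=3$ and the pair $\{{\bf x}_1,{\bf x}_3\}$. Assuming that ${\bf u}(X)\in\mathbb{R}^{t\times d}$ stacks the $d$ utilities column-wise, the corresponding vector ${\bf a}$ gives
 \[
 {\bf a}=[1,\,0,\,-1],\qquad {\bf a}\,{\bf u}(X)=\big[u_1({\bf x}_1)-u_1({\bf x}_3),\,\dots,\,u_d({\bf x}_1)-u_d({\bf x}_3)\big].
 \]
 Since the components of a standard multivariate Gaussian are independent, $\Phi_d({\bf z})=\prod_{i=1}^d\Phi(z_i)$, so $\Phi_d\left({\bf a}{\bf u}(X)/\sigma\right)$ recovers the first inner product on the left-hand side of \eqref{eq:likelihoodexpanseapp0}, and $\Phi_d\left(-{\bf a}{\bf u}(X)/\sigma\right)$ the second.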
      
 The product in the last row in main-text-\eqref{eq:likelihoodexpanse0} is a probabilistic relaxation of main-text-\eqref{eq:likcondpareto1}. It cannot be easily vectorised because $\prod_{{\bf o} \in C(A_k)}$ has a varying number of terms depending on $k$. To overcome this issue we assume, as is usually done in decision theory \citep[see for instance][Sec.~3.4.2]{parmigiani2009decision}, the existence of a worst object $\boldsymbol{\omega}\in \mathcal{X}$, that is, an object such that $u_i(\boldsymbol{\omega})=-\infty$ for each $i=1,\dots,d$.
 This allows us to compare any ${\bf v} \in R(A_k)$ with the same number of elements (either ${\bf o} \in C(A_k)$ or $\boldsymbol{\omega}$).
 
 Assume for instance that $|A_k|=5$, $R(A_k)=\{{\bf v}_1,{\bf v}_2,{\bf v}_3\}$ and $C(A_k)=\{{\bf o}_1,{\bf o}_2\}$. Then the product in the last row in main-text-\eqref{eq:likelihoodexpanse0} is
    \begin{equation}
  \label{eq:likelihoodexpanseapp1}
 \begin{aligned}
      &\prod_{{\bf o} \in C(A_k)} \left(1- \prod_{i=1}^d \Phi\left(\frac{u_i({\bf o})-u_i({\bf v}_1)}{\sigma}\right)\right)\prod_{{\bf o} \in C(A_k)} \left(1- \prod_{i=1}^d \Phi\left(\frac{u_i({\bf o})-u_i({\bf v}_2)}{\sigma}\right)\right)\prod_{{\bf o} \in C(A_k)} \left(1- \prod_{i=1}^d \Phi\left(\frac{u_i({\bf o})-u_i({\bf v}_3)}{\sigma}\right)\right).
   \end{aligned}
 \end{equation}
For each ${\bf v}_j$, the corresponding product can be written as
    \begin{equation}
  \label{eq:likelihoodexpanseapp2}
 \begin{aligned}
      &\prod_{{\bf o} \in C(A_k)} \left(1- \prod_{i=1}^d \Phi\left(\frac{u_i({\bf o})-u_i({\bf v}_j)}{\sigma}\right)\right)\\
      &= \left(1- \prod_{i=1}^d \Phi\left(\frac{u_i({\bf o}_1)-u_i({\bf v}_j)}{\sigma}\right)\right) \left(1- \prod_{i=1}^d \Phi\left(\frac{u_i({\bf o}_2)-u_i({\bf v}_j)}{\sigma}\right)\right)\\
      &= \left(1- \prod_{i=1}^d \Phi\left(\frac{u_i({\bf o}_1)-u_i({\bf v}_j)}{\sigma}\right)\right) \left(1- \prod_{i=1}^d \Phi\left(\frac{u_i({\bf o}_2)-u_i({\bf v}_j)}{\sigma}\right)\right)\left(1- \prod_{i=1}^d \Phi\left(\frac{u_i(\boldsymbol{\omega})-u_i({\bf v}_j)}{\sigma}\right)\right)^3\\
      &= \prod_{{\bf b}_i \in \mathcal{B}}\left( 1- \Phi_d\left(\frac{{\bf b}_i{\bf u}(\tilde{X})}{\sigma}\right)\right),
   \end{aligned}
 \end{equation}
 where $\boldsymbol{\omega}\in \mathcal{X}$ is such that $u_i(\boldsymbol{\omega}) = -\infty$ for each $i=1,\ldots, d$ and $\tilde{X}=[X,\boldsymbol{\omega}]^\top$. The set $\mathcal{B}$ contains one vector ${\bf b}_i \in \mathbb{R}^{1 \times (t+1)}$
 for each compared pair $\{{\bf x}_p,{\bf x}_q\}$ with ${\bf x}_p\neq {\bf x}_q$; it is a zero vector except for its $p$-th and $q$-th elements, which are equal to $1$ and $-1$, respectively.
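
 To see why padding with the worst object leaves the likelihood unchanged, note that, under the assumption $u_i(\boldsymbol{\omega})=-\infty$ and for finite $u_i({\bf v}_j)$, each added factor is equal to one:
 \[
 1-\prod_{i=1}^d \Phi\left(\frac{u_i(\boldsymbol{\omega})-u_i({\bf v}_j)}{\sigma}\right)=1-\prod_{i=1}^d\Phi(-\infty)=1-0=1 .
 \]
 In the example above, the three padding factors mean that each ${\bf v}_j$ is formally compared with $2+3=5$ elements, whatever the size of $C(A_k)$.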
 
 \section{Label switching problem}
 \label{app:LA} 
The Laplace Approximation (LA) cannot be applied due to the so-called `label switching' problem, which is caused by symmetry in the likelihood: any permutation of the labels $i=1,\dots,d$ yields the same likelihood. For this reason, the Hessian of the log-likelihood w.r.t.\ ${\bf u}(X)$ is in general an indefinite matrix and LA is not well-defined.

Consider, for instance, the case where $d=2$
and $C(A_1)=\{\bx_a,\bx_b\}$ is the only choice data we have. The likelihood is
$$
L = \left(1- \prod_{i=1}^2 \Phi\left(\frac{u_i({\bf x}_a)-u_i({\bf x}_b)}{\sigma}\right)- \prod_{i=1}^2 \Phi\left(\frac{u_i({\bf x}_b)-u_i({\bf x}_a)}{\sigma}\right)\right),
$$
which is invariant under the switching of $u_1$ and $u_2$.
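
Explicitly, writing $\Delta_i=(u_i({\bf x}_a)-u_i({\bf x}_b))/\sigma$ (a shorthand introduced here for illustration) and using $\Phi(-z)=1-\Phi(z)$, the likelihood becomes
\[
L(\Delta_1,\Delta_2)=1-\Phi(\Delta_1)\Phi(\Delta_2)-\Phi(-\Delta_1)\Phi(-\Delta_2)=\Phi(\Delta_1)+\Phi(\Delta_2)-2\,\Phi(\Delta_1)\Phi(\Delta_2)=L(\Delta_2,\Delta_1),
\]
so any configuration $(u_1,u_2)$ has a mirror configuration $(u_2,u_1)$ with exactly the same likelihood value.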

The Variational Approximation is also affected by this problem, but it is well-defined. It will simply converge to one of the symmetric (to label switching) components of the distribution.

 
 
  \section{Variational Inference}
 \label{app:VI} 
 We implemented our model using automatic differentiation in JAX \citep{jax2018github}.
 
 For the Variational Approximation (VA), we use the parametrisation of \citet{opper2009variational}, which has $2t$ parameters with $t=|X|$. In particular, we consider the covariance matrix in \citep[Equation (10)]{opper2009variational}. This means we only need $t$ parameters for the covariance matrix of the VA distribution. This is an approximation, but it allows us to reduce the computational load of ChoiceGP, which is composed of $d$ GPs.
 
Indeed, by exploiting the above parametrisation and the factorised prior main-text-\eqref{eq:prior}, our ChoiceGP model can be implemented efficiently: we need to
 store and invert $d$ kernel matrices of dimension $t \times t$.
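
 As a sketch of what such a parametrisation looks like for a single utility with kernel matrix $K\in\mathbb{R}^{t\times t}$ (the symbols $\boldsymbol{\nu}$ and $\boldsymbol{\lambda}$ are introduced here for illustration, and the exact form should be checked against \citet{opper2009variational}), the variational distribution can be written as
 \[
 q({\bf u}) = N\left({\bf u};\, K\boldsymbol{\nu},\, \left(K^{-1}+\mathrm{diag}(\boldsymbol{\lambda})\right)^{-1}\right),\qquad \boldsymbol{\nu},\boldsymbol{\lambda}\in\mathbb{R}^{t},
 \]
 which has $2t$ free parameters, compared with the $t+t(t+1)/2$ parameters of an unconstrained Gaussian over $t$ points.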
 
 
We initialise the Variational Approximation with the MAP estimate and then perform 5000 iterations.



 \section{Interpretation of the probit likelihood}
  \label{app:batch}
 There are two ways to interpret the likelihood main-text-\eqref{eq:likelcdf0}:
 \begin{enumerate}
  \item \textit{Limit of discernibility:} Alice may make mistakes when comparing two objects ${\bf x}_i,{\bf x}_j$ whose difference in utility is small (e.g., errors are inversely proportional to the difference between the two utilities $|u({\bf x}_i)-u({\bf x}_j)|$). 
  \item \textit{Noise:} the observed utility function differs from
the true utility function due to disturbances
(e.g., $o({\bf x}_i)=u({\bf x}_i)+\text{noise}$).
 \end{enumerate}
 In this second case, it is well known that
\begin{equation}
    \label{eq:likelcdf0aa}
p({\bf x}_i \succ {\bf x}_j|u)=\Phi\left(\frac{u({\bf x}_i)-u({\bf x}_j)}{\sqrt{2}\sigma}\right)=\int I_{u({\bf x}_i)+w_i-u({\bf x}_j)-w_j>0}N(w_i;0,\sigma^2)N(w_j;0,\sigma^2)dw_i dw_j.
\end{equation}
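 The identity can be checked in one step: since $w_i$ and $w_j$ are independent with variance $\sigma^2$, their difference satisfies $w_j-w_i\sim N(0,2\sigma^2)$, and therefore
 \[
 p({\bf x}_i \succ {\bf x}_j|u)=P\big(w_j-w_i< u({\bf x}_i)-u({\bf x}_j)\big)=\Phi\left(\frac{u({\bf x}_i)-u({\bf x}_j)}{\sqrt{2}\sigma}\right).
 \]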
 
 There is no single correct interpretation: it depends on the ``error-model'' we assume to account for the inconsistencies in the subject's preferences.
 
 For instance, for the computer example in Section \ref{sec:intro}, assuming that inconsistencies  are due to a Gaussian noise model does not make much sense. The features of the computer are observed exactly (without any noise). Instead, it is reasonable to assume that two different computers,  which only have slightly different characteristics, are indiscernible for Alice. For this reason, she may state inconsistent preferences when comparing them.
 
 Conversely, there may be cases where the utility is observed through a noisy measurement; in such cases, the second interpretation is more appropriate.
 
 The two interpretations differ when the same object is compared multiple times. Assuming that Alice chooses ${\bf o}$ and discards the elements in $R(A_k)$, this leads to the following batch-likelihood
 \begin{equation}
  \label{eq:like1}
   \begin{aligned}
      \int &\left(\prod_{{\bf v} \in R(A_k)}\Phi\left(\frac{u({\bf o})+w_k-u({\bf v})}{\sigma}\right)\right)N(w_k;0,\sigma^2)dw_k,\\
   \end{aligned}
 \end{equation}
 for the case $d=1$ (single utility) and noise model, which is different from the limit-of-discernibility error-model
 \begin{equation}
  \label{eq:like2}
 \begin{aligned}
       \prod_{{\bf v} \in R(A_k)}\Phi\left(\frac{u({\bf o})-u({\bf v})}{\sigma}\right).\\
   \end{aligned}
 \end{equation}
 As stated in Proposition \ref{prop:comparison} (see proof below), \eqref{eq:like2} is a lower bound for  \eqref{eq:like1}. This means that either
 \begin{itemize}
  \item assuming \eqref{eq:like1} when \eqref{eq:like2} is the true error-model, or
  \item assuming \eqref{eq:like2} when \eqref{eq:like1} is the true error-model
 \end{itemize}
 may lead to a biased posterior. We will further investigate the difference between these two models in future work.
 
 
 \begin{proposition}
 \label{prop:comparison}
  The  likelihood main-text-\eqref{eq:likelihood1} is a lower bound of the batch-preference likelihood:
    \begin{equation}
  \label{eq:likelihoodbatch00}
 \begin{aligned}
       \int &\left(\prod_{{\bf v} \in R(A_k)}\Phi\left(\frac{u({\bf o})+w_k-u({\bf v})}{\sigma}\right)\right)N(w_k;0,\sigma^2)dw_k.\\
   \end{aligned}
 \end{equation}
 \end{proposition}
% Therefore, the likelihood in \eqref{eq:likelihoodexpanse0} encompasses both preference and batch-preference models.
 

\begin{proof}
We are going to use the following result.

\textbf{Result:} If the components of ${\bf v}=[v_1,\dots,v_d]$ are independent, then for any increasing functions $h$ and $g$ of $d$ variables:
 $$
 E[h({\bf v})g({\bf v})]\geq E[h({\bf v})]E[g({\bf v})].
 $$
 The proof can be found in \citet[Sec.~9.9]{ROSS2013153}.
   
  Consider the  likelihood \eqref{eq:likelihoodbatch00} 
$$
 \begin{aligned}
     \int &\left(\prod_{{\bf v} \in R(A_k)}\Phi\left(\frac{u({\bf o})+w_k-u({\bf v})}{\sigma}\right)\right)N(w_k;0,\sigma^2)dw_k.\\
   \end{aligned}
$$    
and note that inside the parentheses we have a product of non-negative, monotone increasing functions of $w_k$. Therefore, we can exploit the above result to derive that
$$
 \begin{aligned}
      & \int \left(\prod_{{\bf v} \in R(A_k)}\Phi\left(\frac{u({\bf o})+w_k-u({\bf v})}{\sigma}\right)\right)N(w_k;0,\sigma^2)dw_k\geq  \prod_{{\bf v} \in R(A_k)}\Phi\left(\frac{u({\bf o})-u({\bf v})}{\sigma}\right). 
   \end{aligned}
$$   
 
\end{proof}

\section{ChoiceNN vs.\ ChoiceGP}
\label{sec:counterexample}

We illustrate the issue with ChoiceNN using the 1D utility function $u(x)=\cos(5x)+\exp\left(-\tfrac{x^2}{8}\right)$ with $x \in [-2.6,2.6]$,
shown in Figure \ref{fig:trueuNN}.
    
\begin{figure}[h]
\centering
 \begin{tabular}{c}
\includegraphics[height=3cm]{True_cos.pdf}
 \end{tabular}
	\caption{True utility function}
	\label{fig:trueuNN}
\end{figure}



We used $u$ to generate choice data. We sampled $150$ inputs $x_i$ at random in $[-2.6,2.6]$. We then generated $m=500$ random subsets $\{A_k\}_{k=1}^m$ of the sampled points, each of size $|A_k|=2$, and computed the corresponding choice pairs $(C(A_k),A_k)$ based on $u$.


We ran ChoiceNN with fixed latent dimension $d=1$.
The learned utility function is shown in Figure \ref{fig:appdiff} (a), which is reasonably consistent with the true utility.


We then ran ChoiceNN with fixed latent dimension $d=2$ (the true latent dimension is one) and report the estimated utility functions, for two different random initialisations of the parameters of the NN, in Figure \ref{fig:appdiff} (b) and (c), respectively. It can be seen that the model converged to two different local optima. In both cases, the learned utility functions are not Pareto-consistent with the choice data. In other words, the model is not able to find a utility representation of the choice data and, therefore, it is not able to make correct predictions.
We have tried different NN architectures (number of layers and number of nodes) as well as different values of the hyperparameters for ChoiceNN, but the issue remains.

 The disadvantage of a nonlinear parametric method, like ChoiceNN, is the fact that the latent utility functions depend nonlinearly on the parameters. Instead, in ChoiceGP, the utility functions (at the training data) are part of the variational parameters and, therefore, can be more easily optimised to satisfy the Pareto-consistency implied by the choice data.


 Figure \ref{fig:trueChoiceGP} shows the utilities learned by ChoiceGP with $d=2$. They coincide.
 This shows that ChoiceGP is able to easily identify that the true latent dimension is one.
 Moreover, the learned utility basically coincides with the true utility in Figure \ref{fig:trueuNN}
 (apart from a scaling factor, which cannot be estimated from the data).
 
 

  
\begin{figure*}[h]
	\centering
	\begin{tabular}{ccc}
		\includegraphics[height=3.8cm,trim={0.0cm 0.0cm 1.5cm 0.0cm }, clip]{ChoiceNN_dim1.pdf} &
		\includegraphics[height=3.8cm,trim={0.0cm 0.0cm 1.5cm 0.0cm }, clip]{ChoiceNN_dim2.pdf} &
		\includegraphics[height=3.8cm,trim={0.0cm 0.0cm 1.5cm 0.0cm }, clip]{ChoiceNN_dim2_b.pdf} \\
		(a) & (b) & (c)\\
	\end{tabular}  
%	\rowcolors{2}{green!6}{white}  
	\caption{Learned utilities via ChoiceNN}
	\label{fig:appdiff}
\end{figure*}

\begin{figure}[h]
\centering
 \begin{tabular}{c}
\includegraphics[height=3.8cm]{ChoiceGP_cos.pdf}
\end{tabular}
	\caption{Learned utilities via ChoiceGP}
	\label{fig:trueChoiceGP}
\end{figure}

\clearpage 

 \section{Real datasets}
 \label{app:real}
  Table \ref{tab:charac} displays the characteristics of the considered datasets. 
  
\begin{table}[H]
		\begin{center}
			{\small
			   \scalebox{0.8}{
				\begin{tabular}{lcc}
					\hline
					{\bf Dataset}  & {\bf \#Features} & {\bf \#Outputs} \\
					\hline
					AM &  6 & 3 \\
					EDM &  4 & 3 \\
					%energy  & 400 & 6 & 2 \\
					jura &  6 & 3 \\
                    slump  & 7 & 3 \\
                    vehicle  & 5 & 3 \\
					\hline
				\end{tabular}}
			}
		\end{center}   
		\caption{Characteristics of the datasets.}
		\label{tab:charac}
	\end{table}
	
The first four datasets are standard datasets used in multi-target regression. The ``vehicle'' dataset has been obtained from the Vehicle Safety model\footnote{A model that determines the thickness of five reinforced components of a vehicle's frontal frame \citep{yang2005metamodeling}.} using a Latin hypercube design of experiments. We have included the datasets in our repository together with the code to replicate the experiments.

We have implemented GPGP, PGP and PairGP in GPy \citep{gpy2014}. For ChoiceNN, we use the implementation provided by the authors \citep{pfannschmidt2020learning}.

As shown in the average accuracy table in Section \ref{sec:real}, ChoiceGP has a higher average accuracy than the other models. This claim is also supported by a statistical analysis, as shown below.

We have compared ChoiceGP against PGP, GPGP and PairGP in the majority-rule scenario using the pairwise Bayesian hierarchical hypothesis testing model \citep{corani2017statistical}. The test accounts for the correlation between the paired differences of accuracy due to the overlapping training
sets built during cross-validation. This test declares two models practically equivalent when the difference of accuracy is less than 0.01 (1\%). The interval $[-0.01, 0.01]$ thus
defines a region of practical equivalence (ROPE) for the performance of the models. For instance, for the pair (ChoiceGP, PGP), the test returns posterior samples of the probability vector $[p(\text{ChoiceGP} > \text{PGP}), p(\text{ChoiceGP} \approx \text{PGP}), p(\text{ChoiceGP} < \text{PGP})]$ and,
therefore, this posterior can be visualised in the probability simplex (Figure \ref{fig:hierarch}).
For all the pairwise comparisons, it can be seen that the vast majority of the samples are in the bottom-right region of the
triangle. This confirms that ChoiceGP is practically significantly better than the other three methods.
 Note that we have only statistically compared the methods in the majority-rule scenario, because the differences are even larger in the random scenario.
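
In symbols, and assuming that $\delta$ denotes the difference in accuracy between ChoiceGP and the competing model (a notation introduced here for illustration), the three components of the probability vector above correspond to the events
\[
\delta > 0.01,\qquad |\delta| \leq 0.01,\qquad \delta < -0.01,
\]
that is, ChoiceGP practically better, the two models practically equivalent, and ChoiceGP practically worse.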

    
\begin{figure}[h]
\centering  
 \includegraphics[width=5cm]{ChoiceGP_PGP.pdf}
  \includegraphics[width=5cm]{ChoiceGP_GPGP.pdf}
   \includegraphics[width=5cm]{ChoiceGP_PairGP.pdf}
   \caption{Posterior samples for the pairwise tests ChoiceGP vs.\ PGP, GPGP and PairGP (left to right). This confirms that ChoiceGP is practically significantly better than the other three methods.}
   \label{fig:hierarch}
\end{figure}

\bibliography{biblio}


\end{document}
