\appendix

\section{Linear Performance Metric Elicitation (LPME)}
\label{append:sec:slme}

In this section, we shed more light on the procedure from~\citep{hiranandani2019multiclass} that elicits a multiclass linear metric. We call it the Linear Performance Metric Elicitation (LPME) procedure.  As discussed in Algorithm~1, we use this as a subroutine to elicit metrics in the quadratic family. 

LPME exploits the enclosed sphere $\Scal \subset \Rcal$ for eliciting linear multiclass metrics. Let the sphere $\Scal$'s radius be $\rho>0$, and the oracle's scale invariant metric be $\phi^{\text{lin}}(\rmbf) \coloneqq \inner{\ambf}{\rmbf}$ such that $\Vert \ambf \Vert_2=1$. The oracle queries are $\tiny{\Omega\left( \rmbf_1, \rmbf_2 \,;\, \phi^{\text{lin}} \right) \coloneqq \1[\phi^{\text{lin}}(\rmbf_1) > \phi^{\text{lin}}(\rmbf_2)]}$. We first outline a trivial Lemma from~\citep{hiranandani2019multiclass}.


\blemma~\citep{hiranandani2019multiclass}
Let a normalized vector $\ambf$ with $\Vert \ambf \Vert_2 =1$ parametrize a linear metric $\phi^{\text{lin}} \coloneqq \inner{\ambf}{\rmbf}$, then the unique optimal rate $\rmbfbar$ over $\Scal$ is a rate on the boundary of $\Scal$ given by $\rmbfbar = \rho \ambf +\ombf$, where $\ombf$  is the center of  $\Scal$. 
\label{lem:spherebayes}
\elemma
\vskip -0.2cm

\addtocounter{algorithm}{1}
\balgorithm[H]
\caption{Linear Performance Metric Elicitation}
\label{alg:slme}
\small
\balgorithmic[1]
%\setcounter{AlgoLine}{1}
\STATE \textbf{Input:} Query space $\Scal \subset \Rcal$, binary-search tolerance $\epsilon > 0$, oracle $\Omega(\cdot, \cdot\,;\, \phi^{\text{lin}})$ with metric $\phi^{\text{lin}}$\\ \hfill\\
\FOR{$i = 1, 2, \cdots k$} 
\STATE Set $\ambf = \ambf' = (1/\sqrt{k}, \dots, 1/\sqrt{k})$.
\STATE Set $a'_i = -1/\sqrt{k}$.
\STATE Compute the optimal $\smbfbar^{(\ambf)}$ and $\smbfbar^{(\ambf')}$ over the sphere $\Scal$ using Lemma~\ref{lem:spherebayes}
\STATE Query $\Omega(\smbfbar^{(\ambf)}, \smbfbar^{(\ambf')} ; \phi^{\text{lin}})$\\
\ENDFOR
\COMMENT{These queries reveal the search orthant}\\ \hfill \\
\STATE Start with coordinate $j=1$.
\STATE\textbf{Initialize:} $\bm{\theta} = \bm{\theta}^{(1)}$ \hfill \COMMENT{$\bm{\theta}^{(1)}$ is a point in the search orthant.}
\FOR{$t=1, 2, \cdots, T=3(k-1)$}
\STATE Set $\bm{\theta}^{(a)} = \bm{\theta}^{(c)}=\bm{\theta}^{(d)}=\bm{\theta}^{(e)}=\bm{\theta}^{(b)} = \bm{\theta}^{(t)}$.\\
\STATE Set $\theta_j^{(a)}$ and $\theta_j^{(b)}$ to be the min and max angle, respectively, based on the search orthant
\WHILE{$\abs{\theta^{(b)}_j - \theta^{(a)}_j} > \epsilon$}
\STATE Set $\theta^{(c)}_j = \frac{3 \theta^{(a)}_j + \theta^{(b)}_j}{4}$, $\theta^{(d)}_j = \frac{\theta^{(a)}_j + \theta^{(b)}_j}{2}$, and $\theta^{(e)}_j = \frac{\theta^{(a)}_j + 3 \theta^{(b)}_j}{4}$.
\STATE Set $\rmbfbar^{(a)} = \mu(\bm{\theta}^{(a)})$ (i.e. parametrization of $\partial \Scal$). Similarly, set $\rmbfbar^{(c)}, \rmbfbar^{(d)}, \rmbfbar^{(e)}, \rmbfbar^{(b)}$
\STATE $[\theta^{(a)}_j, \theta^{(b)}_j] \leftarrow$ \emph{ShrinkInterval} ($\Omega, \rmbfbar^{(a)},\rmbfbar^{(c)},\rmbfbar^{(d)},\rmbfbar^{(e)},\rmbfbar^{(b)}$)\hfill \COMMENT{see Figure~\ref{append:fig:shrink1}}
\ENDWHILE
\STATE Set $\theta^{(d)}_j = \frac{1}{2}(\theta^{(a)}_j+\theta^{(b)}_j)$ \\
\STATE Set $\bm{\theta}^{(t)} = \bm{\theta}^{(d)}$.
\STATE Update coordinate $j \leftarrow j + 1$ cyclically. 
\ENDFOR
\STATE \textbf{Output:} $\hat a_i =\Pi_{j=1}^{i-1} \sin\theta_j^{(T)} \cos\theta_i^{(T)} \, \forall i \in [k-1],\;\hat a_k =\Pi_{j=1}^{k-1} \sin\theta_j^{(T)}$
\ealgorithmic
\ealgorithm

Lemma~\ref{lem:spherebayes} provides a one-to-one correspondence between a linear performance metric and its optimal rate over the sphere. That is, given a linear performance metric, Lemma~\ref{lem:spherebayes} yields a unique point in the query space lying on the boundary of the sphere $\partial\Scal$. The converse is also true; i.e., given a feasible rate on the boundary of the sphere $\partial\Scal$, one may recover the linear metric for which the given rate is optimal. Thus, for eliciting a linear metric,~\cite{hiranandani2019multiclass} essentially search for the optimal rate (over the sphere $\Scal$) using pairwise queries to the oracle. The optimal rate, by virtue of Lemma~\ref{lem:spherebayes}, reveals the true metric. The LPME subroutine is summarized in Algorithm~\ref{alg:slme}. Intuitively, Algorithm~\ref{alg:slme} minimizes a strongly convex function denoting the distance of query points from a supporting hyperplane whose slope is the true metric (see Figure~2(c) in~\citep{hiranandani2019multiclass}). The procedure also uses the following standard parametrization for the surface of the sphere $\partial\Scal$. 

\textbf{Parameterizing the boundary of the enclosed sphere $\partial \Scal$.} 
Let $\thetambf$ be a ($k-1$)-dimensional vector of angles. In $\thetambf$, all the angles except the primary angle are in $[0, \pi]$, and the primary angle is in $[0, 2\pi]$. A scale invariant linear performance metric with $\Vert \ambf \Vert_2=1$ can be constructed by assigning $a_i = \Pi_{j=1}^{i-1} \sin\theta_j \cos{\theta_i}$ for $i \in [k-1]$ and $a_k = \Pi_{j=1}^{k-1} \sin\theta_j$. Since we can easily compute the metric's optimal rate over $\Scal$ using Lemma~\ref{lem:spherebayes}, by varying $\thetambf$ in this procedure, we parametrize the surface of the sphere $\partial\Scal$. We denote this parametrization by $\mu(\thetambf)$, where $\mu: [0, \pi]^{k-2} \times [0, 2\pi] \to \partial \Scal$.
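
As a quick sanity check of this parametrization (an illustrative instance of our own, not part of the procedure in~\citep{hiranandani2019multiclass}), take $k=3$ with angles $\thetambf = (\theta_1, \theta_2)$. The construction above gives
\begin{align*}
a_1 = \cos\theta_1, \qquad a_2 = \sin\theta_1\cos\theta_2, \qquad a_3 = \sin\theta_1\sin\theta_2,
\end{align*}
so that $\Vert \ambf \Vert_2^2 = \cos^2\theta_1 + \sin^2\theta_1\,(\cos^2\theta_2 + \sin^2\theta_2) = 1$ for every choice of angles. By Lemma~\ref{lem:spherebayes}, the corresponding boundary point is $\mu(\thetambf) = \rho\,\ambf + \ombf$.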

\emph{Description of Algorithm~\ref{alg:slme}:} Let the oracle's metric be $\phi^{\text{lin}} = \inner{\ambf}{\rmbf}$ such that $\Vert \ambf \Vert_2=1$ (Section~\ref{ssec:mpme}). Using the parametrization $\mu(\thetambf)$ for the boundary of the sphere $\partial \Scal$,  Algorithm~\ref{alg:slme} returns an estimate $\ambfhat$ with $ \Vert \ambfhat \Vert_2=1$. Lines 2--6 recover the search orthant of the optimal rate over the sphere by posing $k$ trivial queries. Once the search orthant of the optimal rate is known, the algorithm in each iteration of the for loop (lines 9--18) updates one angle $\theta_j$, keeping the other angles fixed, 
% i.e. the $j$-th coordinate of $\theta$ 
by the \emph{ShrinkInterval} subroutine. The \emph{ShrinkInterval} subroutine (illustrated in Figure~\ref{append:fig:shrink1}) is a binary-search-based routine that shrinks the interval $[\theta^{(a)}_j, \theta^{(b)}_j]$ by half based on the oracle responses to (at most) three queries.\footnote{The description of the binary search algorithm in~\cite{hiranandani2019multiclass} always assumes getting responses to four queries that essentially correspond to the four intervals. In practice, the binary search can be adaptive and may require at most three queries, as we have discussed in this paper. The order of the queries in LPME, however, remains the same.} Note that, depending on the oracle responses, one may halve the search interval using fewer than three queries in some cases. The algorithm then cyclically updates each angle until it converges to a metric sufficiently close to the true metric. We fix the number of cycles in the coordinate-wise binary search to three. Therefore, in order to elicit a linear performance metric in $k$ dimensions, the LPME subroutine requires at most $3 \times 3 \times k \log(\pi/2\epsilon) $ queries, where the first three is the number of cycles in the coordinate-wise binary search, the second three is the (maximum) number of queries needed to halve the search interval, and $\pi/2$ is the initial search interval for the angles. 
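
To get a feel for this bound, consider an illustrative setting that is our own choice and not taken from any experiment: $k=4$ classes and tolerance $\epsilon = 0.01$, reading the logarithm as base $2$ (i.e., as the number of halvings, which is an assumption on our part). Then
\begin{align*}
\log_2\!\left(\frac{\pi}{2\epsilon}\right) = \log_2(157.08) \approx 7.3,
\end{align*}
so each angle needs at most $8$ calls to \emph{ShrinkInterval} per cycle, and the bound evaluates to at most $3 \times 3 \times 4 \times 8 = 288$ pairwise queries.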



\begin{figure}[t]
\begin{minipage}[h]{\textwidth}
  \centering \hspace{-0.5em}
  \begin{minipage}[h]{.51\textwidth}
     \centering
\fbox{\parbox[t]{1\textwidth}{\vspace{0.0cm}\small{\underline{\bf Subroutine \emph{ShrinkInterval}}\normalsize}    \\
\small
\textbf{Input:} Oracle $\Omega$ and rate profiles $\rmbfbar^{(a)},\rmbfbar^{(c)},\rmbfbar^{(d)},\rmbfbar^{(e)},\rmbfbar^{(b)}$\\
Query $\Omega(\rmbfbar^{(a)}, \rmbfbar^{(c)}\,;\,\phi^{\text{lin}})$.\\
\textbf{If} \, ($\rmbfbar^{(a)} \succ \rmbfbar^{(c)}$) Set $\theta_j^{(b)} = \theta_j^{(d)}$.\\
\textbf{else} \, Query $\Omega(\rmbfbar^{(c)}, \rmbfbar^{(d)}\,;\,\phi^{\text{lin}})$.\\
\text{\,\,\,\,} \textbf{If} \, ($\rmbfbar^{(c)} \succ \rmbfbar^{(d)}$) Set $\theta_j^{(b)} = \theta_j^{(d)}$.\\
\text{\,\,\,\,} \textbf{else} \, Query $\Omega(\rmbfbar^{(d)}, \rmbfbar^{(e)}\,;\,\phi^{\text{lin}})$.\\
\text{\,\,\,\,}\text{\,\,\,\,} \textbf{If} \, ($\rmbfbar^{(d)} \succ \rmbfbar^{(e)}$) Set $\theta_j^{(a)} = \theta_j^{(c)}$ and $\theta_j^{(b)} = \theta_j^{(e)}$.\\
\text{\,\,\,\,}\text{\,\,\,\,} \textbf{else} Set $\theta_j^{(a)} = \theta_j^{(d)}$.\\
%%%%%%% binary search part new %%%%%%%%%%%%%
\textbf{Output:} $[\theta_j^{(a)}, \theta_j^{(b)}]$.  
\normalsize \vspace{-0.07cm}
}}
  \end{minipage} \hspace{0.3em}
  \begin{minipage}[h]{.47\textwidth}
     \centering
    %  \captionof{algorithm}{LPM Elicitation}
    %  \label{alg:alg2}
    %  \begin{algorithmic}
    %   \addtocounter{algorithm}{1}
% \balgorithm[t]
% \caption{LPM Elicitation}
% \label{alg:linear}
\fbox{\parbox[t]{0.95\textwidth}{\vspace{0.1cm}
\begin{tikzpicture}[scale = 2.9]
    
% ===============================================
%  RIGHT
% ===============================================

    	\begin{scope}[shift={(-5.0,0)},scale = 0.483]\scriptsize

\def\r{0.06};
	
    % \draw[thick] (0,0) .. controls (1.8,0) and (2.6,1.6) .. (3.2,1.6) 
    % ..controls (3.6,1.6) and (3.8,0) .. (4,0);
    
    % \draw[thick] (0,0) .. controls (0.2,0) and (0.4,1.6) .. (0.8,1.6) 
    % ..controls (1.4,1.6) and (2.2,0) .. (4,0);
    
    \draw[thick] (0,0) .. controls (1.8,0) and (2.6,1.6) .. (3.2,1.6) 
    ..controls (3.6,1.6) and (3.8,0) .. (4,0);
    
    \draw[-latex] (0,-.1)--(0,2.505); 
    \draw[-latex] (-0.1,0)--(4.4,0);
    \node[left] at (0,2.35) {$\phi^{\text{lin}}$};
    \node[below right] at (4.1,0) {$\theta_j$};
   
   	% \coordinate (C1) at (0,0.00);
    % \coordinate (C2) at (1,1.56);
    % \coordinate (C3) at (2,0.76);
    % \coordinate (C4) at (3,0.18);
    % \coordinate (C5) at (4,0.00);
    
    \coordinate (C1) at (0,0.00);
    \coordinate (C2) at (1,0.18);
    \coordinate (C3) at (2,0.76);
    \coordinate (C4) at (3,1.56);
    \coordinate (C5) at (4,0.00);
    
    \node[below] at (0,0) {$\theta_j^{(a)}$};
    \node[below] at (1,0) {$\theta_j^{(c)}$};
    \node[below] at (2,0) {$\theta_j^{(d)}$};
    \node[below] at (3,0) {$\theta_j^{(e)}$};
    \node[below] at (4,0) {$\theta_j^{(b)}$};
    
    \foreach \x in {1,2,3,4} {
    	\draw (\x,-.1) -- (\x,.1);
        \draw[dotted] (\x,0) -- (\x,2);
    }
    \fill[color=black] 
    		(C1) circle (\r)
    		(C2) circle (\r)
            (C3) circle (\r)
            (C4) circle (\r)
            (C5) circle (\r);   
    
    % \draw[thick,-latex] (0.3,0.4) -- (0.7,0.8);
    % \draw[thick,-latex] (1.7,0.4) -- (1.3,0.8);
    % \draw[thick,-latex] (2.7,0.4) -- (2.3,0.8);
    % \draw[thick,-latex] (3.7,0.4) -- (3.3,0.8);
    
    \draw[very thick,-latex] (0.3,0.3) -- (0.7,0.7);
    \draw[very thick,-latex] (1.3,0.5) -- (1.7,0.9);
    \draw[very thick,-latex] (2.3,0.3) -- (2.7,0.7);
    \draw[very thick,-latex] (3.3,0.7) -- (3.7,0.3);
    
    % \draw (0,1.8)--(0,2.2) (2,1.8)--(2,2.2);
    % \draw[<->, dashed, thick] (0,2)--(2,2);
    
    \draw (2,1.8)--(2,2.2) (4,1.8)--(4,2.2);
    \draw[<->] (2,2)--(4,2);
    
    \end{scope}
    
\end{tikzpicture}
\vspace{0.05cm}}
}
  \end{minipage}
%   \captionof{figure}{Two algorithms side by side}
\end{minipage}
\caption{(Left): The \emph{ShrinkInterval} subroutine used in line 16 of Algorithm~\ref{alg:slme}. (Right): Visual illustration of the subroutine \emph{ShrinkInterval}~\citep{hiranandani2019multiclass}; \emph{ShrinkInterval} shrinks the current search interval to half based on oracle responses to at most three queries.}
\vskip -0.4cm
\label{append:fig:shrink1}
\end{figure}
% \vspace{-0.2cm}
\section{Geometry Of The Feasible Space (Proofs of Section~\ref{sec:background} and Section~\ref{sec:fairme})}
\label{append:sec:confusion}
\vskip -0.3cm

\bproof[Proof of Proposition~\ref{prop:C} and Proposition~\ref{prop:f-C}]

We prove Proposition~\ref{prop:f-C}. The proof of Proposition~\ref{prop:C} is analogous where the probability measures (corresponding to classifiers and their rates) are not conditioned on any group. 

The group-specific set of rates $\Rcal^g$ for a group $g$ has the following properties~\citep{hiranandani2020fair}:
\vspace{-0.2cm}
\bitemize[leftmargin=1em]
\item \emph{Convex}: Consider two classifiers $h_1^g, h_2^g \in \Hcal^g$ that achieve the rates $\rmbf_1^g, \rmbf_2^g \in \Rcal^g$, respectively. 
% We need to check whether or not  the convex combination $\beta \rmbf_1^g + (1-\beta)\rmbf_2^g$ is feasible, i.e., there exists some classifier which achieve this rate. 
Also, consider a classifier $h^g$ that predicts what classifier $h_1^g$  predicts with probability $\gamma$ and predicts what classifier $h_2^g$ predicts with probability $1-\gamma$. Then the rate vector of the classifier $h^g$ is: 

\vspace{-0.2cm}
\begin{align*}
R_{ij}^g(h) &= \Pmbb(h^g=i | Y=i)  \\ \nonumber
&= \Pmbb(h^g_1=i|h^g=h^g_1, Y=i)\Pmbb(h^g=h_1^g) + \Pmbb( h_2^g=i|h^g=h_2^g, Y=i)\Pmbb(h^g=h^g_2)   \\ \nonumber
&= \gamma \rmbf_{1}^g + (1-\gamma)\rmbf_{2}^g.
\end{align*}
The above equations show that the convex combination of any two rates is feasible as well, i.e., one can construct a randomized classifier that achieves the convex combination of rates. Hence, $\Rcal^g$ is convex for all $g \in [m]$. Since the intersection of convex sets is convex, the intersection set $\Rcal^1\cap \dots \cap \Rcal^m$ is convex as well.  
\item \emph{Bounded:} Since $0 \leq R^g_{ij}(h) = P[h=i|Y=i] \leq 1$ for all $i\in [k]$, $\Rcal^g \subseteq [0, 1]^k$.
\item \emph{The rates $\ombf$ and  $\embf_i$'s are always achievable:} A uniform random classifier, i.e., the classifier that, for any input, predicts each class with probability $1/k$, achieves the rate profile $\ombf$. A classifier that always predicts class $i$ achieves the rate $\embf_i$. Thus, $\ombf$ and $\embf_i$ for all $i \in [k]$ are feasible in $\Rcal^g$ for every $g \in [m]$. 
\item \emph{$\embf_i$'s are vertices:} Consider the supporting  hyperplanes with the following slope: $a_{i} > a_{j} > 0$ and $a_{l}=0$ for $l \in [k], l \neq i, j$. These hyperplanes will be supported by $\embf_i$. Thus, $\embf_i$'s  are vertices of the convex set $\Rcal^g$. From Assumption~\ref{as:sphere}, one can construct a ball around the trivial rate $\ombf$ and thus $\ombf$ lies in the interior.
\eitemize
% \vspace{-0.4cm}
The above points apply to the space of overall rates $\Rcal$ as well, thus proving Proposition~\ref{prop:C}. 
\eproof

% \vspace{-0.2cm}
\subsection{Finding the Sphere $\Scal\subset \Rcal$}
\label{append:ssec:sphere}
% \vskip -0.2cm
% \addtocounter{algorithm}{2}


In this section, we provide details regarding how a sphere $\Scal$ with sufficiently large radius $\rho$ inside the feasible region $\Rcal$ may be found (see Figure~\ref{fig:geometry}(a)). The following discussion is borrowed from~\citep{hiranandani2019multiclass} and provided here for completeness. 
\balgorithm[t]
\caption{Obtaining the sphere $\Scal \subset \Rcal$ (Figure~\ref{fig:geometry}(a)) of radius $\rho$ centered at $\ombf$}
\label{alg:sphere}
\small
\balgorithmic[1]
%\setcounter{AlgoLine}{1}
% \STATE \textbf{Input:} The center $\ombf$ of the feasible region of rates across groups.
\FOR{$j=1, 2, \cdots, k$}
\STATE Let $\mathbf \alphambf_j$ be the $j$-th standard basis vector. 
\STATE Compute the maximum constant $c_j$ such that $\ombf + c_j \mathbf \alphambf_j$ is feasible by solving~\eqref{eq:op1}.
\ENDFOR
\STATE Let $CONV$ denote the convex hull of $\{\ombf\pm c_j\mathbf \alphambf_j\}_{j=1}^{k}$. It will be centered at $\ombf$.
\STATE Compute the radius $\rho$ of the largest ball that fits in $CONV$.
\STATE\textbf{Output:} Sphere $\Scal$ with radius $\rho$ centered at $\ombf$.
\ealgorithmic
\ealgorithm

The following optimization problem is a special case of OP2 in~\citep{narasimhan2018learning}. The problem is associated with a feasibility check problem.  Given a rate profile $\rmbf_0$, the optimization routine tries to construct a classifier that achieves the rate $\rmbf_0$ within small error $\epsilon >0$. 

\begin{align}
    \min_{\rmbf \in \Rcal} \; 0 \qquad \text{s.t.} \;\; \Vert \rmbf - \rmbf_0 \Vert_2 \leq \epsilon.
     \tag{OP1}
    \label{eq:op1}
\end{align}

The above optimization problem checks the feasibility, and if there exists a solution to the above problem, then Algorithm~1 of~\citep{narasimhan2018learning} returns it. 
% The approach in~\cite{narasimhan2018learning} constructs a classifier whose rates are $\epsilon$-close to the given rate $\rmbf_0$. 
Furthermore, Algorithm~\ref{alg:sphere} computes a value of $\rho\geq \tilde{p}/k$, where $\tilde{p}$ is the radius of the largest ball centered at $\ombf$ and contained in the set $\Rcal$. Also, the approach in~\citep{narasimhan2018learning} is consistent; thus, we should get a good estimate of the sphere, provided we have a sufficiently large number of samples. The algorithm is completely offline and does not impact the oracle query complexity.

\blemma\citep{hiranandani2019multiclass}
    Let $\tilde{p}$ denote the radius of the largest ball in $\Rcal$ centered at $\ombf$. Then Algorithm~\ref{alg:sphere} returns a sphere with radius $\rho\geq \tilde{p}/k$, where $k$ is the number of classes. 
\elemma
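
One concrete way to carry out the radius computation in the penultimate step of Algorithm~\ref{alg:sphere} is sketched below; the closed form is a standard fact about cross-polytopes and is our own addition rather than a statement from~\citep{hiranandani2019multiclass}. Assuming all $c_j > 0$ and writing points as $\ombf + \mathbf{x}$, the set $CONV$ equals $\{\ombf + \mathbf{x} : \sum_{j=1}^k |x_j|/c_j \leq 1\}$, and each of its facets lies on a hyperplane $\sum_{j=1}^k \pm x_j/c_j = 1$ at distance $\big(\sum_{j=1}^k c_j^{-2}\big)^{-1/2}$ from $\ombf$. Hence
\begin{align*}
\rho = \Big(\sum_{j=1}^k c_j^{-2}\Big)^{-1/2} \;\geq\; \frac{\min_j c_j}{\sqrt{k}}.
\end{align*}
For instance, with $k=2$ and $c_1 = c_2 = 0.3$, this gives $\rho = 0.3/\sqrt{2} \approx 0.212$.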

The idea in Algorithm~\ref{alg:sphere} can be trivially extended to finding a sphere $\Sbar \subset \Rcal^1\cap\dots\cap\Rcal^m$ corresponding to Remark~\ref{as:f-sphere}.

\section{{Quadratic Performance Metric Elicitation Procedure}}
\label{append:sec:qpme}

In this section, we describe how the subroutine calls to LPME in Algorithm~1 elicit a quadratic metric in Definition~\ref{def:quadmet}. We start with the shifted metric  of Equation~\eqref{eq:loclinapx}. Also, as explained in the main paper, we may assume $d_1 \neq 0$ due to Assumption~\ref{assump:smoothness}. We can derive the following solution using any non-zero coordinate of $\dmbf$, instead of $d_1$. We can identify a non-zero coordinate using $k$ trivial queries of the form $(\varrho\alphambf_i + \ombf, \ombf), \forall i \in [k]$. 

\begin{enumerate}
    \item From line 1 of Algorithm~1, we get a local linear approximation at $\ombf$. Using Remark~\ref{rm:ratio}, we have~\eqref{eq:0col}, which is
    \begin{equation}
    d_i = \frac{f_{i0}}{f_{10}}d_1 \qquad \forall \; i \in \{2, \dots, k\}.
    \label{append:eq:0col}
            % \vspace{-6pt}
\end{equation}
\item Similarly, if we apply LPME on small balls around rate profiles $\zmbf_j$, Remark~\ref{rm:ratio} gives us:
\begin{equation}
\frac{d_i + (\rho-\varrho)B_{ij}}{d_1 + (\rho-\varrho)B_{1j}} = \frac{f_{ij}}{f_{1j}} \quad \forall \; i \in \{2, \ldots, k\},\; j \leq i.
\label{append:eq:jcol}
\end{equation}

\begin{align*}
    &\implies d_i + (\rho-\varrho)B_{ij} = \frac{f_{ij}}{f_{1j}}(d_1 + (\rho-\varrho)B_{1j})\\
    &\implies (\rho-\varrho)B_{ij} = \frac{f_{ij}}{f_{1j}}(d_1 + (\rho-\varrho)B_{j1}) - d_i \\
    &\implies (\rho-\varrho)B_{ij} = \frac{f_{ij}}{f_{1j}}(d_1 +  \frac{f_{j1}}{f_{11}} (d_1 + (\rho - \varrho)B_{11}) - d_j ) - \frac{f_{i0}}{f_{10}}d_1\\
    &\implies (\rho-\varrho)B_{ij} = \left(\frac{f_{ij}}{f_{1j}} - \frac{f_{i0}}{f_{10}} + \frac{f_{ij}}{f_{1j}} \left(\frac{f_{j1}}{f_{11}} - \frac{f_{j0}}{f_{10}}\right)\right)  d_1 + (\rho-\varrho)  \frac{f_{ij}}{f_{1j}}\frac{f_{j1}}{f_{11}}B_{11}, \numberthis \label{append:eq:solvemidssystem}
\end{align*}
where we have used that the matrix $\Bmbf$ is symmetric in the second step, and~\eqref{append:eq:0col} in the last two steps. We can represent each element in terms of $B_{11}$ and $d_1$. So, a relation between $B_{11}$ and $d_1$ may allow us to represent each element of $\ambf$ and $\Bmbf$ in terms of $d_1$.

\item Therefore, by applying LPME on small balls around rate profiles $-\zmbf_1$, Remark~\ref{rm:ratio} gives us~\eqref{eq:negativegrad}:

\begin{equation}
    \frac{d_2-(\rho - \varrho)B_{21}}{d_1-(\rho - \varrho)B_{11}} = \frac{f_{21}^-}{f_{11}^-}.
    \label{append:eq:negativegrad}
            % \vspace{-3pt}
\end{equation}

\item Using~\eqref{append:eq:jcol} and~\eqref{append:eq:negativegrad}, we have:

\begin{align*}
    (\rho - \varrho)B_{11} = \frac{ \frac{f_{21}^-}{f_{11}^-} + \frac{f_{21}}{f_{11}} - 2\frac{f_{20}}{f_{10}}  }{ \frac{f_{21}^-}{f_{11}^-} - \frac{f_{21}}{f_{11}} }d_{1}.
    \numberthis \label{append:eq:firsttermB}
\end{align*}
Putting~\eqref{append:eq:firsttermB} in~\eqref{append:eq:solvemidssystem}, we get:
\begin{align*}
    B_{ij} &=  \left[\frac{f_{ij}}{f_{1j}}\left(1 + \frac{f_{j1}}{f_{11}} \right) - \frac{f_{ij}}{f_{1j}}\frac{f_{j0}}{f_{10}} - \frac{f_{i0}}{f_{10}} +  \frac{f_{ij}}{f_{1j}}\frac{f_{j1}}{f_{11}} \frac{ \frac{f_{21}^-}{f_{11}^-} + \frac{f_{21}}{f_{11}} - 2\frac{f_{20}}{f_{10}}  }{ \frac{f_{21}^-}{f_{11}^-} - \frac{f_{21}}{f_{11}}  }\right]d_1 \\
    &= \left(F_{i,1,j} (1 + F_{j,1,1}) - F_{i,1,j} F_{j,1,0}  - F_{i,1,0} + F_{i,1,j}F_{j,1,1}\frac{F^-_{2,1,1} + F_{2,1,1} - 2F_{2,1,0}}{F^-_{2,1,1} - F_{2,1,1}}\right)d_1,
    \numberthis \label{append:eq:poly2elicitamatfinal}
\end{align*}
where
$F_{i,j,l} = \frac{f_{il}}{f_{jl}}$ and $F^-_{i,j,l} = \frac{f^-_{il}}{f^-_{jl}}$. As $\ambf = \dmbf + \Bmbf \ombf$, we can represent each element of $\ambf$ and $\Bmbf$ in terms of $d_1$ using~\eqref{append:eq:0col} and~\eqref{append:eq:poly2elicitamatfinal}. We can then use the normalization condition $\Vert \ambf\Vert_2^2 + \Vert \Bmbf \Vert_F^2 = 1$ to get estimates of $\ambf, \Bmbf$ which are independent of $d_1$. 
\end{enumerate}

This completes the derivation of the solution from QPME (Section~\ref{sec:quadme}).

\section{{Fair (Quadratic) Performance Metric Elicitation Procedure}}
\label{append:sec:fpme}

\begin{figure}[H]
\centering
\fbox{\parbox[t]{0.60\textwidth}{\small{\underline{\bf Algorithm~4: FPM Elicitation}\normalsize}    \\
\textbf{Input:} Query set $\Scal'$, search tolerance $\epsilon > 0$, oracle $\Omega'$ \\
% 1: \text{ \ }$\ambfhat \leftarrow$ LPME$(\Scal, \epsilon, \Omega(\cdot, \cdot \,;\, \Psi\circ\nu(\cdot))$\\
1: \text{ \ }Let $\Lcal \leftarrow \varnothing$ \\
2: \text{ \ }\textbf{For} \, $\sigma \in \Mcal$ \textbf{do}\\
3: \text{ \ \ \ } $\bm{\beta}^{\sigma}\leftarrow$ QPME$(\Scal', \epsilon, \Omega')$\\
4: \text{ \ \ \ } Let $\ell^\sigma$ be Eq.~\eqref{append:eq:fairBij}, extend $\Lcal \leftarrow \Lcal \cup \{\ell^\sigma\}$\\
5:  \text{ \ }$\hat{\Bmbb} \leftarrow $ normalized solution from~\eqref{append:eq:fairbsol} using $\Lcal$\\
6: \text{ \ }$\hat \lambda \leftarrow$ trace back normalized solution from~\eqref{append:eq:fairBij} for any $\sigma$\\
\textbf{Output:} $\ambfhat, \hat{\Bmbb}, \hat \lambda$ 
\normalsize \vspace{-0.25em}
}}
\label{alg:f-linear}
\end{figure}

We first discuss eliciting the fair (quadratic) metric in Definition~\ref{def:f-linmetric}, where all the parameters are unknown. We then provide an alternate procedure for eliciting just the trade-off parameter $\lambda$ when the predictive performance and fairness violation coefficients are known. The latter is a separate application as discussed in~\citep{zhang2020joint}. However, unlike~\cite{zhang2020joint}, instead of ratio queries, we use simpler pairwise comparison queries.

In this section, we work with any number of groups $m\geq 2$. The idea, however, remains the same as described in the main paper for $m=2$ groups. We specifically select queries from the sphere $\overline{\Scal} \subset \Rcal^1 \cap \dots \cap\Rcal^m$, which is common to all the group-specific feasible regions of rates, so as to reduce the problem to multiple instances of the proposed QPME procedure of Section~\ref{sec:quadme}. 

Suppose that the oracle's fair performance metric is $\phi^{\text{fair}}$ parametrized by $(\ambf, \Bmbb, \lambda)$  as in Definition~\ref{def:f-linmetric}. The overall fair metric elicitation framework is summarized in Algorithm~4. The framework exploits the sphere $\overline{\Scal} \subset \Rcal^1 \cap \dots\cap\Rcal^m$ and uses the QPME procedure (Algorithm~1) as a subroutine multiple times. 

Let us consider a non-empty set of sets $\Mcal \subset 2^{[m]} \setminus \{\varnothing, [m]\}$. We will later discuss how to choose such a set $\Mcal$. 
% We will later discuss how to choose $\Mcal$ for efficient elicitation. 
We partition the set of groups $[m]$ into two sets of groups. Let $\sigma \in \Mcal$ and $[m] \setminus \sigma$ be one such partition of the $m$ groups defined by the set of groups $\sigma$. For example, when $m=3$, one may choose the set of groups $\sigma = \{1, 2\}$. 

Now, consider a sphere $\Scal'$ whose elements $\rmbf^{1:m} \in \Scal'$ are given by:
\vspace{-0.15cm}
\begin{equation}
    \rmbf^g = \begin{cases}
    \smbf & \text{if } g \in \sigma\\
    \ombf & \text{o.w. }
    \end{cases}
\label{eq:parvarphi}
\end{equation}
\vskip -0.25cm
This is an extension of the sphere $\Scal'$ defined in the main paper to the $m>2$ case. Elements in $\Scal'$ assign the rate profile $\smbf \in \overline{\Scal}$ to the groups in $\sigma$ and the trivial rate profile $\ombf$ to the remaining groups in $[m] \setminus \sigma$. 
Analogously, the modified oracle is $\Omega'(\rmbf_1, \rmbf_2) = \Omega((\rmbf^{1:m}_1), (\rmbf^{1:m}_2))$, where $\rmbf^{1:m}_1, \rmbf^{1:m}_2$ are elements of the sphere $\Scal'$ above. 
Thus, for elements in $\Scal'$, the metric in Definition~\ref{def:f-linmetric} reduces to:

\begin{align*}
\phi^{\text{fair}}(\rmbf^{1:m} \in \Scal' \,;\, \ambf, \Bmbb, \lambda) =  
(1-\lambda)\inner{-\ambf \odot \taumbf^\sigma}{\smbf - \ombf} + \lambda \frac{1}{2} (\smbf - \ombf)^T\Wmbf^\sigma(\smbf - \ombf) + c^\sigma 
% &= \overline{\phi}(\smbf; \dmbf, \Bmbf) + c^\sigma
% &\quad =   
% (1-\lambda)\inner{\ambf \odot \taumbf^\sigma}{\smbf_\ombf} + \lambda \frac{1}{2} \smbf_\ombf^T\Wmbf^\sigma\smbf_\ombf + c^\sigma,
\numberthis \label{eq:metricbrich}
\end{align*}

where $\taumbf^\sigma = \sum_{g\in \sigma}\taumbf^g$, $\Wmbf^\sigma = \sum_{u \in \sigma, v \in [m]\setminus\sigma} \Bmbf^{uv}$, and $c^\sigma$ is a constant not affecting the oracle responses. 


The above metric is a particular instance of $\bphi(\smbf; \dmbf, \Bmbf)$ in~\eqref{eq:quadmetshift} with $\dmbf \coloneqq -(1-\lambda)\ambf\odot\taumbf^\sigma$ and $\Bmbf \coloneqq \lambda \Wmbf^\sigma$; thus, we apply QPME procedure as a subroutine in  Algorithm~4 to elicit the metric in~\eqref{eq:metricbrich}. 

The only change needed to be made to the algorithm is in line 5, where 
we need to take into account the changed relationship between $\dmbf$ and $\ambf$, and need to separately (not jointly) normalize the linear and quadratic coefficients. With this change, the output of the algorithm directly gives us the required estimates. 
% estimates $\ambfhat,\, \Bmbfhat^{12}$ for the predictive and fairness coefficients. 
Specifically, we have from step 1 of Algorithm~1 and \eqref{eq:0col} 
an estimate 
\begin{equation}
 \frac{{d}_{i}}{{d}_{1}} = \frac{\tau^\sigma_{i} {a}_i}{\tau^\sigma_{1} {a}_1} = \frac{f_{i0}}{f_{10}} \implies    {a}_i = \frac{f_{i0}}{f_{10}} \frac{\tau^\sigma_{1}}{\tau^\sigma_{i}} {a}_1.
 \label{append:eq:fair0col}
\end{equation}

Using the normalization condition (i.e., $\Vert \ambf \Vert_2 = 1$), we directly get an estimate $\ambfhat$ for the linear coefficients. Similarly, steps 2--4 of Algorithm~1 and \eqref{eq:poly2elicitamatfinal} give us:$\hat{B}_{ij} = $
\begin{align*}
    \sum_{u \in \sigma, v \in [m]\setminus\sigma} \tilde B^{uv}_{ij} &= \Big(F_{i,1,j}^\sigma (1 + F_{j,1,1}^\sigma) - F_{i,1,j}^\sigma F_{j,1,0}^\sigma
    - F_{i,1,0}^\sigma + F_{i,1,j}^\sigma F_{j,1,1}^\sigma\textstyle\frac{F^{-, \sigma}_{2,1,1} + F_{2,1,1}^\sigma - 2F_{2,1,0}^\sigma}{F^{-, \sigma}_{2,1,1} - F_{2,1,1}^\sigma}\Big)\tau^\sigma_1\hat{a}_1 \\
    &= \beta^\sigma,  \numberthis \label{append:eq:fairBij}
\end{align*}
where the above solution is similar to the two group case, but here it is corresponding to a partition of groups defined by $\sigma$, and $\tilde \Bmbf^{uv} \coloneqq \lambda\Bmbf^{uv}/(1 - \lambda)$ is a scaled version of the true (unknown) $\Bmbf^{uv}$. Let equation~\eqref{append:eq:fairBij} be denoted by $\ell^\sigma$. Also, let the right hand side term of~\eqref{append:eq:fairBij} be denoted by $\beta^\sigma$. 

Since we want to elicit $m\choose 2$ fairness violation weight matrices in $\Bmbb$, we require $m\choose 2$ ways of partitioning the groups into %protected and unprotected groups
two sets so that we construct $m\choose 2$ independent matrix equations similar to~\eqref{append:eq:fairBij}. 
% This is easily achievable by choosing $m \choose 2$ $\sigma$'s so that we get $m \choose 2$ set of unique equations like~\eqref{eq:midsolb}. 
Let $\Mcal$ be this set of sets. 
Thus, running over all the choices of sets of groups $\sigma \in \Mcal$ provides the system of equations $\Lcal \coloneqq \cup_{\sigma \in \Mcal} \ell^\sigma$ (line 4 in Algorithm~4), which is:
\begin{equation}
    \left[ \begin{array}{cccc} \Xi & 0 & \dots & 0\\
    0 & \Xi & \dots & 0 \\
    \dots & \dots & \dots & \dots \\
    0 & 0 & \dots & \Xi 
    \end{array}\right] \left[ \begin{array}{c} \tilde \bmbf_{(11)} \\
    \tilde \bmbf_{(12)} \\
    \dots \\
    \tilde \bmbf_{(kk)}
    \end{array}\right] = \left[ \begin{array}{c} \bm\beta_{(11)} \\
    \bm\beta_{(12)} \\
    \dots \\
    \bm\beta_{(kk)}
    \end{array}\right],
    \label{append:btilde}
\end{equation}
where $\tilde \bmbf_{(ij)} = (\tilde b_{ij}^1,\tilde b_{ij}^2, \cdots, \tilde b_{ij}^{m\choose 2})$ and $\bm\beta_{(ij)} = (\beta_{ij}^1, \beta_{ij}^2, \cdots, \beta_{ij}^{m\choose 2})$ are vectorized versions of the $ij$-th entry across groups for $i, j \in [k]$, and $\Xi \in \{0,1\}^{{m\choose 2}\times {m\choose 2}}$ is a binary full-rank matrix denoting membership of groups in the set $\sigma$. For example, when one chooses $\Mcal = \{ \{1,2\}, \{1,3\}, \{2,3\}\}$ for $m=3$, $\Xi$ is given by:
$$\Xi = \left[ \begin{array}{ccc} 0 & 1 & 1\\
    1 & 0 & 1\\
    1 & 1 & 0\\
\end{array}\right].$$
One may choose any set of sets $\Mcal$ that allows the resulting group membership matrix $\Xi$ to be  non-singular. The solution of the system of equations $\Lcal$ is:
\begin{equation}
    \left[ \begin{array}{c} \tilde \bmbf_{(11)} \\
    \tilde \bmbf_{(12)} \\
    \dots \\
    \tilde \bmbf_{(kk)}
    \end{array}\right] = \left[ \begin{array}{cccc} \Xi & 0 & \dots & 0\\
    0 & \Xi & \dots & 0 \\
    \dots & \dots & \dots & \dots \\
    0 & 0 & \dots & \Xi
    \end{array}\right]^{-1} \left[ \begin{array}{c} \bm\beta_{(11)} \\
    \bm\beta_{(12)} \\
    \dots \\
    \bm\beta_{(kk)}
    \end{array}\right].
    \label{append:eq:sol-b}
\end{equation}
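To make the solution concrete, consider again the $m=3$ example with $\Mcal = \{ \{1,2\}, \{1,3\}, \{2,3\}\}$. The following is an illustrative sketch under two conventions of our own: the group pairs are ordered as $(1,2), (1,3), (2,3)$, and $\tilde \Bmbf^{vu} = \tilde \Bmbf^{uv}$; the superscripts used below are introduced only for this example. For $\sigma = \{1,2\}$, the left hand side of~\eqref{append:eq:fairBij} is $\tilde B^{13}_{ij} + \tilde B^{23}_{ij}$, which is the first row $(0, 1, 1)$ of $\Xi$; the other two rows follow in the same way. Since
\begin{align*}
\Xi^{-1} = \frac{1}{2}\left[ \begin{array}{ccc} -1 & 1 & 1\\
    1 & -1 & 1\\
    1 & 1 & -1\\
\end{array}\right],
\end{align*}
each block of the solution reads, entrywise,
\begin{align*}
\tilde B^{12}_{ij} &= \tfrac{1}{2}\big(-\beta^{\{1,2\}}_{ij} + \beta^{\{1,3\}}_{ij} + \beta^{\{2,3\}}_{ij}\big), \\
\tilde B^{13}_{ij} &= \tfrac{1}{2}\big(\beta^{\{1,2\}}_{ij} - \beta^{\{1,3\}}_{ij} + \beta^{\{2,3\}}_{ij}\big), \\
\tilde B^{23}_{ij} &= \tfrac{1}{2}\big(\beta^{\{1,2\}}_{ij} + \beta^{\{1,3\}}_{ij} - \beta^{\{2,3\}}_{ij}\big).
\end{align*}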
% The vectors on the left and the right hand side of the above equation are of sizes $\frac{m(m-1)}{2}\times q$ and the matrix is of size $[\frac{m(m-1)}{2}\times q]^2$.
When all $\tilde \Bmbf^{uv}$'s are normalized, we have the estimated fairness violation weight matrices as:
\begin{equation}
\Bmbfhat^{uv} = \frac{\tilde \Bmbf^{uv}}{\frac{1}{2}\sum_{u,v=1, v > u}^m \Vert \tilde \Bmbf^{uv} \Vert_F} \quad \text{for} \quad u,v \in [m], v>u.
 \label{append:eq:fairbsol}
\end{equation} 
Due to the above normalization, the solution is again independent of the true trade-off $\lambda$.

Given estimates $\hat{B}^{uv}_{ij}$ and $\ahat_1$,  we can now additionally estimate the trade-off parameter $\hat{\lambda}$ from  $\ell^\sigma$~\eqref{append:eq:fairBij} for any $\sigma \in \Mcal$. This completes the fair (quadratic) metric elicitation procedure. 

\subsection{Eliciting Trade-off $\lambda$ when (linear) predictive performance and (quadratic) fairness violation coefficients are known}
\label{append:ssec:lambda}

We  now provide an alternate binary search based method similar to~\cite{hiranandani2020fair} for eliciting the trade-off parameter $\lambda$ when the linear predictive and quadratic fairness coefficients are already known. This is along similar lines to the application considered by~\cite{zhang2020joint}, but unlike them, instead of ratio queries, we require simpler pairwise queries. 

Here, the key insight is to approximate the non-linearity posed by the fairness violation in Definition~\ref{def:f-linmetric}, which then reduces the problem to a  one-dimensional binary search. We have:
\begin{align*}
&\phi^{\text{fair}}(\tupr \,;\, \ambf, \Bmbb, \lambda) \,\coloneqq\,  (1-\lambda)\inner{\ambf}{\bm{1} - \rmbf} + \lambda \frac{1}{2} \left(\sum\nolimits_{u,v=1,v>u}^{m} (\rmbf^u - \rmbf^v)^T\mathbbm{\Bmbf}^{uv}(\rmbf^{u} - \rmbf^v)\right). \numberthis \label{append:eq:fairmetshifted}
\end{align*}
To this end, we define a new sphere $\Scal' = \{ (\smbf,\ombf, \dots, \ombf)  | \smbf \in \overline{\Scal}\}$. The set $\Scal'$ consists of the rate profiles whose first group achieves rates $\smbf \in \overline{\Scal}$ and whose remaining groups achieve the trivial rate $\ombf$ (corresponding to the uniform random classifier). For any element in $\Scal'$, the associated discrepancy terms $(\rmbf^u - \rmbf^v) = 0$ for $u,v \neq 1$. 
% For the remaining discrepancy terms, the sign of the absolute function in~\eqref{eq:linmetric} is known since $\smbf^+ \geq \ombf$. 
Thus for elements in $\Scal'$, the metric in Definition~\ref{def:f-linmetric} reduces to:
\vspace{-0.2cm}
\begin{align*}
        \phi^{\text{fair}}((\smbf, \ombf, \dots, \ombf) \,;\,\ambf, \Bmbb, \lambda) =& (1-\lambda)\inner{-\taumbf^1\odot\ambf}{\smbf - \ombf} + 
        \lambda\frac{1}{2} (\smbf - \ombf)^T\sum_{v=2}^m \Bmbf^{1v} (\smbf - \ombf) + c.
         \numberthis \label{append:eq:metriclambda}
\end{align*}
\vskip -0.3cm
Additionally, we consider a small sphere $\overline{\Scal}'_{\zmbf_1}$, where $\zmbf_1 \coloneqq (\rho - \varrho)\bm{\alpha}_1 + \ombf$, similar to what is shown in Figure~\ref{fig:geometry}(a). We may approximate the quadratic term on the right hand side above by its first order Taylor approximation as follows:
\begin{align*}
        \phi^{\text{fair}}( (\smbf, \ombf, \dots, \ombf) ;\ambf, \Bmbb, \lambda) &\approx  \phi^{\text{fair, apx}}( (\smbf, \ombf, \dots, \ombf) ;\ambf, \Bmbb, \lambda) \\ &= \inner{-(1-\lambda)\taumbf^1\odot\ambf + \lambda \sum_{v=2}^m \Bmbf^{1v}(\zmbf_1 - \ombf)}{\smbf}
         \numberthis \label{eq:metriclambdalinear}
\end{align*}
for $\smbf$ in a small neighbourhood around the rate profile $\zmbf_1$. Since the metric is essentially linear in $\smbf$, the following lemma from~\citep{hiranandani2020fair} shows that the metric in~\eqref{eq:metriclambdalinear} is quasiconcave in $\lambda$. 

\vspace{-0.1cm}
\blemma
Under the regularity assumption that $\inner{-\taumbf^1\odot\ambf}{\sum_{v=2}^m \Bmbf^{1v}(\zmbf_1 - \ombf)}\neq 1$, the function
% \vspace{-0.1cm}
\begin{equation}
\vartheta(\lambda) \coloneqq \max_{\smbf \in \overline{\Scal}'_{\zmbf_1}} \phi^{\text{fair, apx}}( (\smbf, \ombf, \dots, \ombf) ;\ambf, \Bmbb, \lambda)
\label{append:eq:vartheta}
\end{equation}
% \vskip -0.3cm
is strictly quasiconcave (and therefore unimodal) in $\lambda$.
\label{lm:lambda}
\elemma
\vskip -0.2cm
The unimodality of $\vartheta(\lambda)$ allows us to perform the one-dimensional binary search in Algorithm~\ref{alg:lambda} using the query space $\overline{\Scal}'_{\zmbf_1}$, tolerance $\epsilon$, and the oracle $\Omega$. The binary search algorithm is the same as Algorithm~4 in~\citep{hiranandani2020fair} and is provided here for completeness. 

\addtocounter{algorithm}{1}
\balgorithm[t]
\caption{Eliciting the trade-off $\lambda$ when predictive performance and fairness violation are known}
\label{alg:lambda}
\small
\balgorithmic[1]
%\setcounter{AlgoLine}{1}
\STATE \textbf{Input:} Query space $\overline{\Scal}'_{\zmbf_1}$, binary-search tolerance $\epsilon > 0$, oracle $\Omega$
\STATE \textbf{Initialize:} $\lambda^{(a)} = 0$, $\lambda^{(b)} = 1$.
\WHILE{$\abs{\lambda^{(b)} - \lambda^{(a)}} > \epsilon$} 
\STATE Set $\lambda^{(c)} = \frac{3 \lambda^{(a)} + \lambda^{(b)}}{4}$, $\lambda^{(d)} = \frac{\lambda^{(a)} + \lambda^{(b)}}{2}$, $\lambda^{(e)} = \frac{\lambda^{(a)} + 3 \lambda^{(b)}}{4}$
\STATE Set $\smbf^{(a)} = \displaystyle\argmax_{\smbf \in\overline{\Scal}'_{\zmbf_1}} \inner{-(1-\lambda^{(a)})\taumbf^1\odot\ambfhat + \lambda^{(a)} \sum_{v=2}^m \Bmbfhat^{1v}(\zmbf_1 - \ombf)}{\smbf}$ using Lemma~\ref{lem:spherebayes}
% \STATE Set $\smbf_a^- = \displaystyle\argmax_{\smbf^+\in\Scal_\varrho^+} \inner{(1-\lambda_a)\taumbf^1\odot\ambfhat + \lambda_a \sum_{v=2}^m \bmbfhat^{1v}}{\smbf^+}$
\STATE Similarly, set $\smbf^{(c)}$, $\smbf^{(d)}$, $\smbf^{(e)}$, $\smbf^{(b)}$.
% \STATE Query  $\Omega(\smbf^{(c)}, \smbf^{(a)})$,  $\Omega(\smbf^{(d)}, \smbf^{(c)})$,  $\Omega(\smbf^{(e)}, \smbf^{(d)})$, and  $\Omega(\smbf^{(b)}, \smbf^{(e)})$.\\
\STATE $[\lambda^{(a)}, \lambda^{(b)}] \leftarrow$ \emph{ShrinkInterval} $(\Omega, \smbf^{(a)}, \smbf^{(c)}, \smbf^{(d)}, \smbf^{(e)}, \smbf^{(b)})$ using a subroutine analogous to the routine shown in Figure~\ref{append:fig:shrink1}.
\ENDWHILE
% \text{ \ \ \ } Set $\lambda_d = \frac{\lambda_a+\lambda_b}{2}$. Then set $\ahat_i = \frac{1-m^d}{m^d}\hat a_1$.\\
\STATE \textbf{Output:} $\hat\lambda = \frac{\lambda^{(a)}+\lambda^{(b)}}{2}$. 
\ealgorithmic
\ealgorithm
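
As a rough guide to the cost of Algorithm~\ref{alg:lambda}, the following sketch counts its iterations. It assumes (this is not stated explicitly above) that each call to \emph{ShrinkInterval} retains a sub-interval of at most half the current length, as is standard for a five-point search on a unimodal function. Starting from $[\lambda^{(a)}, \lambda^{(b)}] = [0,1]$, after $t$ iterations we would then have
\begin{equation*}
\abs{\lambda^{(b)} - \lambda^{(a)}} \leq 2^{-t}, \qquad \text{so the loop terminates after at most} \quad t = \left\lceil \log_2(1/\epsilon) \right\rceil \quad \text{iterations.}
\end{equation*}
For instance, under this assumption a tolerance of $\epsilon = 10^{-2}$ needs at most $\lceil \log_2 100 \rceil = 7$ iterations, since $2^{-7} \approx 0.0078 \leq 10^{-2}$, and the output $\hat\lambda$ (the midpoint of the final interval) lies within $\epsilon/2$ of every point of that interval.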
% \vspace{-0.3cm}

\section{Extension to Eliciting General Quadratic Metrics}
\label{append:generalquad}

In this section, we discuss how the entire setup of the main paper, including the proposed procedure and its guarantees, which is described in terms of the \emph{diagonal} entries of the predictive rate matrix, extends to a setup where the metric is defined over all the entries of the rate matrix. For this section, we use additional notation: for a matrix $\Ambf$, let $\offdiag(\Ambf)$ return the vector of off-diagonal elements of $\Ambf$.  

Just like the main paper, we consider a $k$-class classification setting with $X \in \Xcal$ and $Y \in [k]$ denoting the input and output random variables, respectively. We assume access to an $n$-sized sample $\{(\xmbf, y)_i\}_{i=1}^n$ generated \emph{iid} from a distribution $ \Pmbb(X, Y)$. We work with randomized classifiers $h : \Xcal \rightarrow \Delta_k$ that for any $\xmbf$ gives a distribution $h(\xmbf)$ over the $k$ classes and use 
 $\Hcal = \{h : \Xcal \rightarrow \Delta_k\}$
 to denote the set of all classifiers. 

\emph{General Predictive rates:} 
We define the predictive rate matrix for a classifier $h$ by $\Rmbf(h, \Pmbb) \in \Rmbb^{k \times k}$, where the $ij$-th entry is the fraction of label-$i$ examples for which the randomized classifier $h$ predicts $j$:
\vspace{0.15cm}
\begin{align}
	R_{ij}(h, \Pmbb) \coloneqq \Pmbb(h(X) = j | Y = i)  \quad \text{for} \; i, j \in [k],
	\label{eq:generalcomponents}
\end{align}
\vskip 0.15cm
where the probability is over the draw of $(X, Y) \sim \Pmbb$ and the randomness in $h$. 

Notice that each diagonal entry
of $\Rmbf$
% of this matrix 
can be written in terms of its off-diagonal elements: %predictive rates satisfy the following useful decomposition:
\vspace{0.15cm}
% \begin{equation}
    $${R_{ii}(h, \Pmbb) = 1 - \sum\nolimits_{j=1,j\neq i}^k R_{ij}(h, \Pmbb).}$$
    % \label{eq:decomp}
% \end{equation}
\vskip 0.15cm
% Using this decomposition, 
Thus, we can represent a rate matrix by its $q \coloneqq (k^2 - k)$ off-diagonal elements, write it as a vector $\rmbf(h, \Pmbb) = \offdiag(\Rmbf(h, \Pmbb))$, and interchangeably refer to it as the \emph{`vector of general rates'} or \emph{`off-diagonal rates'}. To distinguish these from the rates considered in the main paper, we call the rate entries corresponding to the diagonal of the rate matrix, i.e., $\Pmbb(h(X)=i|Y=i)$ as discussed in Equation~\eqref{eq:components}, the \emph{`diagonal rates'}. 

\emph{Feasible general rates:} The set of all feasible general rates is given by: \vspace{0.15cm}
$$\Rcal = \{\rmbf(h, \Pmbb) \in [0,1]^q\,:\, h \in \Hcal \}.$$ 
\vspace{0.15cm}

The quadratic metric in general rates is defined in the same way as Definition~\ref{def:quadmet} as follows:

\bdefinition[Quadratic Metric in General Rates] For a vector $ \ambf \in \Rmbb^q$ and a positive semi-definite symmetric matrix $\Bmbf \in \Rmbb^{q \times q}$ with $\Vert \ambf \Vert_2^2  + \Vert \Bmbf \Vert_F^2 = 1$ (w.l.o.g.\ due to scale invariance):
% we define:
\vspace{-0.2cm}
\begin{equation}
    \phi^\quadr(\rmbf \,;\, \ambf, \Bmbf) = \inner{\ambf}{\rmbf} + \frac{1}{2} \rmbf^T \Bmbf \rmbf.
    \label{eq:generalquadmet}
\end{equation}
\vspace{-0.6cm}
\label{def:generalquadmet}
\edefinition

\bexample[Distribution matching]
\emph{
We can extend Example~\ref{ex:distmatchbin} to the multiclass case as follows. In certain applications, one needs the proportion of predictions 
%made by a classifier 
for each class (i.e., the coverage) to match a target distribution $\boldsymbol{\pi} \in \Delta_k$ 
% \cite{goh2016satisfying,narasimhan2018learning, narasimhan2019optimizing,Cotter:2019}. 
\citep{goh2016satisfying,narasimhan2018learning}. 
A %evaluation 
measure often used for this task is the squared difference between the per-class coverage and the target distribution: 
{\small$\phi^{\cov}(\rmbf) \,=\, \sum_{i=1}^k \left(\cov_i(\rmbf) - \pi_i\right)^2$}, where 
% $\cov_i(\rmbf) = 1 - \sum_{j=1}^{k-1}r_{(i-1)(k-1) + j} + \sum_{j\ne i}r_{i + (j-1)(k-1)}$.
{\small$\cov_i(\rmbf) = 1 - \sum_{j=1}^{k-1}r_{(i-1)(k-1) + j} + \sum_{j> i}r_{(j-1)(k-1)+i}+ \sum_{j<i}r_{(j-1)(k-1)+i-1}$}. 
Similar metrics can be found in the quantification literature where the target is set to the class prior $\Pmbb(Y=i)$ \citep{Fab1, 
%Fab2, 
Kar16}. %,  in combination with an additional error term. 
We capture more general quadratic distance measures for distributions, e.g.\ {\small$(\bf{\cov}(\rmbf) - \boldsymbol{\pi})^{T}\Qmbf (\bf{\cov}(\rmbf)-\boldsymbol{\pi})$} for a positive semi-definite matrix $\Qmbf \in PSD_k$ \citep{Lindsay08}.
}
\eexample
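
To make the indexing concrete, consider $k=3$, so that $q = 6$. The following worked instance assumes that $\offdiag(\cdot)$ stacks the off-diagonal entries row by row, which is the ordering implied by the index arithmetic in $\cov_i$ above:
\begin{align*}
\rmbf &= (r_1, \dots, r_6) = (R_{12}, R_{13}, R_{21}, R_{23}, R_{31}, R_{32}), \\
\cov_1(\rmbf) &= 1 - (r_1 + r_2) + r_3 + r_5 = R_{11} + R_{21} + R_{31}, \\
\cov_2(\rmbf) &= 1 - (r_3 + r_4) + r_6 + r_1 = R_{12} + R_{22} + R_{32}, \\
\cov_3(\rmbf) &= 1 - (r_5 + r_6) + r_2 + r_4 = R_{13} + R_{23} + R_{33}.
\end{align*}
In each case the first two terms recover the diagonal entry $R_{ii} = 1 - \sum_{j \neq i} R_{ij}$, and the remaining terms collect the off-diagonal entries of column $i$.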


The definitions of metric elicitation and the oracle query remain the same, except that the vector $\rmbf$ now represents the vector of general rates, and not just the diagonal rates. 


Consider Appendix~\ref{append:sec:confusion}, where we discuss the proofs of Proposition~\ref{prop:C} and Proposition~\ref{prop:f-C}, and a procedure to construct a feasible sphere of appropriate radius in the convex set of diagonal rates. The entire methodology applies to the set of general rates by \emph{just} replacing the diagonal rates in the proofs with the general rates. Thus, all the geometric properties discussed in Proposition~\ref{prop:C} for the set of diagonal rates apply to the set of general rates.
The exact geometry of the set of diagonal rates, as shown in Figures~\ref{fig:geometry}(a) and \ref{fig:geometry}(b), may differ from the geometry of the set of general rates; however, the geometric properties, including $\embf_i$ being the vertices, remain the same. 
Therefore, under the same Assumption~\ref{assump:distribution}, we can guarantee the existence of a sphere in the set of general rates, similar to Remark~\ref{as:sphere}. 

Once we guarantee a sphere in the set of general rates, we can follow LPME for eliciting linear metrics in general rates or QPME for eliciting quadratic metrics in general rates. The computational and query complexity depend on the number of unknowns, which in the case of general rates is $\tilde O(q)$ for LPME and $\tilde O(q^2)$ for QPME.

\section{Practicality of Querying Oracle}
\label{append:practicality}

Recall that in our setup, any query posed to the oracle needs to be feasible, i.e., should be achievable by some classifier (see the definition of feasible rates in Section~\ref{sec:background}). Therefore, the oracle we query can be a human expert or a group of experts who compare intuitive visualizations of rates, or can be an entire population of users (as would be the case with A/B testing).

An important practical concern in employing the proposed QPME procedure is the number of queries that must be posed to the oracle. We note that the number of queries needed by our proposal (i) is optimal (i.e.\ matches the lower bound for the problem in Theorem~\ref{thm:lb}), (ii) has only a \emph{linear} dependence on the number of unknowns (Theorem~\ref{thm:q-me}), and (iii) can be considerably reduced by making reasonable structural assumptions about the metric that reduce the number of unknowns. While our procedure's query complexity for the most general setup (with $O(k^2)$ unknowns) is $\tilde{O}(k^2)$, the quadratic dependence on $k$ is merely an artifact of there being $O(k^2)$ unknowns in this setup. For example, when the number of classes is large, one may cluster the classes from an error perspective, e.g., by assuming the same error costs for similar classes. This reduces the number of unknowns to $O(c^2)$, where $c \ll k$ is the number of clusters of classes.


We also stress that in many internet-based settings, one can deploy A/B tests to obtain preferences by aggregating feedback from a large group of participants. In this case, the entire user population serves as an oracle. Most internet-based companies run thousands of A/B tests daily, making it practical to get preferences for our metric elicitation procedure. Moreover, one can employ practical improvements such as running A/B tests with fewer participants in the initial rounds (when the rates are far apart) and switching to A/B tests with more precision in later rounds. Note that because the queries posed by our method always correspond to a feasible classifier (see Section~\ref{sec:quadme}), one can easily run comparisons between classifiers as a part of an A/B test. 

If needed, our algorithms can also work with queries that compare classification statistics directly, instead of classifiers. There has been growing work on the visualization of confusion matrices (predictive rates) for non-expert users. For example, see~\citep{beauxis2014visualization} and~\citep{shen2020designing}. With the aid of such intuitive visualizations, it is reasonable to expect human practitioners to comprehend the queries posed to them and provide us with pairwise comparisons. Moreover, we have shown that our approach is resilient to noisy responses, which enhances our confidence in its ability to handle human feedback. 
% Further, (see~\cite{beauxis2014visualization,zhang2020joint} and a more recent work by Shen et al.~\citeyearpar{shen2020designing} for visualizations of error statistics). 
In practice, the most viable option will depend on the target population. 

Finally, we would like to emphasize that because our query complexity  has only a linear dependence on the number of unknowns, and the number of unknowns can be reduced with practical structural assumptions, our proposal is as practical as the prior methods \citep{hiranandani2018eliciting, hiranandani2019multiclass} for linear metric elicitation. In fact, despite eliciting from a more flexible class, our proposal has the same dependence on the number of unknowns as those prior methods.

\section{Elicitation Guarantee for the QPME Procedure}
\label{append:sec:guarantees}
% \vskip -0.2cm
\subsection{Sample complexity bounds} Recall from Definition~\ref{def:noise} that the oracle responds correctly as long as $|\phi(\rmbf_1) - \phi(\rmbf_2)| > \epsilon_\Omega$. For simplicity, we assume that our algorithm %(and in turn the subroutine LPME) 
has  access to the population rates $\rmbf$ defined in Eq.~(1). 
In practice, we expect to estimate the rates using a sample $D\coloneqq \{(\xmbf_i, y_i)\}_{i=1}^n$ drawn from the distribution $\Pmbb$, and to query classifiers from a hypothesis class $\mathcal{H}$ with finite capacity. Standard generalization bounds (e.g.~\cite{Daniely:2015}) give us that, with high probability over the draw of $D$, the estimates $\hat{\rmbf}$ are close to the population rates $\rmbf$, up to the desired tolerance $\epsilon_\Omega$, 
as long as we have sufficient samples. Further, since the metrics $\phi$ are Lipschitz w.r.t.\ rates, with high probability, %the LPME routine 
we thus gather correct oracle feedback from querying with finite sample estimates $\Omega(\hat{\rmbf}_1, \hat{\rmbf}_2)$.

More formally, for $\delta \in (0,1)$, as long as the sample size $n$ is greater than ${O\big(\sfrac{\log(|\mathcal{H}|/\delta)}{\epsilon_\Omega^2}\big)}$, the guarantee in Theorem 1 holds with probability at least $1 - \delta$ (over the draw of $D$), where $|\mathcal{H}|$ can in turn be replaced by a measure of capacity of the hypothesis class $\mathcal{H}$. For example, one can show the following corollary to Theorem \ref{thm:q-me} for a
hypothesis class $\mathcal{H}$ in which each classifier is a randomized combination of a finite number of deterministic classifiers chosen from a set $\bar{\mathcal{H}}$, and whose capacity is measured in terms of the Natarajan dimension~\citep{Natarajan:1989} of $\bar{\mathcal{H}}$.
\begin{corollary}
Suppose the hypothesis class $\mathcal{H}$ of randomized classifiers used to choose queries to the oracle is  of the form:
$$\mathcal{H} =\bigg\{x \mapsto \sum_{t=1}^T\alpha_t h_t(x) \,\bigg|\, T \in \mathbb{Z}_+, \alpha \in \Delta_T, h_1, \ldots, h_T \in \bar{\mathcal{H}}\bigg\},$$ for some class $\bar{\mathcal{H}}$ of deterministic multiclass classifiers $h: \mathcal{X} \rightarrow \{0,1\}^k$. Suppose the deterministic hypothesis class $\bar{\mathcal{H}}$  has   Natarajan dimension $d > 0$, and $\phi$ is $1$-Lipschitz. Then for any $\delta \in (0,1)$,
as long as the  sample size $n 
\geq O\Big(\frac{d\log(k) + \log(1/\delta)}{\epsilon_\Omega^2}\Big)$, the guarantee in Theorem 1 holds with probability at least $1 - \delta$ (over the draw of $D = \{(\xmbf_i, y_i)\}_{i=1}^n$ from $\Pmbb$).
\label{append:cor:finite}
\end{corollary}
The proof adapts generalization bounds from~\cite{Daniely:2015}, and uses the fact that the predictive rate for any randomized classifier in $\mathcal{H}$ is a convex combination of rates for deterministic classifiers in $\bar{\mathcal{H}}$ (due to linearity of expectation). 

\subsection{Proofs}
Before presenting the proof of Theorem \ref{thm:q-me}, we re-write the LPME guarantees from~\citep{hiranandani2019multiclass} for linear metrics in the presence of an oracle noise parameter $\epsilon_\Omega$ from Definition~\ref{def:noise}. 

\blemma[LPME guarantees with oracle noise~\citep{hiranandani2019multiclass}]
\label{lem:LPMEwnoise}
Let the oracle $\Omega$'s metric be $\phi^{\text{lin}} = \inner{\ambf}{\rmbf}$ and its feedback noise parameter from Definition~\ref{def:noise} be $\epsilon_\Omega$. Then, if the LPME procedure (Algorithm~\ref{alg:slme}) is run using a 
sphere $\Scal \subset \Rcal$ of radius $\varrho$ and the  binary-search tolerance $\epsilon$, then by posing $O(k\log(1/\epsilon))$
queries it recovers coefficients $\ambfhat$ with $\Vert \ambf - \ambfhat \Vert_2 \leq O\left(\sqrt{k}(\epsilon + \sqrt{\epsilon_\Omega/\varrho})\right)$.
\elemma

\bproof[Proof of Theorem~\ref{thm:q-me}] 

We first find the smoothness coefficient of the metric in Definition~\ref{def:quadmet}.

A function $\phi$ is said to be $L$-smooth if for some bounded constant $L$, we have:

$$
\Vert \nabla \phi(x) - \nabla \phi(y) \Vert_2 \leq L\Vert x - y \Vert_2.
$$

For the metric in Definition~\ref{def:quadmet}, we have:
\begin{align*}
\Vert \nabla \phi^{\text{quad}}(x) - \nabla \phi^{\text{quad}}(y) \Vert_2 &= \Vert \ambf + \Bmbf x - (\ambf + \Bmbf y) \Vert_2 \\
&\leq \Vert \Bmbf \Vert_2 \Vert x - y \Vert_2\\
% &= \sigma_{\text{max}} \Vert x - y \Vert_2,
&\leq \Vert \Bmbf \Vert_F \Vert x - y \Vert_2\\
&\leq 1\cdot\Vert x - y \Vert_2,
\end{align*}
where in the last step, we have used the scale invariance condition from Definition~\ref{def:quadmet}, i.e., $\Vert \ambf \Vert^2_2 + \Vert \Bmbf \Vert^2_F = 1$, which implies that  $\Vert \Bmbf \Vert^2_F = 1 - \Vert \ambf \Vert^2_2 \leq 1$. 
% $\sigma_{\text{max}}$ is the maximum singular value of the matrix $\Bmbf$. By Assumption~\ref{assump:smoothness}, $\sigma_{\text{max}}$ is bounded; 
Hence, the metrics in Definition~\ref{def:quadmet} are $1$-smooth. 

Now, we look at the error incurred when the metric $\phi^{\text{quad}}$ in Definition~\ref{def:quadmet} is replaced by its first-order Taylor approximation. Our metric is 

$$
\phi^{\text{quad}}(\rmbf) = \inner{\ambf}{\rmbf} + \frac{1}{2}\rmbf^T\Bmbf\rmbf.
$$

We approximate it with the first order Taylor polynomial around a point $\zmbf$, which we define as follows:

$$
T_1(\rmbf) = \inner{\ambf}{\zmbf} + \frac{1}{2}\zmbf^T\Bmbf\zmbf + \inner{\ambf + \Bmbf\zmbf}{\rmbf - \zmbf}.
$$
The bound on the error  in this approximation is:
\begin{align*}
\vert E(\rmbf) \vert &= \vert \phi^{\text{quad}}(\rmbf) - T_1(\rmbf) \vert   \\
&= \frac{1}{2} \vert (\rmbf -\zmbf)^T \nabla^2\phi^{\text{quad}}|_\cmbf  (\rmbf -\zmbf) \vert \qquad\qquad  \text{(First-order Taylor approximation error)} \\
&= \frac{1}{2} \vert (\rmbf -\zmbf)^T \Bmbf  (\rmbf -\zmbf) \vert \qquad\qquad\qquad \;\;\; \text{(Hessian at any point $\cmbf$ is the matrix $\Bmbf$)}\\
% &\leq \frac{1}{2}\sigma_{\text{max}} \Vert \rmbf - \zmbf \Vert _2 \\
% &= \frac{1}{2}\sigma_{\text{max}} \varrho,
&\leq \frac{1}{2}\Vert \Bmbf \Vert_2 \Vert \rmbf - \zmbf \Vert _2^2 \\
&\leq \frac{1}{2}\Vert \Bmbf \Vert_F \varrho^2 \\
&\leq  \frac{1}{2} \varrho^2 \qquad\qquad\qquad\qquad\qquad\qquad\qquad \text{(Due to the scale invariance condition)}
\end{align*}
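
As a sanity check on the first two equalities above, the remainder can be computed directly. The sketch below assumes that the first-order expansion is taken in the standard form $T_1(\rmbf) = \phi^{\text{quad}}(\zmbf) + \inner{\ambf + \Bmbf\zmbf}{\rmbf - \zmbf}$, and it uses the symmetry of $\Bmbf$:
\begin{align*}
\phi^{\text{quad}}(\rmbf) - T_1(\rmbf) &= \inner{\ambf}{\rmbf} + \frac{1}{2}\rmbf^T\Bmbf\rmbf - \inner{\ambf}{\zmbf} - \frac{1}{2}\zmbf^T\Bmbf\zmbf - \inner{\ambf}{\rmbf - \zmbf} - \zmbf^T\Bmbf(\rmbf - \zmbf) \\
&= \frac{1}{2}\rmbf^T\Bmbf\rmbf - \zmbf^T\Bmbf\rmbf + \frac{1}{2}\zmbf^T\Bmbf\zmbf \\
&= \frac{1}{2}(\rmbf - \zmbf)^T\Bmbf(\rmbf - \zmbf).
\end{align*}
Because the metric is exactly quadratic, no higher-order terms appear, so this expression for the remainder is exact; only the subsequent norm inequalities are upper bounds.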

So when the oracle is asked $\Omega(\rmbf_1, \rmbf_2) = \1[\phi^{\text{quad}}(\rmbf_1) > \phi^{\text{quad}}(\rmbf_2)]$, the approximation error can be treated as feedback error from the oracle with feedback noise 
% $2\times \frac{1}{2} \sigma_{\text{max}}\rho$. 
$2\times \frac{1}{2} \varrho^2$. 
Thus, the overall feedback noise by the oracle is 
% $\epsilon_\Omega + \sigma_{\text{max}}\rho$.
$\epsilon_\Omega + \varrho^2$ for the purposes of using Lemma~\ref{lem:LPMEwnoise} later. 

We first prove guarantees for the matrix $\Bmbf$ and then for the vector $\ambf$. We write Equation~\eqref{eq:poly2elicitamatfinal} in the following form assuming $d_1 = 1$ (since we normalize the coefficients at the end due to scale invariance): 

\begin{align*}
B_{ij} &= F_{ij} =  \left[\frac{f_{ij}}{f_{1j}}\left(1 + \frac{f_{j1}}{f_{11}} \right) - \frac{f_{ij}}{f_{1j}}\frac{f_{j0}}{f_{10}} - \frac{f_{i0}}{f_{10}} + \frac{f_{ij}}{f_{1j}}\frac{f_{j1}}{f_{11}} \frac{ \frac{f_{21}^-}{f_{11}^-} + \frac{f_{21}}{f_{11}} - 2\frac{f_{20}}{f_{10}}  }{ \frac{f_{21}^-}{f_{11}^-} - \frac{f_{21}}{f_{11}}  }\right]. \\
\implies \Bmbf[:, j] &= \fmbf_j\left( \frac{1}{f_{1j}} + \frac{f_{j1}}{f_{1j}f_{11}} - \frac{f_{j0}}{f_{1j}f_{10}} + \frac{f_{j1}}{f_{1j}f_{11}}\left(  \frac{ \frac{f_{21}^-}{f_{11}^-} + \frac{f_{21}}{f_{11}} - 2\frac{f_{20}}{f_{10}}  }{ \frac{f_{21}^-}{f_{11}^-} - \frac{f_{21}}{f_{11}}  } \right) \right) - \fmbf_0\frac{1}{f_{10}} \\
&= c_j\fmbf_j - c_0\fmbf_0, \numberthis \label{eq:Bj}
\end{align*} 
where $\Bmbf[:, j]$ is the $j$-th column of the matrix $\Bmbf$, and the constants $c_j$ and $c_0$ are well-defined due to the regularity Assumption~\ref{as:regularity-q}. Notice that,
$$
\frac{\partial \Bmbf[:, j]}{\partial \fmbf_j} = \diag(\cmbf'_j)\odot\Imbf \quad, \text{and} \quad 
\frac{\partial \Bmbf[:, j]}{\partial \fmbf_0} = \diag(\cmbf'_0)\odot\Imbf,
$$
where $\cmbf'_j, \cmbf'_0$ are vectors of Lipschitz constants (bounded due to Assumption~\ref{as:regularity-q}). This implies

\begin{align*}
\Vert \Bmbfbar[:, j] - \Bmbfhat[:, j]\Vert_2 &\leq c'_j \Vert \fmbfbar_j - \fmbfhat_j \Vert_2 + c'_0 \Vert \fmbfbar_0 - \fmbfhat_0 \Vert_2\\
&\leq c'_j\sqrt{k}\left(\epsilon + \sqrt{\varrho + \epsilon_\Omega/\varrho}\right) + c'_0\sqrt{k}\left(\epsilon + \sqrt{\varrho + \epsilon_\Omega/\varrho}\right) \\
&= O\left(\sqrt{k}\left(\epsilon + \sqrt{\varrho + \epsilon_\Omega/\varrho}\right)\right),
\end{align*}
where we have used LPME guarantees from Lemma~\ref{lem:LPMEwnoise} under the oracle-feedback noise parameter $\epsilon_\Omega + \varrho^2$. 

The above inequality provides bounds on each column of $\Bmbf$. Since $\Vert \xmbf \Vert_\infty \leq \Vert \xmbf \Vert_2$, we have $\max_{ij}\vert B_{ij} - \hat{B}_{ij} \vert \leq O\left(\sqrt{k}\left(\epsilon + \sqrt{\varrho + \epsilon_\Omega/\varrho}\right)\right)$, and consequently, $\Vert \Bmbf - \Bmbfhat \Vert_F \leq O\left(k\sqrt{k}\left(\epsilon + \sqrt{\varrho + \epsilon_\Omega/\varrho}\right)\right)$. 

Now let us look at guarantees for $\ambf$. Since $\ambf = \dmbf - \Bmbf\ombf$ from~\eqref{eq:quadmetshift}, we can write 

$$
\ambf = c_0\fmbf_0 - \sum_{j=1}^k o_j\Bmbf[:, j],
$$
where $c_0 = 1/f_{10}$. Since $\ombf$ is the rate achieved by random classifier, $o_j = 1/k \; \forall j \in [k]$, and thus we have
$$
\frac{\partial \ambf}{\partial \fmbf_0} = c_0\Imbf \quad \text{and} \quad \frac{\partial \ambf}{\partial \Bmbf[:, j]} = \frac{1}{k}\Imbf.
$$
Thus,
\begin{align*}
\Vert \ambf - \ambfhat \Vert_2 &\leq c'_0 \Vert \fmbfbar_0 - \fmbfhat_0 \Vert_2 + \frac{1}{k}\sum_{j=1}^k \Vert \Bmbfbar[:, j] - \Bmbfhat[:, j] \Vert_2 \\
&\leq c'_0 \sqrt{k}\left(\epsilon + \sqrt{\varrho + \epsilon_\Omega/\varrho}\right) + \frac{1}{k}\sum_{j=1}^k c'_j\sqrt{k}\left(\epsilon + \sqrt{\varrho + \epsilon_\Omega/\varrho}\right) \\
&= O\left(\sqrt{k}\left(\epsilon + \sqrt{\varrho + \epsilon_\Omega/\varrho}\right)\right),
\end{align*}
where $c'_0, c'_j$'s are some Lipschitz constants (bounded due to Assumption~\ref{as:regularity-q}).
% , and we have used the fact that $q = k^2 - k$ in the second step.
\eproof

Notice the trade-off in the elicitation error that depends on the size of the sphere. As expected, when the radius of the sphere $\varrho$ increases, the error due to approximation increases, but at the same time, error due to feedback reduces because we get better responses from the oracle. In contrast, when the radius of the sphere $\varrho$ decreases, the error due to approximation decreases, but the error due to feedback increases.
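
To make this trade-off concrete, one may minimize the radius-dependent term under the square root. This is only a back-of-the-envelope sketch: it ignores the hidden constants and the requirement that the sphere of radius $\varrho$ remain feasible, and it assumes $\epsilon_\Omega > 0$. Writing $g(\varrho) \coloneqq \varrho + \epsilon_\Omega/\varrho$, we have
\begin{equation*}
g'(\varrho) = 1 - \frac{\epsilon_\Omega}{\varrho^2} = 0 \;\implies\; \varrho^* = \sqrt{\epsilon_\Omega}, \qquad g(\varrho^*) = 2\sqrt{\epsilon_\Omega}, \qquad \sqrt{g(\varrho^*)} = \sqrt{2}\,\epsilon_\Omega^{1/4}.
\end{equation*}
Under these simplifications, a radius on the order of $\sqrt{\epsilon_\Omega}$ balances the two sources of error, and the resulting noise-dependent part of the bound scales as $\epsilon_\Omega^{1/4}$.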

The following corollary translates our guarantees on the elicited metric to the guarantees on the optimal rate of the elicited metric. This is useful in practice, because the optimal classifier (rate) obtained by optimizing a certain metric is often the key entity for many applications. 

\bcorollary
Let $\phi^{\quadr}$ be the original quadratic metric of the oracle and $\hat\phi^{\quadr}$ be its estimate obtained by the QPME procedure (Algorithm~1). Moreover, let $\rmbf^*$ and $\hat\rmbf^*$ be the maximizers of $\phi^{\quadr}$ and $\hat\phi^{\quadr}$, respectively. Then,  $\phi^{\quadr}(\rmbf^*) \leq \phi^{\quadr}(\hat\rmbf^*) + O\left(k^2\sqrt{k}\left(\epsilon + \sqrt{\varrho + \epsilon_\Omega/\varrho}\right)\right).$
% Then, if $\vert \phi^{\quadr}(r) - \hat\phi^{\quadr}(r)| <= \epsilon$ for all rates $r$  and some slack $\epsilon$ (as shown in Theorem~\ref{thm:q-me}), it follows that $\phi^{\quadr}(\hat\rmbf^*) <= \phi^{\quadr}(\rmbf^*) + 2\epsilon.$
\ecorollary

\bproof
We first show that if $\vert \phi^{\quadr}(\rmbf) - \hat\phi^{\quadr}(\rmbf)\vert \leq \epsilon$ for all rates $\rmbf$ and some slack $\epsilon$, then it follows that $\phi^{\quadr}(\hat\rmbf^*) \geq \phi^{\quadr}(\rmbf^*) - 2\epsilon.$ This is because:

\begin{align*}
    \phi^{\quadr}(\hat\rmbf^*) &\geq \hat\phi^{\quadr}(\hat\rmbf^*) - \epsilon \qquad\qquad \left(\text{as $\hat\phi^{\quadr}$ approximates $\phi^{\quadr}$}\right)\\
    &\geq \hat\phi^{\quadr}(\rmbf^*) - \epsilon \qquad\qquad \left(\text{as $\hat\rmbf^*$ maximizes $\hat\phi^{\quadr}$}\right) \\
    &\geq \phi^{\quadr}(\rmbf^*) - 2\epsilon \qquad\quad\;\; \left(\text{as $\hat\phi^{\quadr}$ approximates $\phi^{\quadr}$}\right) \numberthis \label{eq:metapx}
\end{align*}

Now, let us derive a simple bound on $\vert \phi^{\quadr}(\rmbf) - \hat\phi^{\quadr}(\rmbf)\vert$ for any rate $\rmbf$. 

\begin{align*}
    \vert \phi^{\quadr}(\rmbf) - \hat\phi^{\quadr}(\rmbf)\vert &= \vert \inner{\ambf - \hat \ambf}{\rmbf} + \frac{1}{2}\rmbf^T (\Bmbf - \hat\Bmbf)\rmbf \vert \\
    &\leq \vert \inner{\ambf - \hat \ambf}{\rmbf} \vert + \frac{1}{2}\vert \rmbf^T (\Bmbf - \hat\Bmbf)\rmbf \vert\\
    &\leq \Vert \ambf - \hat\ambf \Vert_2 \Vert \rmbf\Vert_2 + \frac{1}{2}\Vert \Bmbf - \hat\Bmbf \Vert_2 \Vert \rmbf\Vert_2^2\\
    &\leq \Vert \ambf - \hat\ambf \Vert_2 \sqrt k + \frac{1}{2}\Vert \Bmbf - \hat\Bmbf \Vert_F k\\
    &\leq O\left(k^2\sqrt{k}\left(\epsilon + \sqrt{\varrho + \epsilon_\Omega/\varrho}\right)\right), \numberthis \label{eq:metapx2}
\end{align*}
where in the fourth step, we have used the fact that the rates are bounded in $[0, 1]$; hence $\Vert \rmbf \Vert_2 \leq \sqrt{k}$, and in the fifth step, we have used the guarantees from Theorem~\ref{thm:q-me}. Combining~\eqref{eq:metapx} and~\eqref{eq:metapx2} gives us the desired result. 
\eproof

% \btheorem
% For any $\epsilon > 0$, at least $\Omega(q^2\log(1/q\sqrt q\epsilon))$ pairwise queries are needed to 
% to elicit a quadratic metric $\phi^{\quadr}$ (Def.\ \ref{def:quadmet})
% to an error tolerance of $q\sqrt q\epsilon$.
% \etheorem

\bproof[Proof of Theorem~\ref{thm:lb}]
For the purpose of this proof, let us replace $\left(\epsilon + \sqrt{\varrho + \epsilon_\Omega/\varrho}\right)$ by some slack $\epsilon$. Theorem 1 guarantees that after running the QPME procedure for $O(k^2\log(1/\epsilon))$ queries, we have 

\begin{itemize}
\item $\norm {a - \hat a}_2 \leq O(\sqrt k \epsilon)$
\item $\norm {B - \hat B}_F \leq O(k\sqrt k\epsilon).$
\end{itemize}

If we vectorize the tuple $(\ambf, \Bmbf)$ and denote it by $w$, we have $\norm{w - \hat w}_2 \leq O(k\sqrt k\epsilon)$, where $\Vert w\Vert_2 = \Vert \hat w\Vert_2=1$ due to the scale invariance condition from Definition~\ref{def:quadmet}. Note that $w$ is a $\frac{k^2 + 3k}{2}$-dimensional vector and defines the scale-invariant quadratic metric elicitation problem. 
Now, we have to count the minimum number of distinct estimates $\hat w$ needed so that every $w$ satisfies $\norm{w - \hat w}_2 \leq O(k\sqrt k\epsilon)$ for at least one of them.
% , where both $\Vert w \Vert_2, \Vert \hat w \Vert_2 = 1$.


This translates to finding the covering number of a ball in $\Vert \cdot \Vert_2$ norm with radius 1, where the covering balls have radius $k\sqrt k\epsilon$. Let us denote the cover by $\{u_i\}_{i=1}^N$ and the ball with radius 1 as $\Bmbb$. We then have:

\begin{align*}
Vol(\Bmbb) &\leq \sum_{i=1}^N Vol(k\sqrt k\epsilon \Bmbb + u_i) \\
&= N\,Vol(k\sqrt k\epsilon \Bmbb) \\
&= N(k\sqrt k\epsilon)^{\frac{k^2 + 3k}{2} - 1}\,Vol(\Bmbb).
\end{align*}

Thus the number of $\hat w$ that are possible are at least 
$$
c\left(\frac{1}{k\sqrt k\epsilon}\right)^{\frac{k^2 + 3k}{2} - 1} \leq N,
$$
where $c$ is a constant. Since each pairwise comparison provides at most one bit, at least $O(k^2)\log(\frac{1}{k\sqrt k\epsilon})$ bits, and hence queries, are required to identify a possible $\hat w$. The QPME procedure requires $O(k^2)\log(\frac{1}{\epsilon})$ queries, which is near-optimal barring log terms. 
\eproof

% \section{Extension to Higher Order Polynomials}
% \label{append:sec:poly}

% % \textbf{Extension to higher-order polynomials.}\ 
% Our approach can be generalized to \textit{higher-order polynomials} of rates. 
% %We  can extend the approach used in QPME procedure for eliciting higher order polynomial of rates. 
% Consider e.g.\ a cubic polynomial:
% \begin{align*}
% % \vspace{-0.7cm}
% % \textstyle
%     \phi^{\text{cubic}}(\rmbf)\coloneqq \sum_{i}a_ir_i + \frac{1}{2}\sum_{i,j}B_{ij}r_ir_j + \frac{1}{6}\sum_{i,j,l}C_{ijl}r_ir_jr_l,
% \end{align*}
% % \vskip -0.3cm
% where $\Bmbf$ and $\Cmbf$ are symmetric, and $\sum_i a_i^2 +\sum_{ij} B_{ij}^2 + \sum_{ijl} C_{ijl}^2 = 1$ (w.l.o.g., due to scale invariance).  A quadratic approximation to this metric around a point $\zmbf$ is given by: 
% $$\sum_{i}a_ir_i + \frac{1}{2}\left(\sum_{i,j}B_{ij}r_ir_j + \sum_{i,j,l}C_{ijl}(r_i - z_i)(r_j - z_j)z_l\right) + c,$$ 
% where $c$ is a constant not affecting the oracle responses. We can estimate the parameters of this  approximation by applying the QPME procedure from Algorithm~1 with the metric centered at an appropriate point, and its queries restricted to a small neighborhood around $\zmbf$. Running QPME once using a sphere around the point $\zmbf_l = \ombf + (\varrho - \varrho')\alphambf_l$, where $\varrho' < \varrho$ will elicit one face of the tensor $\Cmbf_{[:, :, l]}$ upto a scaling factor. Thus, it will require us to run the QPME procedure $k$ times around the basis points $\zmbf_l = \ombf + (\varrho - \varrho')\alphambf_l \; \; \forall l \in [k]$. Since we elicit scale-invariant quadratic approximation, we would need additional run of QPME procedure around the point $\Scal_{-\zmbf_1}$ to elicit all the coefficients. Thus, we can recover the metric $\hat{\phi}^{\text{cubic}} = (\hat{\ambf}, \hat{\Bmbf},\hat{\Cmbf})$ with as many queries as the number of unknowns, i.e, $\tilde O(k^3)$ in the cubic case. 

% For a $d$-th order polynomial, one can recursively apply this procedure to estimate $(d-1)$-th order approximations at multiple points, and similarly derive the polynomial coefficients from the estimated local approximations.

\section{Extended Experiments}
\label{append:sec:extexp}

The source code is provided along with the supplementary material. The experiments in this paper were conducted on a machine with the following configuration: \emph{2.6 GHz 6-core Intel i7 processor with 16GB RAM.} 

\subsection{More Details on Simulated Experiments on Quadratic Metric Elicitation} 
% (Section~\ref{sec:experiments})}
\label{append:ssec:details}

\textbf{Number of queries.} First, we look at the number of queries that were actually required to elicit quadratic and fair (quadratic) metrics in Section~\ref{sec:experiments}. Recall that the QPME procedure (Algorithm~1) requires running the LPME subroutine (Algorithm~\ref{alg:slme}) $k+2$ times. As discussed in Appendix~\ref{append:sec:slme}, each run of LPME requires at most $3 \times 3 \times k \log (\pi/2\epsilon)$ queries. So, the maximum number of queries required to elicit a quadratic metric is $(k+2)\times 3 \times 3 \times k \times 8$ for a binary-search tolerance $\epsilon=10^{-2}$, where we vary $k \in \{2, 3, 4, 5\}$ in the experiments. 
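
As a quick sanity check on this worst-case count, the bound can be evaluated for each $k$. This is a sketch under the assumption that the logarithm is taken base $2$ and rounded up to the next integer, which is consistent with the factor $8$ above:
\begin{align*}
\left\lceil \log_2\!\left(\frac{\pi}{2 \times 10^{-2}}\right) \right\rceil &= \lceil \log_2(157.08) \rceil = \lceil 7.30 \rceil = 8, \\
(k+2) \times 3 \times 3 \times k \times 8 &= 72\,k(k+2) = 576,\; 1080,\; 1728,\; 2520 \quad \text{for } k = 2, 3, 4, 5.
\end{align*}
The averages reported for QPME in Table~\ref{tab:numqueries} all lie below these worst-case values.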

Note, however, that the elicitation error shown in Figure~\ref{fig:recovery} is averaged over 100 simulated oracles, each with its own simulated quadratic metric. Due to the nature of the binary search involved in the LPME subroutine (see Algorithm~\ref{alg:slme} and Figure~\ref{append:fig:shrink1}), not every reduction of the search interval requires three queries; often the interval can be shrunk with fewer than three queries. The actual number of queries may therefore vary across the oracles. The number of queries averaged over the 100 oracles corresponding to the experiments in Section~\ref{sec:experiments} is shown in Table~\ref{tab:numqueries}. 

Theorem~\ref{thm:lb} shows that our query complexity (which is linear in the number of unknowns) matches the lower bound for the problem, which means that it is theoretically impossible to obtain a better complexity order for our problem setup. In practice, the number of queries can be considerably reduced by making reasonable assumptions on the metric. 
For example, when the number of classes is large, one may cluster the classes from an error perspective, i.e., assume the same error costs for similar classes. This will reduce the number of unknowns to $O(c^2)$, where $c \ll k$ is the number of class clusters.

\begin{table}[t]
    \centering
    \begin{tabular}{|c|c|}
    \hline
        \multicolumn{2}{|c|}{Number of queries for QPME}\\\hline
         $k$ & Avg.\ queries \\
         \hline
         2 & 265.43\\
         3 & 669.29\\
         4 & 1205.91\\
         5 & 1879.74\\
         \hline
    \end{tabular}
    \text{\;}
    \begin{tabular}{|c|c|c|c|c|}
    \hline
        \multicolumn{5}{|c|}{Number of queries for Fair-QPME}\\\hline
         \diagbox{$k$}{$m$}  & 2 & 3 & 4 & 5  \\ \hline
         2 & 332.10 & 867.65 & 1663.73 & 2738.59 \\
         3 & 796.37 & 2127.44 & 4094.31 & 6734.15\\
         4 & 1398.14 & 3808.67 & 7363.82 & 12180.84\\
         5 & 2130.92 & 5887.99 & 11454.18 & 18999.71\\
         \hline
    \end{tabular}
    \caption{Number of queries required for eliciting regular quadratic metrics (Def.~\ref{def:quadmet}) and fairness quadratic metrics (Def.~\ref{def:f-linmetric}) in Section~\ref{sec:experiments}. The number of predictive rates and sensitive groups are denoted by $k$ and $m$, respectively. Recall that a quadratic metric has $O(k^2)$ unknowns. We see that the number of queries is of order $O(k^2)$ for the quadratic metric in rates, and additionally, $O(m^2k^2)$ for the fair (quadratic) metrics. Theorem~\ref{thm:lb} shows that one cannot improve on this query complexity. Nonetheless, one may make more structural assumptions on the metric to bring down the number of queries in practice.}
    \label{tab:numqueries}
\end{table}

\textbf{Comparison to a baseline.} In Figures~\ref{fig:q_rec_a}--\ref{fig:q_rec_B}, we show box plots of the $\ell_2$ (Frobenius) norm of the difference between the true and elicited linear (quadratic) coefficients. 
We generally find that QPME is able to elicit metrics close to the true ones.

\begin{figure*}[h]
	\centering 
	\subfigure{
		{\includegraphics[width=5cm]{plots/qme_a_baseline.pdf}}
		\label{fig:rec_a}
	}\quad\quad
	\subfigure{
		{\includegraphics[width=5cm]{plots/qme_B_baseline.pdf}}
		\label{fig:rec_l}
	}
	\caption{Elicitation error in comparison to a baseline which assigns equal coefficients.}
	\label{append:fig:baseline}
\end{figure*}

To reinforce this point, we also compare the elicitation error of the QPME procedure with that of a baseline which assigns equal coefficients to $\ambf$ and $\Bmbf$ in Figure~\ref{append:fig:baseline}. We see that the elicitation error of the baseline is an order of magnitude higher than that of the QPME procedure. This holds for varying $k$, showing that the QPME procedure elicits the oracle's multiclass quadratic metrics very well. 

\textbf{Effect of Assumption~\ref{as:regularity-q}}. 
The larger standard deviation for $k=5$ in Figure~\ref{fig:recovery} is due to Assumption~\ref{as:regularity-q} failing to hold with sufficiently large constants $c_{0}, c_{-1}, c_1, \ldots, c_q$ in a small number of trials, which makes the resulting estimates less accurate. 
% We mentioned in Section \ref{sec:experiments} that in a small number of trials,  Assumption~\ref{as:regularity-q} failed to hold with sufficiently large constants $c_{0},c_{-1}, c_1 \ldots, c_q$.
We now analyze in greater detail the effect of this regularity assumption in eliciting quadratic metrics and understand how the lower bounding constants %in Assumption~\ref{as:regularity-q}  
impact the elicitation error. Assumption~\ref{as:regularity-q} effectively ensures that the ratios computed in~\eqref{eq:poly2elicitamatfinal} are well-defined. To this end, we generate two sets of 100 quadratic metrics. One set is generated following Assumption~\ref{as:regularity-q} with one coordinate in the gradient being greater than $10^{-2}$, and the other is generated randomly without any regularity condition. For both sets, we run QPME and elicit the corresponding metrics. 

\begin{figure*}[h]
	\centering 
	\subfigure{
		{\includegraphics[width=5cm]{plots/qme_a_append.pdf}}
		\label{fig:rec_a}
	}\quad\quad
	\subfigure{
		{\includegraphics[width=5cm]{plots/qme_B_append.pdf}}
		\label{fig:rec_l}
	}
	\caption{Elicitation error for metrics following Assumption~\ref{as:regularity-q} vs elicitation error for completely random metrics.}
	\label{append:fig:regassump}
\end{figure*}

In Figure~\ref{append:fig:regassump}, we see that the elicitation error is much higher when Assumption~\ref{as:regularity-q} is not satisfied, because the ratio computation in~\eqref{eq:poly2elicitamatfinal} is more susceptible to errors when gradient coordinates approach zero, as happens for some randomly generated metrics. The dash-dotted curve (in red) shows the trajectory of the theoretical bounds with increasing $k$ (within a constant factor). We also see that the mean of the $\ell_2$ (analogously, Frobenius) norm follows the theoretical bound trajectory more closely when the metrics satisfy Assumption~\ref{as:regularity-q}.

We next analyze the ratio of estimated fractions to the true fractions used in~\eqref{eq:poly2elicitamatfinal} over 1000 simulated runs. Ideally, this ratio should be 1, but as we see in Figure~\ref{append:fig:ratio}, the estimated ratios can be off by a significant amount in a few trials when the metrics are generated randomly. The estimated ratios are, however, more stable under Assumption~\ref{as:regularity-q}. Since we multiply fractions in~\eqref{eq:poly2elicitamatfinal}, even under the assumption we may observe a compounding effect of fraction estimation errors in the final estimates. Hence, for $k=5$ in Figures~\ref{fig:q_rec_a}--\ref{fig:q_rec_B}, the standard deviation is high due to a few trials where the lower bound of $10^{-2}$ on the constants in Assumption~\ref{as:regularity-q} may not be enough. However, the majority of the trials, as shown in Figures~\ref{fig:q_rec_a}--\ref{fig:q_rec_B} and Figure~\ref{append:fig:baseline}, incur low elicitation error. 

\begin{figure*}[t]
	\centering 
	\subfigure{
		{\includegraphics[width=4cm]{plots/frac_err_k=4_wf=True.png}}
		\label{fig:rec_l}
	} 
	\subfigure{
		{\includegraphics[width=4cm]{plots/frac_err_k=4_wf=False.png}}
		\label{fig:rec_a}
	} \\
	\subfigure{
		{\includegraphics[width=4cm]{plots/frac_err_k=5_wf=True.png}}
		\label{fig:rec_l}
	}
	\subfigure{
		{\includegraphics[width=4cm]{plots/frac_err_k=5_wf=False.png}}
		\label{fig:rec_a}
	}
	\caption{Ratio of estimated to true fractions over 1000 simulated runs with and without Assumption~\ref{as:regularity-q}.}
	\label{append:fig:ratio}
\end{figure*}


\subsection{Ranking of Real-World Classifiers}
\label{append:ssec:ranking}

Performance metrics provide quantifiable scores to classifiers. These scores are then often used to rank classifiers and select the best set of classifiers in practice. In this section, we discuss the benefits of elicited metrics in comparison to some default metrics when ranking real-world classifiers. 

\begin{table}[t]
\centering
\caption{Dataset statistics}
\begin{tabular}{|c|ccc|}
\hline
\textbf{Dataset} & $k$ & \textbf{\#samples} & \textbf{\#features} \\ 
\hline
default        & 2   & 30000           &    33    \\
adult        &  2  &    43156       &    74   \\
sensIT Vehicle        &  3  & 98528          &     50   \\
covtype        &  7 &    581012       &     54 \\ 
\hline
\end{tabular}
\label{append:tab:stats}
\end{table}

\textbf{Ranking in case of quadratic metrics:} For this experiment, we work with four real-world datasets with varying numbers of classes $k\in \{2,3, 7\}$. See Table~\ref{append:tab:stats} for details of the datasets. We use 60\% of each dataset to train classifiers. The rest of the data is used to compute (test) predictive rates. For each dataset, we create a pool of 80 classifiers by varying hyper-parameters in well-known machine learning models that are routinely used in practice. Specifically, we create 20 classifiers each from logistic regression models~\citep{kleinbaum2002logistic}, multi-layer perceptron models~\citep{pal1992multilayer}, LightGBM models~\citep{ke2017lightgbm}, and support vector machines~\citep{joachims1999svmlight}. 
% and fairness constrained optimization based models~\cite{narasimhan2019optimizing}. 
We compare the rankings of these 80 classifiers provided by competing baseline metrics against the ground-truth ranking, which is provided by the oracle's true metric. 

We generate a random quadratic metric $\phi^{\text{quad}}$ following Definition~\ref{def:quadmet} and treat it as the oracle's true metric. It provides the ground-truth ranking of the classifiers in the pool. We then use our proposed QPME procedure (Algorithm~1) to recover the oracle's metric. For comparison, we choose as baselines two linear metrics that are routinely employed by practitioners. The first is accuracy $\phi^{\text{acc}} = 1/\sqrt{k}\inner{\bm{1}}{\rmbf}$, and the second is weighted accuracy, where we use only the linear part $\inner{\ambf}{\rmbf}$ of the oracle's true quadratic metric $\inner{\ambf}{\rmbf} + \frac{1}{2}\rmbf^T\Bmbf\rmbf$. We repeat this experiment over 100 trials. 

\begin{figure*}[h]
	\centering % not necessary
	%	\subcaptionbox{P(y|x), a = 2}%
	%[.3\linewidth]%
	\subfigure{
		{\includegraphics[width=5cm]{plots/rank_ndcg.png}}
		\label{fig:rec_B}
		%		\caption{a = 1}
	}\quad\quad
	\subfigure{
		{\includegraphics[width=5cm]{plots/rank_kdtau.png}}
		\label{fig:rec_l}
		%		\caption{a = 1}
	}
% 	\vskip -0.4cm
	\caption{Performance of competing metrics while ranking real-world classifiers. `elicited' is the metric elicited by QPME, `linear' is the metric that comprises only the linear part of the oracle's true quadratic metric, and `accuracy' is the linear metric which weighs all classification errors equally (often used in practice).}
% 	\caption{Elicitation error in recovering the oracle's metric: (left) $\Vert \ambfbar - \ambfhat \Vert_2$, (centre) $\Vert \text{vec}(\Bmbfbar) - \text{vec}(\Bmbfhat) \Vert_2$, (right) $\vert \lambdabar - \lambdahat \vert$.}
	\label{append:fig:ranking}
% 	\vskip -0.45cm
\end{figure*}

We report NDCG (with exponential gain)~\citep{valizadegan2009learning} and the Kendall-tau coefficient~\citep{shieh1998weighted} averaged over the 100 trials in Figure~\ref{append:fig:ranking}. We observe consistently for all the datasets that the metrics elicited using the QPME procedure achieve the highest possible NDCG and Kendall-tau coefficient of 1. As we saw in Section~\ref{sec:guarantees}, QPME may incur elicitation error, and thus the elicited metrics may not be exact; however, Figure~\ref{append:fig:ranking} shows that the elicited metrics still achieve near-optimal ranking results. This implies that, given a set of classifiers, the ranking based on elicited metric scores aligns most closely with the true ranking in comparison to rankings based on default metric scores. Consequently, the elicited metrics may allow us to select or discard classifiers for a given task, which is advantageous in practice. 
For the \emph{covtype} dataset, we see that the \emph{linear} metric also achieves high NDCG values, so the ranking at the top is perhaps quite accurate; however, the Kendall-tau coefficient is low, suggesting that the overall ranking of classifiers is poor. We also observe that, in general, the weighted version (\emph{linear} metric) is better than \emph{accuracy} when ranking classifiers.
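
For reference, the two ranking measures can be written in their standard textbook forms. This is a sketch only: the exact variants used in the experiments (e.g., the choice of relevance grades, or the weighted Kendall-tau of~\citep{shieh1998weighted}) may differ, and the symbols $\pi$, $\pi^*$, $g$, $C$, and $D$ are introduced just for this illustration. For a ranking $\pi$ of $n$ classifiers, where $g_{\pi(i)}$ denotes the relevance grade of the classifier placed at position $i$ and $\pi^*$ is the ground-truth ranking,
\begin{align*}
\text{DCG}(\pi) = \sum_{i=1}^{n} \frac{2^{g_{\pi(i)}} - 1}{\log_2(i+1)}, \qquad \text{NDCG}(\pi) = \frac{\text{DCG}(\pi)}{\text{DCG}(\pi^*)}, \qquad \tau(\pi) = \frac{C - D}{\binom{n}{2}},
\end{align*}
where $C$ and $D$ are the numbers of classifier pairs ordered concordantly and discordantly with $\pi^*$, respectively. With $n = 80$ classifiers there are $\binom{80}{2} = 3160$ pairs. Because of the logarithmic discount, NDCG is dominated by the top of the ranking, whereas $\tau$ weighs all pairs equally, so a metric can score high on NDCG and low on $\tau$.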

\textbf{Ranking in case of fair (quadratic) metrics:} With regard to fairness, we performed a similar experiment to compare rankings of fair classifiers on the Adult and Default datasets with gender as the protected group. There are two genders provided in the datasets, i.e., $m=2$. We simulate fairness metrics as given in Definition~\ref{def:f-linmetric}, which give the ground-truth ranking of classifiers, and evaluate the ranking by the elicited (fair-quadratic) metric using the procedure described in Section 4 (also depicted in Figure~\ref{fig:fairness-workflow}). In Table~\ref{append:tab:rankingfpme}, we show the NDCG and KD-Tau values for our method and for three baselines: (a) `Linear with no fairness', which is the metric that comprises only the linear part of the oracle's true quadratic fair metric from Definition~\ref{def:f-linmetric} without the fairness violation, (b) `Accuracy with equalized odds', which is the metric that weighs all classification errors and fairness violations equally, and (c) the Fair Performance Metric Elicitation (FPME) procedure from~\citep{hiranandani2020fair}.\footnote{While FPME~\citep{hiranandani2020fair} does not elicit a quadratic metric, one can still compare the elicited metrics based on how they rank candidate classifiers on real-world data.} We again see that the ranking by the metric elicited using the proposed fair-QPME procedure (Section~\ref{sec:fairme}) is closest to the ground-truth ranking. The metric elicited by FPME~\citep{hiranandani2020fair} ranks classifiers better than `Linear with no fairness' and `Accuracy with equalized odds'; however, it is outperformed by the proposed fair-QPME procedure.

\begin{table}[h]
    \centering
    \begin{tabular}{|c|c|c|c|c|}
    \hline
    \textbf{Dataset}\;\;$\rightarrow$  & \multicolumn{2}{|c|}{\textbf{Adult}} &\multicolumn{2}{|c|}{\textbf{Default}} \\
    \hline
    \textbf{Method}$\;\;\downarrow$, \textbf{Ranking Measures}\;\;$\rightarrow$ & \textbf{NDCG} & \textbf{KD-TAU} & \textbf{NDCG} & \textbf{KD-TAU}\\
    \hline
         Linear with no fairness & 0.9875 & 0.5918 & 0.9994 & 0.9057 \\
         Accuracy with equalized odds & 0.9857 & 0.3763 & 0.9889 & 0.4953\\
         Elicited via FPME~\citep{hiranandani2020fair} & 0.9989  & 0.9611 & 0.9974 & 0.9650\\
         Elicited via Fair-QPME (Proposed) & \textbf{1.0000} & \textbf{0.9972} & \textbf{1.0000} & \textbf{0.9981}\\
         \hline
    \end{tabular}
    \caption{Performance of competing metrics while ranking real-world classifiers for fairness. `Linear with no fairness' is the metric that comprises only the linear part of the oracle's true quadratic fair metric from Definition~\ref{def:f-linmetric} without the fairness violation, `Accuracy with equalized odds' is the metric which weighs all classification errors and fairness violations equally (often used in practice), `Elicited via FPME~\citep{hiranandani2020fair}' is the metric elicited using the procedure from~\cite{hiranandani2020fair}, and `Elicited via Fair-QPME' is the metric elicited by the proposed (quadratic) fairness metric elicitation procedure from Section 4 (also depicted in Figure~\ref{fig:fairness-workflow}).}
    \label{append:tab:rankingfpme}
\end{table}


\textbf{Ranking in case of added structural assumptions on the metrics:} Lastly, we discuss an experiment where we show how one may make structural assumptions on the metric when the \emph{actual} number of unknowns is large and still get comparable results in practical settings. For this experiment, we assume that the oracle's true metric is quadratic in general rate entries as explained in Appendix~\ref{append:generalquad}. Thus, the number of unknowns is $O(q^2)$, where $q = k^2 - k$ is the number of off-diagonal entries of the rate matrix. We can apply the QPME procedure as is and elicit a quadratic metric in general rates with $O(q^2)$ queries (see Appendix~\ref{append:generalquad}), since there are $O(q^2)$ unknowns. 
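
To make these counts concrete, consider a rough parameter count. This is a sketch under two assumptions that are not restated in this appendix: the quadratic coefficient matrix is symmetric, and scale invariance removes one degree of freedom. The symbol $N(d)$ is introduced only for this illustration. A quadratic metric in $d$ rate entries then has
\begin{align*}
N(d) = d + \frac{d(d+1)}{2} - 1
\end{align*}
free parameters, so that $N(k) = O(k^2)$ for diagonal rates ($d = k$) and $N(q) = O(k^4)$ for general rates ($d = q = k^2 - k$). For instance, with $k = 3$ we get $N(3) = 8$ and $N(6) = 26$; the gap widens quickly as $k$ grows.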

Note that, even if the oracle's original metric is a quadratic metric in off-diagonal entries, as a heuristic, we could still use our procedure to elicit a quadratic metric in diagonal rate entries. Moreover, we can also use the LPME procedure (Appendix~\ref{append:sec:slme}) to elicit a linear metric in off-diagonal or diagonal rate entries, depending on the assumption we make on the metric.

Thus, for the ranking-based experiments explained in Figure~\ref{append:fig:ranking}, we additionally ran (a) linear elicitation with diagonal rates, (b) linear elicitation with general rates, and (c) quadratic elicitation with diagonal rates, and compared their rankings with that of the elicited quadratic metric in general rates.
As seen in Table~\ref{append:tab:apxranking}, the quadratic approximation in the diagonal rates performs significantly better than a linear approximation in the general rates, while requiring the same query complexity ($\tilde{O}(k^2)$), and is close to the elicited quadratic metric in general rates, which requires $\tilde{O}(k^4)$ queries. Hence, one can make structural assumptions on the metric to reduce the query complexity and still get comparable results in practice. 

\begin{table}[h]
    \centering
    \scriptsize{
    \setlength\tabcolsep{3pt}
    \begin{tabular}{|c|c|c|c|c|}
    \hline
    \textbf{Dataset}\;\;$\rightarrow$  & \multicolumn{2}{|c|}{\textbf{Adult dataset}} &\multicolumn{2}{|c|}{\textbf{Default dataset}} \\
    \hline
    \textbf{Elicited Metric}$\downarrow$, \textbf{Ranking Measure}$\rightarrow$ & \textbf{NDCG} & \textbf{KD-TAU} & \textbf{NDCG} & \textbf{KD-TAU}\\
    \hline
         Linear-diagonal ($\tilde{O}(k)$) & 0.9783 & 0.6053 & 0.9790 & 0.4536 \\
         \highlight{Linear-general ($\tilde{O}(k^2)$)} & \highlight{0.9908} & \highlight{0.7713} & \highlight{0.9863} & \highlight{0.6216}\\
         \highlight{Quadratic-diagonal ($\tilde{O}(k^2)$)} & \highlight{0.9968} & \highlight{0.9611} & \highlight{1.0000} & \highlight{0.9979}\\
         Quadratic-general ($\tilde{O}(k^4)$) & \textbf{1.0000} & \textbf{0.9986} & \textbf{1.0000} & \textbf{0.9979}\\
         \hline
    \end{tabular}
    }
    \vskip -0.2cm
    \caption{Performance on ranking real-world classifiers when the oracle's true metric is quadratic in general rate entries: the elicited \emph{quadratic metric in diagonal entries performs better than the elicited linear metric} in general rate entries (while requiring the same number of queries), and is close to the elicited quadratic metric in general rates.
    }
    \vskip -0.2cm
      \label{append:tab:apxranking}
\end{table}



\section{Extended Related Work}
\label{append:sec:relwork}

We discuss how the area of metric elicitation, in general, and our  quadratic elicitation proposal, in particular, differ from two related fields: (i) inverse reinforcement learning and (ii) ranking from pairwise comparisons using choice models. 

\subsection{Inverse reinforcement learning (RL)} 
The idea of learning a reward/cost function in inverse RL problems is conceptually similar to metric elicitation. However, there are many key differences. Studies such as~\citep{ng2000algorithms, wu2020efficient, abbeel2004apprenticeship} try to learn a linear reward function from either a known optimal policy or expert demonstrations. Not only is the type of feedback in these studies different from the pairwise feedback we handle, but the studies are also focused on linear rewards. In contrast, our goal is to use pairwise feedback to elicit quadratic metrics, which are important for classification problems, especially for fairness. Indeed, nonlinear reward estimation in inverse RL problems has been tackled before~\citep{levine2011nonlinear, fu2017learning}, but these are passive learning approaches and do not come with query complexity guarantees like ours. Because of the use of a complex function class, these methods are not easy to analyze. There has been some work on actively estimating the reward function in inverse RL problems~\citep{lopes2009active}; this work involves discretizing the feature space and using maximum-entropy-based ideas to elicit a distribution over rewards, which clearly uses a different set of modeling assumptions than ours. A recent work~\citep{sadigh2017active} elicits reward functions through active learning, but is again tied to eliciting linear functions and provides limited theoretical guarantees, whereas we specifically focus on quadratic elicitation with rigorous guarantees.

In summary, our work is significantly different from inverse RL methods, in that unlike them, we are tied to a particular geometry of the query space (the space of error statistics achieved by feasible classifiers), and elicit quadratic (or polynomial) functions from pairwise comparisons, specifically, in an active learning manner. 

\subsection{Ranking from pairwise comparisons} 
Our work is also quite different from the use of choice models such as the Bradley-Terry-Luce (BTL) model for rank aggregation. (i) Firstly, choice models such as BTL are commonly used to learn an aggregate global ranking of a finite set of $N$ items from pairwise comparisons. The underlying problem involves estimating an $N$-dimensional quality score vector for the items~\citep{shah2015estimation}. In contrast, metric elicitation estimates an oracle's classification metric, a function of a classifier's error statistics. The applications for the two problems are very different: while rank aggregation strategies using BTL are often prescribed for aggregating user opinions on a restaurant or product, metric elicitation seeks to find the right objective to optimize for a classification task. 
(ii) Secondly, the noise model in  BTL  is stochastic and depends on the distance between the quality of items; whereas, our noise model in Definition~\ref{def:noise} is not stochastic and is oblivious to the distance of rates unless the rates are very close.
(iii) Thirdly, while there is some work on the extended BTL model where items are represented by feature vectors~\citep{niranjan2017inductive}, and the goal is to learn weights on the features to complete the ranking of the items, most of the work in this area considers a passive setting, where pairwise comparisons are assumed to be iid. In contrast, our work involves actively learning nonlinear utilities with theoretical bounds on query complexity.
(iv) Lastly, the closest active learning work we could find with BTL models \citep{mohajer2017active} does not generalize to feature-dependent utilities and is proposed for finding the top-$k$ items, which is entirely different from metric elicitation.
% \end{appendices}
~\\[-5pt]

Other fields that are less closely related to our work include learning scoring functions for supervised label ranking problems \citep{furnkranz2010preference}, and the more traditional metric learning literature, where the task is to learn a distance metric that captures similarities between data points, with the goal of using it for downstream learning tasks \citep{kulis2013metric}.

\section{Preliminary User Study}
\label{append:userstudy}

We are actively conducting user studies for eliciting performance metrics. In this section, we provide a peek into our future work. The goal of this preliminary study is to check the workflow of a practical implementation of the metric elicitation framework with real data and, to a certain extent, to support or reject the hypothesis that implicit user preferences can be quantified using pairwise comparison queries over confusion matrices or predictive rates. Further goals include testing certain assumptions regarding the noise in the subject's (oracle's) responses, working with finite samples, eliciting actual performance metrics in real-life scenarios, and evaluating the quality of the recovered metric. 

The following user study works with the space of confusion matrices, i.e., entries of the type $\Pmbb(Y=i, h=j)$ for $i,j\in[k]$, instead of the predictive rates. In the future, we plan on incorporating rates, i.e., entries of the type $\Pmbb(h=j|Y=i)$ for $i,j\in[k]$, in the visualizations as well. Our contributions are summarized as follows:

\begin{itemize}
    \item We create a web UI that uses existing visualizations of confusion matrices (predictive rates) that are refined to capture preferences over pairwise comparisons. 
    \item The UI implements the binary-search procedure from Algorithm~\ref{alg:slme} at the back end, which makes use of the real-time responses over confusion matrices to elicit a linear performance metric for a binary classification task. 
    \item We perform a user study with ten subjects and elicit their linear performance metrics using the proposed web UI. We evaluate the quality of each recovered metric by comparing the subject's responses to the elicited metric's responses over a set of randomly chosen pairwise comparison queries. 
\end{itemize}

\subsection{Choice of Task and Dataset Used}
\label{pme-ssec:dataset}

Our choice of task is \emph{cancer diagnosis}~\citep{yang2014multiclass}, for which we use the Breast Cancer Wisconsin (Original) dataset from the UCI repository.\footnote{The dataset can be downloaded from https://tinyurl.com/dn2esyvw.} The dataset has been extensively used in the literature for binary classification, where the label $1$ denotes \emph{malignant} cancer and label $0$ denotes \emph{benign} cancer. There are 699 samples in total, each with 9 features. The task for any classifier is to take the 9 features of a patient as input and predict whether or not the patient has cancer. We divide this data into two equally sized parts -- the training and the test data. Using the training data, we learn a logistic regression model to obtain an estimate of the class-conditional probability, i.e., $\hat\eta(x) = \hat\Pmbb(Y=1 \mid X=x)$. We then create a sphere using Algorithm~\ref{alg:sphere} inside the space of confusion matrices computed on the test data. 

\subsection{Choice of Visualization}
\label{pme-ssec:vis}

Ensuring effective public understanding of algorithmic decisions, especially those of machine learning models, has become an imperative task. With this view in mind, we borrow the visualizations of confusion matrices for the binary classification setup from~\cite{shen2020designing}. The authors take a concrete step towards the above goal by redesigning confusion matrices to support non-experts in understanding the performance of machine learning models. The final visualizations that we use from~\cite{shen2020designing} were created over multiple iterative user studies. The visualizations are shown in Figure~\ref{pme-fig:vis-prior} in the context of a recidivism prediction task. One is the \emph{flow-chart}, which helps users understand the direction of the data, and the other is the \emph{bar chart}, which helps users understand the quantities involved.

\begin{figure}[t]
    \centering
    \includegraphics[scale=0.5]{plots/vis-prior.PNG}
    \caption{Flow-chart and bar-chart based visualizations for (binary classification) confusion matrices in the recidivism prediction task from~\cite{shen2020designing}.}
    \label{pme-fig:vis-prior}
    \vspace{-0.5cm}
\end{figure}

However, in light of our preliminary discussions with Human-Computer Interaction (HCI) and machine learning researchers, we make/recommend the following changes in the visualization for pairwise comparison purposes in the metric elicitation framework.

\begin{enumerate}
    \item Based on the observation that multiple visualizations of the information help in better user understanding, we choose to use both the \emph{flow-chart} and the \emph{bar-chart} together to depict a confusion matrix. 
    \item We transform the data statistics so that the numbers denote out-of-100 samples. 
    \item We found that the total number of positive and negative labels along with total number of positive and negative predictions are very helpful in comparing two confusion matrices. Therefore, we add the total numbers in the flow-chart boxes and on axes in the bar-charts.
\end{enumerate}
% Our modified visualization incorporating the points above for a confusion matrix in the context of cancer diagnosis is shown in Figure~\ref{pme-fig:me}. 
A sample of a pairwise comparison query with modified visualizations incorporating the points above is shown in Figure~\ref{pme-fig:me}. We next discuss the web user interface. 

\subsection{User Interface}
\label{ssec:ui}

We discuss our proposed web User Interface (UI) in detail, along with the rationale behind its components. The UI has two phases, as explained below.

The first phase of the UI is where we actually ask subjects for pairwise preferences over confusion matrices, and implement our binary-search procedure from Algorithm~\ref{alg:slme}. The subjects have to make a choice reflecting on the trade-off between false positives and false negatives. Algorithm~\ref{alg:slme} takes in the real-time preferences of the subjects, generates the next set of queries based on the current responses, and converges to a linear performance metric at the back end. We save this (linear) performance metric for each subject. We stop the binary search when the search interval becomes less than or equal to 0.05 ($\epsilon$ in line 1 of Algorithm~\ref{alg:slme}).


\begin{figure}
    \centering
    \includegraphics[scale=0.5]{plots/boots_index1.PNG}
    \includegraphics[scale=0.5]{plots/boots_index2.PNG}
    \caption{A sample of a pairwise comparison query from a run of the binary-search based procedure Algorithm~\ref{alg:slme}}
    \label{pme-fig:me}
\end{figure}

In order to evaluate the quality of the recovered metric, in the second phase, we ask the subjects fifteen pairwise comparison queries, each on a separate web page, right after the binary search algorithm has converged and we have elicited the metric. The subjects are not informed of this and are shown the evaluation queries as a continuation of the previous phase (i.e., the binary search). Each query comprises two randomly selected confusion matrices that lie inside the feasible region. This set of queries is used to evaluate the effectiveness of the elicited metric. 

\subsection{Study Results}
\label{append:ssec:studyres}
We compute the percentage of the fifteen evaluation queries on which our elicited metric's preference matches the subject's preference, i.e.,
\begin{equation}
\Mcal \coloneqq \frac{\sum_{i=1}^{15} \1[\text{subject's preference for query } i = \text{metric's preference for query } i]}{15} \times 100.
    \label{pme-eq:fraction}
\end{equation}
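As a worked illustration of~\eqref{pme-eq:fraction}, suppose the elicited metric agrees with a subject on 13 of the 15 evaluation queries (a count chosen here only for the arithmetic). Then
\[
\Mcal = \frac{13}{15} \times 100 \approx 86.7 .
\]
Each query contributes $100/15 \approx 6.67$ percentage points, so $\Mcal$ can take only sixteen distinct values. Assuming the entries of Table~\ref{pme-tab:metrics} are rounded to the nearest integer, the reported values 73, 87, 93, and 100 would correspond to 11, 13, 14, and 15 agreements, respectively.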

We show the elicited metrics for the ten subjects and the corresponding values of $\Mcal$ in Table~\ref{pme-tab:metrics}. For nine out of ten subjects, our elicited metric's preferences match the subject's preferences on more than 85\% of the fifteen evaluation queries. For three subjects, our metric's preferences match the subject's preferences on all the evaluation queries.

Although these values of $\Mcal$ are high in absolute terms, this study lacks a baseline against which to judge them. In future work, we plan to develop a baseline for the metric elicitation task and compare against it on the measure $\Mcal$.

\begin{table}[h]
    \centering
    \caption{The elicited linear performance metrics for the ten subjects, along with the percentage of the fifteen evaluation queries on which the elicited metric's preference matches the subject's preference.}
    \begin{tabular}{|c|c|c|}
    \hline
    \textbf{Subject} & \textbf{Linear Performance Metric} & $\Mcal$ (\%) \\
    \hline
         S1 & 0.125 \text{TN} + 0.875 \text{TP}  & 87\\
         S2 & 0.141 \text{TN} + 0.859 \text{TP}  & 100\\
         S3 & 0.125 \text{TN} + 0.875 \text{TP}  & 93\\
         S4 & 0.141 \text{TN} + 0.859 \text{TP}  & 100\\
         S5 & 0.328 \text{TN} + 0.672 \text{TP}  & 73\\
         S6 & 0.031 \text{TN} + 0.969 \text{TP}  & 87\\
         S7 & 0.031 \text{TN} + 0.969 \text{TP}  & 100\\
         S8 & 0.359 \text{TN} + 0.641 \text{TP}  & 87\\
         S9 & 0.125 \text{TN} + 0.875 \text{TP}  & 93\\
         S10 & 0.141 \text{TN} + 0.859 \text{TP}  & 87\\
    \hline
    \end{tabular}
    \label{pme-tab:metrics}
\end{table}
