% \vspace{-0.3cm}
\section{Experiments}
\label{sec:experiments}
% \vskip -0.1cm

We evaluate our approach on simulated oracles. Here we present results on a synthetically generated query space; results on real-world datasets are included in Appendix~\ref{append:ssec:ranking}.
% We run our elicitation procedures with tolerance $\epsilon = 10^{-2}$.

\textbf{Eliciting quadratic metrics.}\ 
We first apply QPME (Algorithm~1) to elicit the quadratic metrics in Definition~\ref{def:quadmet}. As in~\cite{hiranandani2020fair}, we assume access to a $k$-dimensional sphere $\Scal$ centered at rate $\ombf$ with radius $\rho = 0.2$, from which we query rate vectors $\rmbf$. The trends discussed below are robust to the choice of the sphere radius $\rho$.
Recall that in practice,
%(under a mild distributional assumption)
Remark \ref{rem:sphere} guarantees the existence of such a sphere within the feasible region $\Rcal$. We  randomly generate quadratic metrics $\phi^\quadr$ parametrized by $(\ambf, \Bmbf)$ and repeat the experiment over 100 trials for varying  numbers of classes $k \in \{2,3,4,5\}$. 
We run the QPME procedure with tolerance $\epsilon = 10^{-2}$. In Figures~\ref{fig:q_rec_a}--\ref{fig:q_rec_B}, we show box plots %~\cite{mcgill1978variations} 
of the $\ell_2$ (Frobenius) norm of the difference between the true and elicited linear (quadratic) coefficients. 
We  generally find that QPME is able to elicit metrics  close to the true ones.
% even for small $\epsilon = 10^{-3}$. 
This holds for varying $k$, showing the effectiveness of our approach in handling multiple classes. 
The average number of queries needed for elicitation over the 100 trials is provided in Table~\ref{tab:numqueries} in Appendix~\ref{append:sec:extexp}. Note that the number of queries is $\tilde O(d)$ for eliciting a quadratic metric with $d = k^2$ unknowns,
which matches the lower bound in Theorem~\ref{thm:lb}. See Appendix~\ref{append:practicality} for a discussion of the practicality of posing the requisite number of queries.
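To give a sense of the scale involved, and taking the count $d = k^2$ stated above at face value, the numbers of unknowns for the values of $k$ used in these experiments are
\[
k = 2,\, 3,\, 4,\, 5 \quad \Longrightarrow \quad d = k^2 = 4,\, 9,\, 16,\, 25 .
\]
Since the $\tilde O(\cdot)$ notation hides logarithmic factors, these values indicate only how the query complexity scales with $k$; they are not the measured query counts, which are those reported in Table~\ref{tab:numqueries}.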

\textbf{Eliciting fairness metrics.}\
We next apply the elicitation procedure in Figure~\ref{fig:fairness-workflow} with tolerance $\epsilon= 10^{-2}$ to elicit the fairness metrics in Definition~\ref{def:f-linmetric}. We randomly generate oracle metrics $\phi^\fair$ parametrized by $(\ambf, \Bmbb, \lambda)$ and repeat the experiment over 100 trials for varying numbers of classes and groups $k, m \in  \{2,3,4,5\}$. Figures~\ref{fig:f_rec_a}--\ref{fig:f_rec_l} show the mean elicitation errors for the three parameters. For the linear predictive performance term, the error {\small$\Vert \ambf - \ambfhat\Vert_2$} increases with the number of classes $k$ but not with the number of groups $m$, as this term is independent of the number of groups. For the quadratic violation term, the error {\small$\sum_{u,v}\Vert \Bmbf^{uv} - \Bmbfhat^{uv} \Vert_F$} increases with both $k$ and $m$. This is because the QPME procedure is run {\small$m\choose 2$} times to elicit the {\small $m \choose 2$} matrices {\small $\{\Bmbf^{uv}\}_{v > u}$}, and so the elicitation error accumulates with increasing $m$. Lastly, the elicited trade-off {\small $\hat \lambda$} is seen to be close to the true $\lambda$ as well. 
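As a simple worked count of how the number of QPME runs grows with the number of groups considered here,
\[
{m \choose 2} = \frac{m(m-1)}{2} = 1,\, 3,\, 6,\, 10 \quad \mbox{for} \quad m = 2,\, 3,\, 4,\, 5 ,
\]
so at $m = 5$ the summed error aggregates the errors of ten separately elicited matrices, compared with a single matrix at $m = 2$.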

\textbf{Real-world datasets.}\ In Appendix~\ref{append:ssec:ranking}, we evaluate how well the metric elicited by QPME ranks a set of candidate classifiers trained on real-world datasets. %The results are shown in Appendix~\ref{append:sec:ranking}. 
We find that despite incurring elicitation errors, QPME %as discussed above,
achieves near-perfect ranking, 
% of classifiers; 
whereas the baseline metrics fail to do so. 
% achieve good rankings.
% (see Appendix~\ref{append:sec:ranking} for details).