\section{EXPERIMENTAL RESULTS}
In this section, we present simulations and experiments on real data to validate our theoretical results. Table \ref{tab:kernels} lists the kernel functions and kernel parameters used. In all simulations and experiments, we use CVXPY \citep{diamond2016cvxpy, agrawal2018rewriting} and MOSEK \citep{mosek} to solve the convex program (\ref{opt-P4}), with the nuclear norm constraint on $\Mb$.
\begin{table}[]
\centering
\begin{tabular}{@{}lll@{}}
\toprule
Kernel     & Formula                                          & Parameter            \\ \midrule
Linear     & $k(x, y) = x^\top y$                             & N/A                  \\
Gaussian   & $k(x, y) = e^{\frac{-\|x - y\|^2_2}{2\sigma^2}}$ & $\sigma$             \\
Sigmoid    & $k(x, y) = \tanh{(c + \alpha x^\top y )}$        & $c, \alpha$          \\
Polynomial & $k(x, y) = (c + x^\top y)^p$     & $c, p$ \\
Laplacian  & $k(x, y) = e^{-\alpha \|x - y\|_1}$              & $\alpha$             \\ \bottomrule
\end{tabular}
\caption{List of kernel functions and parameters used in our simulations and experiments.}
\label{tab:kernels}
\end{table}

Applying these kernel functions to large datasets requires attention to the cost of kernel principal component analysis (KPCA), which is $O(n^3)$, where $n$ is the number of items used in queries. To reduce this cost, one can adopt a low-rank approximation of the Gram matrix, such as the Nyström method \citep{reinhardt2012analysis, williams1998prediction}, which randomly samples $m \ll n$ of the $n$ items. The Nyström KPCA method \citep{williams2000using} has a complexity of $O(nm^2)$. Another approach, the randomly pivoted Cholesky algorithm \citep{chen2025randomly}, requires only $O(kn)$ kernel evaluations and $O(k^2n)$ additional arithmetic operations for a rank-$k$ approximation. In our work, we use Nyström KPCA \citep{williams2000using} with $m=500$ to approximate the Gram matrix efficiently.
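
As a rough illustration of the Nyström step, in notation that is ours and not used elsewhere in the paper, let $K \in \mathbb{R}^{n \times n}$ denote the Gram matrix, let $K_{nm} \in \mathbb{R}^{n \times m}$ collect the columns of $K$ indexed by the $m$ sampled items, and let $K_{mm} \in \mathbb{R}^{m \times m}$ be the block of $K$ restricted to those items. The standard Nyström approximation is then
\[
    K \approx \tilde{K} = K_{nm} K_{mm}^{\dagger} K_{nm}^\top ,
\]
where $K_{mm}^{\dagger}$ is the pseudoinverse. Forming $K_{nm}$ takes $nm$ kernel evaluations, the eigendecomposition of $K_{mm}$ costs $O(m^3)$, and extending its eigenvectors to all $n$ items costs $O(nm^2)$, which dominates when $m \ll n$.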
 
\subsection{Simulations}\label{sec:simulations}
\textbf{Generating Noisy Labels for a Known Distance Function:} We assume an explicit link function $f(\cdot)$ that generates a noisy label for each triplet: $y_t=-1$ with probability $p_t$, and $y_t=1$ otherwise, as a noisy indication of $\text{sign}(d_L^2 (\xb_h, \xb_i)-d_L^2 (\xb_h, \xb_j))$, where
\[
    p_t=f\left(d_L^2 (\xb_h, \xb_i)-d_L^2 (\xb_h, \xb_j)\right).
\]
We use $f(x) = {1}/{(1 + e^{\rho x})}$ as the link function, where the parameter $\rho$ controls the noise level.
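
To illustrate how $\rho$ sets the noise level, consider a hypothetical triplet with $d_L^2 (\xb_h, \xb_i)-d_L^2 (\xb_h, \xb_j) = -0.1$ (a value chosen only for illustration) and $\rho = 30$. Then
\[
    p_t = f(-0.1) = \frac{1}{1 + e^{30 \cdot (-0.1)}} = \frac{1}{1 + e^{-3}} \approx 0.953,
\]
so the label agrees with the sign of the distance difference with probability about $0.953$ and is flipped with probability about $0.047$. When the two distances are equal, $f(0) = 1/2$ and the label is a fair coin flip; larger values of $\rho$ push $p_t$ toward $0$ or $1$ and hence reduce the noise.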
\begin{figure}
    \centering
    \includegraphics[width=0.8\linewidth]{figures/2dspiral.png}
    \caption{A 2D spiral. We sample triplets uniformly along this curve. The geodesic distance between points $A$ and $B$ is the length of the green curve, whereas the Euclidean distance between the two points is the length of the red line.}
    \label{fig:2dspiral}
\end{figure}
We first consider a spiral shape in 2D. 

\textbf{Spiral with Geodesic Distance:}
We generate triplets uniformly along the spiral. We assume the true distance function is the geodesic distance along the 2D curve (see Figure \ref{fig:2dspiral}). We report train and test accuracy for different kernel functions with varying numbers of triplets. Figure \ref{fig:spiral} shows that the polynomial, Gaussian, and Laplacian kernels outperform the linear and sigmoid kernels. We defer the details of the simulation setting to the Appendix.

\begin{figure}[h]
    \centering
    \includegraphics[width=\linewidth]{figures/spiral.png}
    \caption{Performance of various kernels in the 2D spiral setting. For the Gaussian kernel, we use $\sigma=2$; for the sigmoid kernel, $c=1, \alpha=1$; for the polynomial kernel, $c=1, p=2$; and for the Laplacian kernel, $\alpha=1$. For the link function $f$, we use $\rho=30$ to set the noise level around 0.01. We repeat each run 50 times.}
    \label{fig:spiral}
\end{figure}
Next, we assume access to a feature map $\phi$ such that $\langle \phi(\bx_i), \phi(\bx_j)\rangle=k(\bx_i, \bx_j)$ for a Gaussian kernel function $k: \mathbb{R}^d\times \mathbb{R}^d \rightarrow \mathbb{R}$ with $\sigma=1$.

\textbf{Gaussian Kernel Map:}
We assume there exists a linear operator $L^*:\mathcal{H} \rightarrow \mathcal{H}$ that lies on an $r$-dimensional manifold. In Figure \ref{fig:simulation}, we provide our results with a Gaussian kernel for $r=2$. We defer the details of data generation and the simulation setting, along with more extensive results, to the Appendix.

\begin{figure}[h]
    \centering
    \includegraphics[width=\linewidth]{figures/simulation-r=2.png}
    \caption{Train and test accuracy with the Gaussian kernel ($\sigma=1$). For the link function $f$, we use $\rho=1000$ to set the noise level around 0.05. We repeat each run 50 times.}
    \label{fig:simulation}
\end{figure}
In Figure \ref{fig:simulation}, the test accuracy increases with the number of training triplets. The gap between train and test accuracy also narrows as the number of triplets increases, consistent with Theorems \ref{thm:generalization_error_withbounded_Fro_norm} and \ref{thm:generalization_error_withbounded_Nuclear_norm}, according to which the excess risk decreases with more triplets.

\subsection{Empirical Evaluation: Food-100 Dataset}

The Food-100 dataset \citep{wilber2014cost} consists of 100 food items and approximately 190,000 triplets based on human responses (see Figure \ref{fig:triplets} for example images from the dataset). We split this dataset by items so that some items in the test and validation sets are never seen during training. See the Appendix for details of the split. We obtain an embedding for each item in the Food-100 dataset from the antepenultimate layer of AlexNet \citep{krizhevsky2012imagenet}, pretrained on ImageNet \citep{deng2009imagenet}, and then project the embeddings to a 2D space using PaCMAP \citep{JMLR:v22:20-1061}. Figure \ref{fig:experiment} shows the performance of different kernels, among which the Gaussian kernel performs best.

\begin{figure}[h]
    \centering
    \includegraphics[width=\linewidth]{figures/experiment.png}
    \caption{Performance of various kernels on the Food-100 dataset. For the Gaussian kernel, we use $\sigma=2$; for the sigmoid kernel, $c=1, \alpha=0.01$; for the polynomial kernel, $c=1, p=2$; and for the Laplacian kernel, $\alpha=1$. We repeat the validation 20 times.}
    \label{fig:experiment}
\end{figure}
Theorems \ref{thm:generalization_error_withbounded_Fro_norm} and \ref{thm:generalization_error_withbounded_Nuclear_norm} provide bounds for excess risk. Our analysis therefore allows us to bound the difference between the true risk and the empirical risk for any kernel choice. Experiments with different kernels show that train and test accuracies are close, indicating that the empirical risk approximates the true risk well. The choice of kernel affects the true risk achievable by the learned metric, which is reflected in the differing test accuracies across kernels. Since the true risk is unknown, cross-validation is an appropriate method for selecting the kernel for the dataset at hand.
