\section{Additional Simulations and Experimental Details}
In the practical implementation of the kernelized metric learning problem, our goal is to solve the convex program (\ref{opt-P4}), which yields a finite metric $\widehat{\Mb}$. 

\textbf{Unseen Triplets: }To evaluate the performance of $\widehat{\Mb}$ on unseen triplets, we first compute $\varphi_{n+1}=\Ab^T[k(\xb_{n+1}, \xb_1), \ldots, k(\xb_{n+1}, \xb_{n})]^T$ for each new point $\xb_{n+1}$ in the test set using the kernel function $k(x,y)$, where $\Ab$ is obtained from the KPCA procedure (see Section \ref{sec:KPCA}). This corresponds to projecting the new points onto the span of $\phi_1, \ldots, \phi_n$. We then estimate the label of an unseen triplet using the new (finite) representations $\varphi_{n+1}$ and $\widehat{\Mb}$.
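As an illustrative sketch of this last step, assume that the learned metric acts as a quadratic form on the finite representations and that a triplet $(h,i,j)$ is labeled by the sign of a difference of distances. The symbols $\widehat{d}$ and $\widehat{y}$ are introduced here only for illustration, and the sign convention is an assumption that should match the one used for the training labels:
\begin{equation*}
    \widehat{d}(a,b)=(\varphi_a-\varphi_b)^T\widehat{\Mb}(\varphi_a-\varphi_b), \qquad \widehat{y}_{(h,i,j)}=\mathrm{sign}\big(\widehat{d}(h,i)-\widehat{d}(h,j)\big).
\end{equation*}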

\textbf{Computing Infrastructure: }Our code is designed to run on a personal laptop. The experiments and simulations reported in this paper were conducted on a MacBook Pro with an M3 Max chip and 48GB of RAM. 

We will open-source our code for reproducibility upon acceptance of this work.
\subsection{Spiral with Geodesic Distance}
We present the performance of different kernels in Figure \ref{fig:2dspiral} for the task of metric learning on a 2D spiral, where the true distance is the geodesic distance. Table \ref{tab:kernels} shows the kernel functions used for this task, whose parameters are set as follows: $\sigma=1$, $c=1$, $\alpha=1$, $p=2$. 
\subsection{Gaussian Kernel Map}
\textbf{Preliminary: }We want to generate a linear functional $L^*:\mathcal{H} \rightarrow \mathcal{H}$ that lies on an $r$-dimensional manifold. First, note that the Riesz representation theorem allows us to represent the linear functional $L^*$ as follows:
\begin{eqnarray*}   L^*\phi=\sum_{k=1}^\infty\langle \phi, \tau_k\rangle_\mathcal{H} \mathbf{e}_k.
\end{eqnarray*}
Given that $L^*$ lies on an $r$-dimensional manifold, each $\tau_k$ can be written as $\sum_{j=1}^r v_{k,j}\psi_j$, where $\{\psi_1, \ldots, \psi_r\}$ is a set of features that span an $r$-dimensional manifold. Therefore, for any $\phi_i, \phi_j$,
\begin{eqnarray}
    \langle L^*\phi_i, L^*\phi_j\rangle_\mathcal{H} &=& \sum_{k=1}^\infty\langle \phi_i, \tau_k \rangle_\mathcal{H} \langle \phi_j, \tau_k \rangle_\mathcal{H} \nonumber
    \\ &=& \sum_{a=1}^r\sum_{b=1}^r \left(\sum_{k=1}^\infty v_{k,a}v_{k,b}\right)\langle \phi_i, \psi_a \rangle_\mathcal{H} \langle \phi_j, \psi_b \rangle_\mathcal{H}, 
    \label{Riesz_representationapp}
\end{eqnarray}
where $\Gb_{a,b}=\sum_{k=1}^\infty v_{k,a}v_{k,b}$. Each entry of $\Gb$ is an inner product in $\ell_2$, so $\Gb$ is a Gram matrix and hence positive semidefinite (psd). Our goal is to sample a set of features in $\mathcal{H}$ that spans an $r_0$-dimensional manifold, where $r_0=\max{(r)}$ is the largest value of $r$ considered, and to generate a random psd matrix $\Gb$ that defines $L^*$. Inspired by the simulation setup of \cite{mason2017learning} for the linear metric learning problem, we define $\Gb=\frac{r_0}{\sqrt{r}}\Ub\Ub^T$, where $\Ub\in \mathbb{R}^{r_0\times r}$ is a random matrix with orthonormal columns; this scaling keeps the average magnitude of the entries constant, independent of $r$ and $r_0$. This procedure provides a linear functional $L^*$ lying on an $r$-dimensional manifold.  
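
The effect of this scaling can be checked directly. As a short sketch, assume that $\Ub$ has orthonormal columns, i.e., $\Ub^T\Ub=I_r$, and let $w\in\mathbb{R}^{r_0}$ be an arbitrary vector (a symbol introduced here only for this check). Then
\begin{eqnarray*}
    \|\Gb\|_F^2 &=& \frac{r_0^2}{r}\,\mathrm{tr}\left(\Ub\Ub^T\Ub\Ub^T\right) = \frac{r_0^2}{r}\,\mathrm{tr}\left(\Ub^T\Ub\right) = r_0^2, \\
    w^T\Gb w &=& \frac{r_0}{\sqrt{r}}\,\|\Ub^T w\|_2^2 \;\geq\; 0.
\end{eqnarray*}
Under this assumption, the mean squared entry of $\Gb$ is $\|\Gb\|_F^2/r_0^2=1$ for every $r$ and $r_0$, and $\Gb$ is psd with rank $r$.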

\textbf{Linear Functional $L^*$: }We sample a set $\{z_1, \ldots, z_r\}$, where each $z_i \sim \mathcal{N}(\textbf{0}_d, \frac{1}{d} I_d)$. Then, we consider a kernel map $\phi(\cdot)$ such that $\langle \phi(z_i), \phi(z_j) \rangle_\mathcal{H}=k(z_i, z_j)$. We generate the corresponding features using this kernel map, so that the set $\{\phi(z_1), \ldots, \phi(z_r)\}$ spans an $r$-dimensional manifold in $\mathcal{H}$, and set $\psi_i=\phi(z_i)$. We also generate a random psd matrix $\Gb_{r\times r}$. This gives an explicit formula for $L^*$ based on (\ref{Riesz_representationapp}), and we can express the inner product $\langle L^*\phi_i, L^*\phi_j\rangle_\mathcal{H}$ in terms of known quantities: 
\begin{eqnarray}
     \langle L^*\phi_i, L^*\phi_j\rangle_\mathcal{H} = [k(\bx_i, z_1), \ldots, k(\bx_i, z_r)]\Gb [k(\bx_j, z_1), \ldots, k(\bx_j, z_r)]^T, \label{inner product_app}
\end{eqnarray}
where $\langle \phi_i, \psi_a\rangle_\mathcal{H}=k(\bx_i, z_a)$  and $\phi_i=\phi(\bx_i)$. We can easily find the difference of distances for triplet comparisons based on (\ref{inner product_app}), since we have 
\[\|L^*\phi_h-L^*\phi_i\|_\mathcal{H}^2=\langle L^*\phi_h,L^*\phi_h\rangle_\mathcal{H}-2\langle L^*\phi_h,L^*\phi_i\rangle_\mathcal{H}+\langle L^*\phi_i,L^*\phi_i\rangle_\mathcal{H}.\]
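To make the triplet computation explicit, the following sketch uses two shorthands that are introduced here only for illustration: $\kappa_i=[k(\bx_i, z_1), \ldots, k(\bx_i, z_r)]^T$ and $D_{h,i}$ for the squared distance on the left-hand side above. Substituting (\ref{inner product_app}) and using the symmetry of $\Gb$ gives
\begin{eqnarray*}
    D_{h,i} &=& (\kappa_h-\kappa_i)^T\Gb(\kappa_h-\kappa_i), \\
    D_{h,i}-D_{h,j} &=& \kappa_i^T\Gb\kappa_i-\kappa_j^T\Gb\kappa_j-2\kappa_h^T\Gb(\kappa_i-\kappa_j).
\end{eqnarray*}
The term $\kappa_h^T\Gb\kappa_h$ cancels in the difference, so only kernel evaluations between the data points and $z_1, \ldots, z_r$ are needed.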

\textbf{Triplet Generation: }We randomly sample triplets $\{\bx_h,\bx_i,\bx_j\}$, where each point is drawn from $\mathcal{N}(\textbf{0}_d, \frac{1}{d}I_d)$. Then, we numerically compute the difference of distances using (\ref{inner product_app}) and generate noisy answers for the triplets with a link function, as described in Section \ref{sec:simulations}.   

\textbf{Accuracy: }We generate another set of random triplets and numerically compute the true label of each triplet using $L^*$. Finally, we compare the true labels with the estimated labels to compute the accuracy.
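In symbols, writing $\mathcal{T}_\text{eval}$ for this evaluation set, $y_t$ for the true label of triplet $t$, and $\widehat{y}_t$ for its estimated label (all three symbols are notation assumed here for illustration), the accuracy is the fraction of triplets whose labels agree:
\begin{equation*}
    \mathrm{Acc}=\frac{1}{|\mathcal{T}_\text{eval}|}\sum_{t\in\mathcal{T}_\text{eval}}\mathbf{1}\{\widehat{y}_t=y_t\}.
\end{equation*}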

Below, we provide more extensive simulations with a Gaussian kernel, where $\sigma=1$.
\begin{figure}[H]
    \centering
    \includegraphics[width=0.5\linewidth]{figures/noiseless_r2to10.png}
    \caption{Train and test accuracy in the noiseless setting with 50 repetitions for each run. We fix the number of triplets to 5000.}
    \label{fig:noiseless_r2to10}
\end{figure}
From Figure \ref{fig:noiseless_r2to10}, we observe that, for a fixed number of triplets, the achievable accuracy decreases as the rank $r$ increases, as captured by our analysis, where $L^*$ lies on an $r$-dimensional manifold. The task of learning a kernelized metric becomes more complex as $r$ increases.
\begin{figure}[H]
\centering
\begin{minipage}{.5\textwidth}
  \centering
  \includegraphics[width=1\linewidth]{figures/r2noiseless.png}
  %\caption{Train and test accuracy for noiseless setting with 50 repetitions and and $r=2$.}
  %\label{fig:r2noiseless}
\end{minipage}%
\begin{minipage}{.5\textwidth}
  \centering
  \includegraphics[width=1\linewidth]{figures/r10noiseless.png}
  %\caption{Train and test accuracy for noiseless setting with 50 repetitions and and $r=10$.}
\end{minipage}
\caption{Train and test accuracy in the noiseless setting with 50 repetitions for a varying number of triplets (100, 500, 1000, 2500, 5000, 10000), where $r=2$ (left) and $r=10$ (right).}
\label{fig:r10noiseless}
\end{figure}
Figure \ref{fig:r10noiseless} shows that test accuracy increases as the triplet set gets larger, i.e., the learned metric generalizes better. For example, to obtain the same accuracy of $70\%$, $\sim 1000$ triplets are sufficient when the rank is 2, whereas $\sim 5000$ triplets are needed when the rank is 10.

Next, we provide simulation results with noisy responses. From Figure \ref{fig:noisy_r2to10}, we observe that accuracy remains lower for larger values of $r$ even with a significant amount of noise in the responses. Finally, Figure \ref{fig:r10noise10} shows the accuracy for varying numbers of triplets at noise levels of $5\%$ and $10\%$.

\begin{figure}[H]
    \centering
    \includegraphics[width=0.5\linewidth]{figures/r2to10noise5_10k.png} 
    \caption{Train and test accuracy in the noisy setting with 50 repetitions for each run. We fix the number of triplets to 10000, and the ratio of noisy responses is approximately $5\%$.}
    \label{fig:noisy_r2to10}
\end{figure}


\begin{figure}[h]
\centering
\begin{minipage}{.5\textwidth}
  \centering
  \includegraphics[width=1\linewidth]{figures/r10_5noise.png}
  %\caption{Train and test accuracy for noiseless setting with 50 repetitions and and $r=2$.}
  %\label{fig:r10noise5}
\end{minipage}%
\begin{minipage}{.5\textwidth}
  \centering
  \includegraphics[width=1\linewidth]{figures/r10_10noise.png}
  %\caption{Train and test accuracy for noiseless setting with 50 repetitions and and $r=10$.}
\end{minipage}
\caption{Train and test accuracy in the noisy setting with $r=10$ and 20 repetitions for a varying number of triplets, where the ratio of noisy responses is approximately $5\%$ (left) and $10\%$ (right).}
  \label{fig:r10noise10}
\end{figure}

\subsection{Empirical Evaluation: Food-100 Dataset}
We briefly describe the Food-100 dataset; more details can be found in the work of \cite{wilber2014cost}. The Food-100 dataset consists of 100 carefully selected food items, where each image contains a single food. Answers to 190,376 triplets were collected from Amazon Mechanical Turk workers. Let $\mathcal{T}$ be the set of all triplets.  

For each iteration, we randomly select 20 items and call them $\mathcal{X}_\text{unseen}$. Then, we define a triplet set $\mathcal{T}_\text{unseen}$ from $\mathcal{X}_\text{unseen}$ as follows:
\begin{equation*}
    \mathcal{T}_\text{unseen} := \{ \{x_h, x_i, x_j\} \in \mathcal{T} : x_h \in \mathcal{X}_\text{unseen} \text{ or } x_i \in \mathcal{X}_\text{unseen} \text{ or } x_j \in \mathcal{X}_\text{unseen}\}.
\end{equation*}
Next, we uniformly sample triplets for the training set $\mathcal{T}_\text{train}$ from the set $\mathcal{T}\setminus \mathcal{T}_\text{unseen}$, which guarantees that the items in $\mathcal{X}_\text{unseen}$ do not appear in $\mathcal{T}_\text{train}$ and are therefore unseen during training. Finally, we uniformly sample triplets for the test set $\mathcal{T}_\text{test}$ from the set of all triplets $\mathcal{T}$. We apply the same splitting strategy to $\mathcal{T}_\text{train}$ to further split it into training and validation parts, and repeat this 20 times. We report the mean and standard deviation of the validation accuracies on these 20 validation parts.

\textbf{Choice of Parameters for Kernel Function: }
We conducted a parameter search on the validation set over the following values:
\begin{itemize}
    \item $\sigma: 0.01, 0.1, 1, 10$
    \item $\alpha: 0.01, 0.1, 1$
    \item $p: 2, 5, 7, 10$
\end{itemize}
The reported results correspond to the best test accuracy values found in this search. 
