\section{PROBLEM SETTING}
Let objects be represented by the points $\bm{x}_1, \bm{x}_2, \ldots$, where each $\bm{x}_i$ is drawn from the distribution $\mathcal{D}'$. In the noiseless setting, we are given a set of triplet comparisons in the form of
\begin{eqnarray*}
    \text{sign}(\text{dist}^2(\bm{x}_h,\bm{x}_i)-\text{dist}^2(\bm{x}_h,\bm{x}_j)).
\end{eqnarray*}
We are interested in a theoretical understanding of the problem of learning a kernelized Mahalanobis metric from triplet comparison queries. Our work extends the learning-theoretic results of \cite{mason2017learning} for linear metric learning to more general nonlinear metrics.

Let $\mathcal{S}$ denote a set of random triplets $t=\{\bx_{h}, \bx_{i}, \bx_{j}\}$, each drawn independently from the distribution $\mathcal{D}$; that is, given that $\bx_i\sim \mathcal{D}'$, each triplet $t_{\{h,i,j\}}\in \mathcal{S}$ is sampled from the stacked distribution $\mathcal{D}$. Therefore, in the general case, $|\mathcal{S}|$ triplets involve a total of $3|\mathcal{S}|$ objects. For each random triplet $t_{\{h,i,j\}}$, we observe a possibly noisy answer $y_t\in\{\pm 1\}$ indicating $\text{sign}\left(\|L\phi_h-L\phi_i\|_\mathcal{H}^2-\|L\phi_h-L\phi_j\|_\mathcal{H}^2\right)$. Specifically, we assume that there exists an unknown kernelized metric that is consistent with the data and classifies any triplet $t$ correctly with probability greater than $1/2$, where this probability is taken with respect to the randomness in $y_t$ and may depend on the specific triplet $t$. This reflects a common practical assumption when working with human judgments, namely that some queries are inherently noisier than others \citep{coombs1964theory,rau2016model}. We further assume that the $y_t$'s are statistically independent. Our goal is to learn a metric parameterized by a linear map $L$ that predicts triplets well on average. Namely, we seek an $L$ that minimizes the misclassification probability:
\begin{eqnarray}
    \text{Pr}\left(y_t\neq \text{sign}\left(\|L\phi_h-L\phi_i\|_\mathcal{H}^2-\|L\phi_h-L\phi_j\|_\mathcal{H}^2\right)\right).\label{0-1 loss}
\end{eqnarray}
Note that (\ref{0-1 loss}) is equal to the expected $0/1$ loss. In practice, minimizing the $0/1$ loss is intractable, so the above objective is relaxed to minimizing the true risk, defined as
  \begin{eqnarray}\label{true_risk}
    {R}(L):= 
    \\ 
&\hspace{-15mm}\mathbb{E}_{t\sim \mathcal{D}, y_t \in \{\pm 1\}}[\ell(y_t(\|L\phi_h-L\phi_i\|_\mathcal{H}^2-\|L\phi_h-L\phi_j\|_\mathcal{H}^2))], \nonumber
\end{eqnarray}
for an arbitrary convex and $\alpha$-Lipschitz loss $\ell : \mathbb{R}\rightarrow \mathbb{R}_{\geq 0}$,
where the expectation is over a random triplet $t=\{\bx_h, \bx_i, \bx_j\}\sim \mathcal{D}$ and the binary random label $y_t$ conditioned on $t$. If $\ell$ is chosen to upper bound the $0/1$ loss (e.g., the hinge loss $\ell(z) = \max(1-z, 0)$ or the base-2 logistic loss $\ell(z) = \log_2(1 + \exp(-z))$), then $R(L)$ upper bounds the misclassification probability.
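To see why, here is a brief sketch. Write $\Delta_t(L)$ for the difference of squared distances $\|L\phi_h-L\phi_i\|_\mathcal{H}^2-\|L\phi_h-L\phi_j\|_\mathcal{H}^2$ (a shorthand used only in this remark), and assume $\ell(z)\geq \mathbf{1}[z\leq 0]$ for all $z$, which holds for the hinge loss. Since $y_t\neq\text{sign}(\Delta_t(L))$ implies $y_t\Delta_t(L)\leq 0$, we have
\begin{eqnarray*}
    \text{Pr}\left(y_t\neq \text{sign}(\Delta_t(L))\right) &\leq& \mathbb{E}\left[\mathbf{1}[y_t\Delta_t(L)\leq 0]\right] \\
    &\leq& \mathbb{E}\left[\ell(y_t\Delta_t(L))\right] = R(L).
\end{eqnarray*}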

Unfortunately, we cannot minimize $R(L)$ directly as the joint distribution of $(t, y_t)$ is unknown. Instead, given a set of triplets $\mathcal{S}$ and their labels $y_t$, we wish to learn a kernelized metric, parameterized by a bounded linear map $L : \mathcal{H} \rightarrow \mathcal{H}$, that predicts triplets as well as possible on the observed data, as measured by
\begin{eqnarray}
  &&\hspace{-11mm}\widehat{R}_\mathcal{S}(L):= \label{empirical risk first}
  \\ &&\hspace{-9mm} \frac{1}{|\mathcal{S}|}\sum_{(t,y_t)\in \mathcal{S}}\ell(y_t(\|L\phi_h-L\phi_i\|_\mathcal{H}^2-\|L\phi_h-L\phi_j\|_\mathcal{H}^2)). \nonumber 
\end{eqnarray}
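As a brief check on this quantity, let $\Delta_t(L)$ denote the difference of squared distances appearing inside the loss (a shorthand used only in this remark). Under the sampling assumptions above, each pair $(t,y_t)\in\mathcal{S}$ follows the same joint distribution, so for any fixed $L$ chosen independently of $\mathcal{S}$, linearity of expectation gives
\begin{eqnarray*}
    \mathbb{E}_{\mathcal{S}}\big[\widehat{R}_\mathcal{S}(L)\big] &=& \frac{1}{|\mathcal{S}|}\sum_{(t,y_t)\in \mathcal{S}}\mathbb{E}\left[\ell(y_t\Delta_t(L))\right] \\
    &=& R(L).
\end{eqnarray*}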
  We refer to $\widehat{R}_\mathcal{S}(L)$ as the empirical risk as it is an unbiased estimator of the true risk $R(L)$. For any given $\ell$, we wish to answer three questions:
  \begin{enumerate}    
      \item Regularizing a norm of $L$ controls the flexibility of the metric and hence of the model's predictions. What is the appropriate way to regularize in order to balance the bias-variance tradeoff in metric learning?
      \item What can we guarantee about the generalization performance of the solution to (\ref{empirical risk first}), and how does this depend on the norm of $L$ that we choose to regularize?
        \item As written, (\ref{empirical risk first}) is a potentially infinite dimensional, nonconvex optimization problem. How can it be made computationally tractable?
  \end{enumerate}
We address the first two questions in Section \ref{sec:Theoretical Guarantees} and the last question in Section \ref{sec:practical}.
  