\section{Triplet outcome prediction experiment details}

Below we present details on the experiments from Section~\ref{sec:eval_datasets}.

The hyperparameters for each model were optimized and the best performing configurations for each $d$ are reported. We used the following grid of hyperparameters: $lr \in [1e-2,1e-3,1e-4,1e-5]$, $batch size \in [128, 256, 512, |\mathcal{T}|]$, $L_2$ regularizer $\lambda \in [0, 0.4, 1]$, $\sigma_\varepsilon \in [0.4,0.3,0.2,0.1,0.01]$, $\gamma \in [2,3,5,10,15,20,25]$. No regularization was used for CKL and $\gamma$-CKL because these models are scale-free.
    
    The best hyperparameter configurations for each dataset are given below:
    \begin{itemize}
        \item \textbf{Musical Artists}
            \begin{itemize}
                \item t-STE (accuracy 86\%, nll 0.333): $D=50$, $lr=1e-4$, $\lambda=0$, $batch size = |\mathcal{T}|$
                \item CKL (accuracy 83.9\%, nll 0.395): $D=80$, $lr=1e-5$, $batch size = 256$
                \item Probit (accuracy 85.6\%, nll 0.352): $D=90$, $lr=1e-2$, $\lambda=0.4$, $\sigma_\varepsilon = 0.1$, $batch size = 512$
                \item $\gamma$-CKL (accuracy 86.5\%, nll 0.329): $D=80$, $lr=1e-3$, $\gamma = 5$, $batch size = |\mathcal{T}|$
            \end{itemize}
        \item \textbf{Food}
            \begin{itemize}
                \item t-STE (accuracy 84.9\%, nll 0.327): $D=80$, $lr=1e-2$, $\lambda=0$, $batch size = |\mathcal{T}|$
                \item CKL (accuracy 82.8\%, nll 0.389): $D=80$, $lr=1e-3$, $batch size = |\mathcal{T}|$
                \item Probit (accuracy 85\%, nll 0.327): $D=90$, $lr=1e-2$, $\lambda=1.0$, $\sigma_\varepsilon = 0.01$, $batch size = |\mathcal{T}|$
                \item $\gamma$-CKL (accuracy 85.\%, nll 0.327): $D=90$, $lr=1e-4$, $\gamma = 25$, $batch size = |\mathcal{T}|$
            \end{itemize}
        \item \textbf{Movie Actors}
            \begin{itemize}
                \item t-STE (accuracy 84.6\%, nll 0.4): $D=50$, $lr=1e-2$, $\lambda=0$, $batch size = 512$
                \item CKL (accuracy 79.1\%, nll 0.452): $D=15$, $lr=1e-2$, $batch size = 512$
                \item Probit (accuracy 85\%, nll 0.327): $D=90$, $lr=1e-4$, $\lambda=0$, $\sigma_\varepsilon = 0.2$, $batch size = |\mathcal{T}|$
                \item $\gamma$-CKL (accuracy 86.3\%, nll 0.314): $D=90$, $lr=1e-3$, $\gamma = 20$, $batch size = |\mathcal{T}|$
            \end{itemize}
    \end{itemize}

\section{\textsc{$\gamma$-CKLSearch} for moderate $n$} \label{sec:study_appendix}

\begin{algorithm}[ht]
    \caption{\textsc{$\gamma$-CKLSearch}}\label{alg:1std}
    \begin{algorithmic}[1]
    \STATE $m \gets 0$
    \STATE $\mathcal{U} \gets \emptyset$
    \STATE Initialize the prior $\mathcal{P}_0$ with $p^0_k \gets \frac{1}{n}, ~\forall k=1,2,\dots,n$
    \REPEAT
        \STATE Compute the sample mean $\bar{\vmu}_m$ and the sample covariance $\bar{\mSigma}_m$ from the current belief $\mathcal{P}_m$
        \STATE Find the largest eigenvalue of $\bar{\mSigma}_m$ and its eigenvector, $\lambda_\text{max}$ and $\vv_\text{max}$ respectively
        \STATE $\tilde{\vz}_1 \gets \bar{\vmu}_m + r \cdot \sqrt{\lambda_\text{max}} \vv_\text{max}$
        \STATE $\tilde{\vz}_2 \gets \bar{\vmu}_m - r \cdot \sqrt{\lambda_\text{max}} \vv_\text{max}$
        \STATE Find two objects $i \ne j$, s.t.
        \begin{align*}
            i &= \argmin_{i \in [n], i \not\in \mathcal{U}} p^m_i ||\vx_i - \tilde{\vz}_1||_2,\\
            j &= \argmin_{j \in [n], j \not\in \mathcal{U}} p^m_j ||\vx_j - \tilde{\vz}_2||_2
        \end{align*}
        \STATE $\mathcal{U} \gets \mathcal{U} \cup \{i,j\}$
        \STATE Obtain the response $\hat{y}$ from the user
        \STATE Update belief $\mathcal{P}_{m+1} \gets \textsc{Update}(\mathcal{P}_{m}, \hat{y})$ using Bayes rule %
        \STATE $m \gets m+1$
    \UNTIL{$t \in \{ i,j\} $}
    \end{algorithmic}

\end{algorithm}
    
When there is only a finite number $n$ of points, we can keep the full posterior distribution $\mathcal{P} = [p_1, p_2, \dots, p_n]$ over all $n$ objects and propose a more efficient algorithm than the ones introduced in the previous subsection for continuous $\Omega$. Since $\vx_t$ is not known by the system during the search, we take a Bayesian approach to model the probability that each object in $[n] = \{1,2,\dots,n\}$ is the target, and at each step $m$ of the search we maintain a full belief $\mathcal{P}_m = [p^m_1, p^m_2, \dots, p^m_n]$ over all $n$ objects. We start with a uniform prior $\mathcal{P}_0 = [\frac{1}{n}, \frac{1}{n}, \dots, \frac{1}{n} ]$.



\textbf{Choosing the next query to ask the user.} Similarly to $\textsc{GaussSearch}$, at each step we would like to ask a query $(i, j)$ that would maximize the \emph{expected information gain} given the current posterior belief $\mathcal{P}_m$ at step $m$ of the search:
\begin{align}
  (i, j) := \operatorname*{arg\,max}_{i \ne j} \left( H(\mathcal{P}_m) - \E_{Y|\vx_i,\vx_j} [H(\mathcal{P}_m \mid Y)] \right), \label{eq:EIG}
\end{align}
where $Y \sim P(Y \mid \vx_i,\vx_j)$ is distributed according to the belief over the answers to the query $(i,j)$, marginalized over the possible targets, i.e.,
\begin{align*}
    P(Y=i \mid \vx_i, \vx_j) = \sum_{k = 1}^n p_{\vx_i, \vx_j, \vx_k} ~p^m_k.
\end{align*}
Performing an exhaustive search over all $O(n^2)$ possible pairs $(i,j)$ in order to find the optimal query in terms of (\ref{eq:EIG}) would be prohibitively slow, so we propose an alternative heuristic that has good performance in practice. 
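
Because the response is binary, the objective in (\ref{eq:EIG}) can be evaluated for a single pair without forming the two candidate posteriors. As a sketch, write $q_k := p_{\vx_i, \vx_j, \vx_k}$ and $\bar{q} := \sum_{k=1}^n p^m_k q_k$ (shorthand introduced here only for illustration), and let $h(q) = -q \log q - (1-q) \log (1-q)$ denote the binary entropy. By the symmetry of mutual information,
\begin{align*}
    H(\mathcal{P}_m) - \E_{Y|\vx_i,\vx_j} [H(\mathcal{P}_m \mid Y)] = h(\bar{q}) - \sum_{k=1}^n p^m_k \, h(q_k).
\end{align*}
Scoring one pair therefore takes $n$ evaluations of the response model, and an exhaustive search over all pairs takes on the order of $n^3$ such evaluations per step, which motivates the heuristic.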



We first detect the direction along which the variance of the belief is maximized. To this end, we compute a sample mean and a sample covariance matrix $(\bar{\vmu}_m, \bar{\mSigma}_m)$ from the current belief $\mathcal{P}_m$. Next, we build a proto-query as a pair of points $(\tilde{\vz}_1, \tilde{\vz}_2)$ in $\R^d$ that lie along the direction of maximum variance of $\bar{\mSigma}_m$, on opposite sides of the sample mean $\bar{\vmu}_m$. To obtain the desired explore-exploit trade-off for a query, we control the distance from each proto-query point to $\bar{\vmu}_m$ with a multiplicative parameter $r \in \R_+$. Finally, we find two distinct objects $(i,j)$ from $[n]$ whose representations are closest to $(\tilde{\vz}_1, \tilde{\vz}_2)$ in a $\mathcal{P}_m$-weighted Euclidean distance, which favors near and more probable points. This pair $(i,j)$ becomes the next query to the oracle.
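
As a small illustration with made-up numbers (not taken from our experiments), let $d = 2$, $\bar{\vmu}_m = (0, 0)^\top$, $\bar{\mSigma}_m = \mathrm{diag}(4, 1)$ and $r = 2$. Then $\lambda_\text{max} = 4$, $\vv_\text{max} = (1, 0)^\top$, and the proto-query of Algorithm~\ref{alg:1std} is
\begin{align*}
    \tilde{\vz}_1 &= \bar{\vmu}_m + r \sqrt{\lambda_\text{max}} \, \vv_\text{max} = (4, 0)^\top, \\
    \tilde{\vz}_2 &= \bar{\vmu}_m - r \sqrt{\lambda_\text{max}} \, \vv_\text{max} = (-4, 0)^\top,
\end{align*}
i.e., the two points lie $r = 2$ standard deviations away from the mean along the principal axis of the belief. A larger $r$ pushes the query toward the periphery of the belief, while a smaller $r$ keeps it close to the current mean.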



\textbf{Posterior \textsc{Update}.} After we obtain the response from the user, $\hat{y} \in \{ i,j\}$, the posterior probabilities are updated using Bayes' rule $p^{m+1}_k = p^m_k ~P(Y = \hat{y} \mid \vx_i, \vx_j, \vx_k) / C, ~~k=1,2,\dots,n$,
where $C = \sum_{l=1}^n p^m_l ~P(Y = \hat{y} \mid \vx_i, \vx_j, \vx_l)$ is the normalizing constant.
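
For instance, take $n = 3$, a uniform belief $p^m_k = \frac{1}{3}$, and hypothetical likelihoods $P(Y = \hat{y} \mid \vx_i, \vx_j, \vx_k) = 0.8,\ 0.5,\ 0.2$ for $k = 1, 2, 3$ (illustrative values only). Then
\begin{align*}
    C &= \tfrac{1}{3} (0.8 + 0.5 + 0.2) = 0.5, \\
    \left( p^{m+1}_1, p^{m+1}_2, p^{m+1}_3 \right) &= \left( \tfrac{0.8}{1.5}, \tfrac{0.5}{1.5}, \tfrac{0.2}{1.5} \right) = \left( \tfrac{8}{15}, \tfrac{1}{3}, \tfrac{2}{15} \right),
\end{align*}
so the belief shifts toward the objects under which the observed answer was most likely, and the updated probabilities again sum to one.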



The search finishes when the user indicates that one of the query objects is their target. Otherwise, both query objects are considered to be non-targets and do not appear again in the search. We keep track of the objects that have already been displayed to the user using the set of ``used'' objects $\mathcal{U}$. The complete search algorithm is outlined in Algorithm~\ref{alg:1std}.

The complexity of each step of Algorithm~\ref{alg:1std} is $O(nd + d^2)$, since computing the sample covariance is $O(nd)$ and the principal eigenvector can be approximated with the power method in $O(d^2)$. Since in practice the number of features $d$ remains constant, the complexity is linear in the number of objects $n$.

\textbf{Additional comments on the face search experiment.} In total, we recruited 24 human participants. We presented 10 different target actors to each participant and asked them to search for each target. We performed A/B testing by covertly using \textsc{$\gamma$-CKLSearch} in the backend of the search interface for one half of the searches and \textsc{GaussSearch} for the other half. The target actors were chosen uniformly at random from a filtered set of 387 actors that had at least 100 associated triplets in $\mathcal{T}$. The A/B testing assignments were designed such that almost all of the chosen targets were paired exactly once with \textsc{$\gamma$-CKLSearch} and exactly once with \textsc{GaussSearch}. Overall, the participants performed 207 searches with 129 unique targets: 104 searches using \textsc{GaussSearch} and 103 searches using \textsc{$\gamma$-CKLSearch}. Of the participants, 19 completed all 10 searches, 1 completed 7 searches, 1 completed 5 searches, 1 completed 3 searches, and 2 completed only 1 search. Based on initial trial runs, we chose the following hyperparameters: $D = 5$, $\gamma=3$, $r=2$ and $\sigma_\varepsilon = 0.1$.

To ensure fair payment, we estimated the duration of our study in trial runs. Participants were paid the equivalent of 20 USD per hour. We do not collect any sensitive data; in particular, we do not collect any data that makes a participant personally identifiable. The study design has been reviewed and approved by our IRB.

\textbf{Instructions given to participants.} The text below is a copy of the instructions given to our participants.

``Want to do a paid search? Here is the way!''

With [name withheld for double-blind review], you can find that actor or actress interactively! We will show you four faces, and all you have to do is to click on the one who looks most like the person you have in mind. Just repeat this process a few times, until your target appears among the four faces. Click Found to take you to their details.
\begin{itemize}
    \item Create an account
    \item Come back here
    \item Start to make searches!

\end{itemize}

Are you registered? If yes, start to make searches!

Confidentiality:

In accordance with GDPR and European laws on privacy, our website uses cookies. However, only necessary cookies are used (to identify you and let you perform your search). If you chose to refuse the use of cookies, you won't be able to use our website. We will not share personally identifiable information with anyone. However, we may use anonymized and aggregated information collected from this experiment for research purposes, and potentially release such information publicly in the spirit of Open Science and reproducibility.


\section{Theorem~\ref{gamma-d}}\label{appendix:th21}
\begin{figure}[H]
\centering
\subfloat[$\hat{d}=10$, $\hat{\gamma} = 5$]{\includegraphics[width=5cm, height=5cm]{plots/dgamma_plt2.png}}
\qquad
\subfloat[$\hat{d}=20$, $\hat{\gamma} = 4$]{\includegraphics[width=5cm, height=5cm]{plots/dgamma_plt1.png}}
\caption{Linear relationship between $\gamma$ and $d$ for finite values of $d$.}
\label{fig:thm-plot}
\end{figure}

\textbf{Experiments on the relationship between $\gamma$ and $d$}. In our experiment, we first fix the reference values $\hat{d}$ and $\hat{\gamma}$, for which we compute the average probability of the correct answer $p_Q(\hat{\gamma},\hat{d})$. Then we iterate over the values of $d > \hat{d}$ and, for each, find the corresponding $\gamma$ that minimizes $|p_Q(\hat{\gamma},\hat{d}) - p_Q(\gamma,d)|$ via a grid search. The best values of $\gamma$ are reported in Fig.~\ref{fig:thm-plot}. In all trials we kept $N=1000$ and $|\mathcal{T}| = 10\,000$. We observe a linear relationship between $\gamma$ and $d$ even for finite values of $d$, which matches the limit result of Theorem~\ref{gamma-d}.

\begin{proof}[Proof of Theorem~\ref{gamma-d}]
Consider two points $\vx_a,\vx_b \in \R^d$ sampled uniformly from a unit ball $\mathcal{B}$ that form a query to the oracle $Q =(\vx_a,\vx_b)$. After asking $Q$ we observe the answer $Y \in \{\vx_a,\vx_b\}$ under the $\gamma$-CKL model for some fixed $\gamma \geq 2$. Then the probability that the answer $Y$ is correct, $p_Q$, is
\begin{align}
    p_Q =& \int_{r_1=0}^1 \int_{r_2=r_1}^1 \frac{ r_2^\gamma}{r_1^\gamma + r_2^\gamma} S_d(r_1) S_d(r_2) \frac{1}{V_d} \frac{1}{V_d} dr_1 dr_2 \nonumber \\
    &+ \int_{r_1=0}^1 \int_{r_2=0}^{r_1} \frac{ r_1^\gamma}{r_1^\gamma + r_2^\gamma} S_d(r_1) S_d(r_2) \frac{1}{V_d} \frac{1}{V_d} dr_1 dr_2 \nonumber \\
    =& \int_{r_1=0}^1 \int_{r_2=r_1}^1 \frac{ r_2^\gamma}{r_1^\gamma + r_2^\gamma} r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2 \label{int1} \\
    &+ \int_{r_1=0}^1 \int_{r_2=0}^{r_1} \frac{ r_1^\gamma}{r_1^\gamma + r_2^\gamma} r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2 \label{int2},
\end{align}

where 
\begin{align*}
    S_d(r) = \frac{2\pi^{\frac{d}{2}}}{\Gamma(\frac{d}{2})}r^{d-1}, ~V_d = \frac{\pi^{\frac{d}{2}}}{\Gamma(\frac{d}{2}+1)}
\end{align*}
are, respectively, the surface area of the sphere of radius $r$ and the volume of the unit ball $\mathcal{B}$.
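
The simplification of the integrands in (\ref{int1}) and (\ref{int2}) follows from the identity $\Gamma(\frac{d}{2}+1) = \frac{d}{2}\Gamma(\frac{d}{2})$, which gives the radial density of a point sampled uniformly from the unit ball:
\begin{align*}
    \frac{S_d(r)}{V_d} = \frac{2\pi^{\frac{d}{2}}}{\Gamma(\frac{d}{2})} \cdot \frac{\Gamma(\frac{d}{2}+1)}{\pi^{\frac{d}{2}}} \, r^{d-1} = d \, r^{d-1}, \qquad \int_0^1 d \, r^{d-1} dr = 1.
\end{align*}
The product of the two radial densities is therefore $d^2 r_1^{d-1} r_2^{d-1}$.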



Consider (\ref{int1}), 
$$
    \int_{r_1=0}^1 \int_{r_2=r_1}^1 \frac{ r_2^\gamma}{r_1^\gamma + r_2^\gamma} r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2.
$$

As $d$ increases, the distance from the center of the ball to a uniformly sampled interior point concentrates near 1. We therefore use the first-order Taylor approximation of the probability model at $(1,1) \in \R^2$:
\begin{align*}
    \frac{r_2^\gamma}{r_1^\gamma + r_2^\gamma} &= \frac{1}{2} - (r_1 - 1)\frac{\gamma}{4} + (r_2 - 1)\frac{\gamma}{4} + R(r_1, r_2) \\
    &= P(r_1, r_2) + R(r_1, r_2).
\end{align*}
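
The coefficients of $P(r_1, r_2)$ can be checked directly. Writing $f(r_1, r_2) := \frac{r_2^\gamma}{r_1^\gamma + r_2^\gamma}$ (a shorthand used only for this check), we have
\begin{align*}
    \frac{\partial f}{\partial r_1} = -\frac{\gamma \, r_1^{\gamma-1} r_2^{\gamma}}{(r_1^\gamma + r_2^\gamma)^2}, \qquad
    \frac{\partial f}{\partial r_2} = \frac{\gamma \, r_1^{\gamma} r_2^{\gamma-1}}{(r_1^\gamma + r_2^\gamma)^2},
\end{align*}
so that $f(1,1) = \frac{1}{2}$, $\frac{\partial f}{\partial r_1}(1,1) = -\frac{\gamma}{4}$ and $\frac{\partial f}{\partial r_2}(1,1) = \frac{\gamma}{4}$.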

Fix some $0 < \varepsilon < 1$. Then
\begin{align*}
    (\ref{int1}) =& \int_{r_1=0}^1 \int_{r_2=r_1}^1 \frac{ r_2^\gamma}{r_1^\gamma + r_2^\gamma} r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2  \\
    =& \int_{r_1=\varepsilon}^1 \int_{r_2=r_1}^1 \frac{ r_2^\gamma}{r_1^\gamma + r_2^\gamma} r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2 \\
    &+ \int_{r_1=0}^\varepsilon \int_{r_2=r_1}^1 \frac{ r_2^\gamma}{r_1^\gamma + r_2^\gamma} r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2 \\
    =& \int_{r_1=\varepsilon}^1 \int_{r_2=r_1}^1 P(r_1, r_2) r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2 \\
    &+ \int_{r_1=\varepsilon}^1 \int_{r_2=r_1}^1 R(r_1, r_2) r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2 \\
    &+ \int_{r_1=0}^\varepsilon \int_{r_2=r_1}^1 \frac{ r_2^\gamma}{r_1^\gamma + r_2^\gamma} r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2.
\end{align*}

First note that the last summand is $o(1)$ as $d \rightarrow \infty$:
\begin{align*}
    &\int_{r_1=0}^\varepsilon \int_{r_2=r_1}^1 \frac{ r_2^\gamma}{r_1^\gamma + r_2^\gamma} r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2 \\
    &\leq \int_{r_1=0}^\varepsilon \int_{r_2=r_1}^1 r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2 \\
    &= \frac{1}{2} \varepsilon^d(2-\varepsilon^d) = o(1).
\end{align*}
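
The bound can be evaluated in closed form by integrating over $r_2$ first:
\begin{align*}
    \int_{r_1=0}^\varepsilon \int_{r_2=r_1}^1 r_1^{d-1} r_2^{d-1} d^2 \, dr_2 \, dr_1
    = \int_{0}^\varepsilon d \, r_1^{d-1} \left( 1 - r_1^d \right) dr_1
    = \varepsilon^d - \frac{1}{2} \varepsilon^{2d}.
\end{align*}
To give a sense of scale, for the illustrative values $\varepsilon = 0.9$ and $d = 100$ we have $\varepsilon^d \approx 2.7 \cdot 10^{-5}$, so this contribution is already negligible at moderate dimensions.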

Now the integral with the $P(r_1, r_2)$ term can be computed as follows:

\begin{align*}
    &\int_{r_1=\varepsilon}^1 \int_{r_2=r_1}^1 P(r_1, r_2) r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2 \\
    =& \int_{r_1=0}^1 \int_{r_2=r_1}^1 \frac{1}{2} r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2 - \frac{1}{2} \varepsilon^d + \frac{1}{4} \varepsilon^{2d} \\
    &- \int_{r_1=0}^1 \int_{r_2=r_1}^1 (r_1 - 1)\frac{\gamma}{4} r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2\\
    &+ \frac{\gamma d \varepsilon^d}{4} \left( \frac{ \varepsilon^{d} - 2 }{2d}  + \varepsilon\left( \frac{1}{d+1} - \frac{\varepsilon^d}{2d+1} \right) \right) \\
    &+ \int_{r_1=0}^1 \int_{r_2=r_1}^1 (r_2 - 1)\frac{\gamma}{4} r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2 \\
    &+ \frac{\gamma \varepsilon^d \left( (d(2d(\varepsilon-1)-3) - 1) \varepsilon^d + 2(2d+1) \right)}{8 (d+1)(2d+1)} \\
    =& \frac{1}{4} + \frac{\gamma}{4} \frac{d^2(3d+1)}{2d^2(d+1)(2d+1)} -\frac{\gamma}{4} \frac{d^2}{4d^3 + 2d^2} + o(1) \\
    =& \frac{1}{4} + \frac{\gamma}{4} \frac{d}{(d+1)(2d+1)} + o(1).
\end{align*}
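
The two $\gamma$-dependent terms in the last step can be checked by integrating out one variable first. Over the full region $0 \leq r_1 \leq r_2 \leq 1$,
\begin{align*}
    \int_{r_1=0}^1 \int_{r_2=r_1}^1 (r_1 - 1) r_1^{d-1} r_2^{d-1} d^2 \, dr_2 \, dr_1
    &= d \int_0^1 (r_1 - 1) r_1^{d-1} \left( 1 - r_1^d \right) dr_1
    = -\frac{3d+1}{2(d+1)(2d+1)}, \\
    \int_{r_2=0}^1 \int_{r_1=0}^{r_2} (r_2 - 1) r_1^{d-1} r_2^{d-1} d^2 \, dr_1 \, dr_2
    &= d \int_0^1 (r_2 - 1) r_2^{2d-1} dr_2
    = -\frac{1}{2(2d+1)}.
\end{align*}
Weighting these by $-\frac{\gamma}{4}$ and $+\frac{\gamma}{4}$, the coefficients of $(r_1 - 1)$ and $(r_2 - 1)$ in $P(r_1, r_2)$, gives
\begin{align*}
    \frac{\gamma}{4} \left( \frac{3d+1}{2(d+1)(2d+1)} - \frac{1}{2(2d+1)} \right) = \frac{\gamma}{4} \frac{d}{(d+1)(2d+1)}.
\end{align*}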

Finally, consider the remaining integral, 
$$
    \int_{r_1=\varepsilon}^1 \int_{r_2=r_1}^1 R(r_1, r_2) r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2.
$$

Using Taylor's theorem for multivariate functions, we can get an upper bound for its absolute value:
\begin{align*}
    &\left| \int_{r_1=\varepsilon}^1 \int_{r_2=r_1}^1 R(r_1, r_2) r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2 \right| \\
    &\leq \frac{M(\gamma)}{2} \int_{\mathcal{X}} \left( (r_1-1)^2 + (r_2-1)^2 \right) r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2 \\
    &+ \frac{M(\gamma)}{2} \int_{\mathcal{X}} 2(r_1-1)(r_2-1) r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2
\end{align*}
where
\begin{align*}
    M(\gamma) &= \max_{|\alpha| = 2,\, (r_1,r_2) \in \mathcal{X}} \left| D^{\alpha}\left[ \frac{r_2^\gamma}{r_1^\gamma + r_2^\gamma} \right] \right|, \\
    \mathcal{X} &= \{(r_1,r_2) ~|~ r_1 \in [\varepsilon, 1],~ r_2 \in [r_1, 1]\},
\end{align*}
and
\begin{align*}
    \left| D^{(2,0)}\left[ \frac{r_2^\gamma}{r_1^\gamma + r_2^\gamma} \right] \right| &= \left| \frac{\gamma r_2^{\gamma-2} r_1^\gamma ( (\gamma-1) r_1^\gamma - (\gamma+1)r_2^\gamma ) }{(r_1^\gamma + r_2^\gamma)^3} \right|, \\ \\
    \left| D^{(1,1)}\left[ \frac{r_2^\gamma}{r_1^\gamma + r_2^\gamma} \right] \right| &= \frac{\gamma^2 r_2^{\gamma-1} r_1^{\gamma-1} ( r_2^\gamma - r_1^\gamma ) }{(r_1^\gamma + r_2^\gamma)^3}, \\ \\
    \left| D^{(0,2)}\left[ \frac{r_2^\gamma}{r_1^\gamma + r_2^\gamma} \right] \right| &= \left| \frac{\gamma r_2^{\gamma} r_1^{\gamma-2} ( (\gamma+1) r_1^\gamma - (\gamma-1)r_2^\gamma ) }{(r_1^\gamma + r_2^\gamma)^3} \right|.
\end{align*}
For large enough $d$, if $\gamma$ grows with $d$, the maximum in $M(\gamma)$ is attained at $r_1 = r_2$, with $M(\gamma) \leq \frac{\gamma}{4} \varepsilon^{-2}$. We show this for $\left| D^{(2,0)}\left[ \frac{r_2^\gamma}{r_1^\gamma + r_2^\gamma} \right] \right|$; the other two cases can be proved similarly. Indeed,
\begin{align*}
    \left| D^{(2,0)}\left[ \frac{r_2^\gamma}{r_1^\gamma + r_2^\gamma} \right] \right| &= \left| \frac{\gamma r_2^{\gamma-2} r_1^\gamma ( (\gamma-1) r_1^\gamma - (\gamma+1)r_2^\gamma ) }{(r_1^\gamma + r_2^\gamma)^3} \right| \\
    &= \gamma \frac{ r_2^{\gamma-2} r_1^\gamma ( (\gamma+1)r_2^\gamma - (\gamma-1) r_1^\gamma ) }{(r_1^\gamma + r_2^\gamma)^3} \\
    &= \gamma \frac{ \left(\frac{r_2}{r_1}\right)^\gamma ( (\gamma+1)\left(\frac{r_2}{r_1}\right)^\gamma - (\gamma-1)) }{ r_2^2 (1+\left(\frac{r_2}{r_1}\right)^\gamma)^3 },
\end{align*}
which equals $\frac{\gamma}{4} r_2^{-2} \leq \frac{\gamma}{4} \varepsilon^{-2}$ when $r_1 = r_2$ (since $r_2 \geq \varepsilon$ on $\mathcal{X}$) and goes to 0 as $d \rightarrow \infty$ when $r_1 < r_2$.



Finally,
\begin{align*}
    &\int_{\mathcal{X}} \left( (r_1-1)^2 + 2(r_1-1)(r_2-1) + (r_2-1)^2 \right) r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2 \\
    &= \varepsilon^{2d} P_1 + \varepsilon^{d} P_2 + \frac{3d+4}{(d+1)^2(d+2)},
\end{align*}
where
$$
    P_1 = -\frac{d \left(d^2+(d+1)^2 \varepsilon ^2-2 (d+2)^2 \varepsilon +6 d+13\right)+8}{(d+1)^2 (d+2)}
$$
and
$$
    P_2 = \frac{(2 (d+2) d+1) d \varepsilon ^2-4 d (d+2) (d+1) \varepsilon +2 (d+2) (d+1)^2}{(d+1)^2 (d+2)} 
$$
are two rational functions of $d$ and $\varepsilon$.
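
The $\varepsilon$-independent term can be verified with a short moment computation. Let $R_1, R_2$ be i.i.d. with density $d \, r^{d-1}$ on $[0,1]$ (notation used only for this check), so that $\E[R_1] = \frac{d}{d+1}$ and $\E[R_1^2] = \frac{d}{d+2}$. Then
\begin{align*}
    \E[R_1 - 1] = -\frac{1}{d+1}, \qquad \E[(R_1 - 1)^2] = \frac{2}{(d+1)(d+2)},
\end{align*}
and, since the region $r_1 \leq r_2$ carries half of the symmetric integrand,
\begin{align*}
    \frac{1}{2} \E\left[ (R_1 + R_2 - 2)^2 \right]
    = \frac{1}{2} \left( \frac{4}{(d+1)(d+2)} + \frac{2}{(d+1)^2} \right)
    = \frac{3d+4}{(d+1)^2(d+2)},
\end{align*}
which is the value of the integral over $\mathcal{X}$ in the formal limit $\varepsilon \rightarrow 0$.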



Putting everything together we can upper bound the remainder by
\begin{align*}
    &\left| \int_{r_1=\varepsilon}^1 \int_{r_2=r_1}^1 R(r_1, r_2) r_1^{d-1} r_2^{d-1} d^2 dr_1 dr_2 \right| \\
    &\leq \frac{\gamma \varepsilon^{-2}}{8} \left( \varepsilon^{2d} P_1 + \varepsilon^{d} P_2 + \frac{3d+4}{(d+1)^2(d+2)} \right).
\end{align*}

Also, due to symmetry, $(\ref{int1}) = (\ref{int2})$, and thus

\begin{align*}
    p_Q &= \frac{1}{2} + \frac{\gamma}{2} \frac{d}{(d+1)(2d+1)} + \hat{R} + o(1)
\end{align*}
where
\begin{align*}
    &|\hat{R}| \leq \frac{\gamma \varepsilon^{-2}}{4} \left( \varepsilon^{2d} P_1 + \varepsilon^{d} P_2 + \frac{3d+4}{(d+1)^2(d+2)} \right).
\end{align*}


We see that if
$$
    \frac{\gamma}{d} = c_1 + o(1)
$$ 
and $d \rightarrow \infty$, then
$$
p_Q = c_2 + o(1),
$$
where $c_1 > 0$, $c_2>0$ are constants.
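
To make the constants concrete, consider the explicit first-order term only. If $\gamma = c_1 d$ exactly, then
\begin{align*}
    \frac{\gamma}{2} \frac{d}{(d+1)(2d+1)} = \frac{c_1 d^2}{2(d+1)(2d+1)} \longrightarrow \frac{c_1}{4} \quad \text{as } d \rightarrow \infty,
\end{align*}
which suggests $c_2 = \frac{1}{2} + \frac{c_1}{4}$ whenever the remainder $\hat{R}$ vanishes in the limit. As a finite-dimensional illustration, for the reference values $\hat{d} = 10$, $\hat{\gamma} = 5$ of Fig.~\ref{fig:thm-plot}, the same term gives $\frac{1}{2} + \frac{5}{2} \cdot \frac{10}{11 \cdot 21} \approx 0.608$; this is only the first-order approximation and not the exact value of $p_Q$.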
\end{proof}

\section{$\gamma$-$d$ Relation in the Embedding Experiments}

For $\gamma$-CKL, as $d$ increases, the best-performing
values of $\gamma$ also tend to increase, which is in line with the findings of Theorem~\ref{gamma-d}; see Fig.~\ref{fig:gamma_d_rel}.
For each dataset and each value of $d$, we report the running average of the mean of the 10
best-performing values of $\gamma$ for that $d$. We see that the running average value of $\gamma$ is uniformly
lower for the Musical Artists dataset than for the other two datasets. We suspect this is
because this dataset contains a relatively small average number of triplets per object. As a result,
the $\gamma$-CKL embedding does not profit from larger values of $\gamma$, which would make
the model more confident when predicting outcome probabilities.


\begin{figure}[t]
\centerline{\includegraphics[scale=0.4]{plots/gamma_d.png}}
\caption{Running average of the best performing values of $\gamma$ in $\gamma$-CKL embedding as we
increase the embedding dimensionality $d$.}
\label{fig:gamma_d_rel}
\end{figure}

\section{Proof of Proposition~\ref{spheres}}
\begin{proof}
    First we will show that for a query $Q_i = (\vx^a_i,\vx^b_i)$ the set of points $\mathcal{S}_i \subset \Omega$ for which the expected log-likelihood of the answer $Y$ is maximized forms a $d$-dimensional sphere.  For ease of reading, we drop the index $i$ and simply write $Q = (\vx_a,\vx_b)$.

    
    
    
    The oracle will answer $Y = \vx_a$ with probability $p(\vx_a,\vx_b;\vx_t) = \frac{\lVert \vx_b - \vx_t \rVert^\gamma}{\lVert \vx_a - \vx_t \rVert^\gamma + \lVert \vx_b - \vx_t \rVert^\gamma}$. Then all points $\vx \in \Omega$ s.t. $p_{\vx_a,\vx_b,\vx} = p_{\vx_a,\vx_b,\vx_t}$ will have the largest expected log-likelihood values. Now denoting
    $$
        c := \frac{\lVert \vx_a - \vx_t \rVert^2}{\lVert \vx_b - \vx_t \rVert^2} = {\left( \frac{1}{p} - 1 \right)}^\frac{2}{\gamma},
    $$
    and observing
    \begin{align*}
        p = \frac{\lVert \vx_b - \vx_t \rVert^\gamma}{\lVert \vx_a - \vx_t \rVert^\gamma + \lVert \vx_b - \vx_t \rVert^\gamma} &= \frac{1}{\frac{\lVert \vx_a - \vx_t \rVert^\gamma}{\lVert \vx_b - \vx_t \rVert^\gamma}+ 1},
    \end{align*}
    we can define the set $\mathcal{S}_Q$ by
    $$
        \sum_{j=1}^d (\vx_j - (\vx_a)_j)^2 - c \sum_{j=1}^d (\vx_j - (\vx_b)_j)^2 = 0,
    $$
    which, for $c \neq 1$, is equivalent to
    $$
        \sum_{j=1}^d (\vx_j-\vz_j)^2 = r
    $$
    for
    \begin{align*}
        \vz_j &= \frac{(\vx_a)_j - c (\vx_b)_j}{1-c},\\
        r &= \sum_{j=1}^d \left( \frac{(c (\vx_b)_j - (\vx_a)_j)^2 }{(1-c)^2} - \frac{(\vx_a)_j^2 - c(\vx_b)_j^2}{1-c} \right).
    \end{align*}
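
    The center and the squared radius follow from completing the square. Expanding the defining equation coordinate-wise and dividing by $1 - c$ (assuming $c \neq 1$) gives
    \begin{align*}
        \sum_{j=1}^d \left( \vx_j^2 - 2 \vx_j \frac{(\vx_a)_j - c (\vx_b)_j}{1-c} + \frac{(\vx_a)_j^2 - c (\vx_b)_j^2}{1-c} \right) = 0,
    \end{align*}
    from which the center can be read off. Using $(a - cb)^2 - (1-c)(a^2 - c b^2) = c (a-b)^2$ for scalars $a, b$, the squared radius simplifies to
    \begin{align*}
        r = \frac{c \, \lVert \vx_a - \vx_b \rVert^2}{(1-c)^2},
    \end{align*}
    which is nonnegative, as it should be. As a sanity check, $c \rightarrow 0$ gives a sphere that shrinks to the single point $\vx_a$. In the degenerate case $c = 1$, the set is instead the hyperplane of points equidistant from $\vx_a$ and $\vx_b$.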
    
    
    
    
    Hence, for a fixed query, the points that have the same likelihood as $\vx_t$ (and which will have the maximal expected log-likelihood) form a sphere in $\mathbb{R}^d$. 
    For two distinct queries $Q_1$ and $Q_2$, the set of points with maximal expected log-likelihood for both $Q_1$ and $Q_2$ lies in the intersection of $\mathcal{S}_1$ and $\mathcal{S}_2$, i.e., at least in a $(d-1)$-dimensional sphere $\mathcal{S}_1 \cap \mathcal{S}_2$. Consider a third query $Q_3$. The intersection $\mathcal{S}_1 \cap \mathcal{S}_2 \cap \mathcal{S}_3$ is at least a $(d-2)$-dimensional sphere if $\vz_3$ does not lie on the line through $\vz_1$ and $\vz_2$; otherwise, the points in $\mathcal{S}_1 \cap \mathcal{S}_2$ are equidistant from $\vz_3$, and since $\vx_t \in \mathcal{S}_1 \cap \mathcal{S}_2$, we have $\mathcal{S}_1 \cap \mathcal{S}_2 = \mathcal{S}_1 \cap \mathcal{S}_2 \cap \mathcal{S}_3$, so no additional reduction in the dimensionality of the intersection is achieved (see Fig.~\ref{fig:spheres-plot} for an illustration). Similarly, for $d+1$ queries $Q_1, Q_2, \dots, Q_{d+1}$, a sufficient condition for
    $$
        \mathcal{S}_1 \cap \mathcal{S}_2 \cap \dots \cap \mathcal{S}_{d+1} = \{\vx_t\}
    $$
    is
    \begin{align}
        \text{rank}(\tilde{\vz}_{1}-\tilde{\vz}_{d+1}, \tilde{\vz}_{2}-\tilde{\vz}_{d+1}, \dots, \tilde{\vz}_{d}-\tilde{\vz}_{d+1}) = d. \label{spheres-rank}
    \end{align}
    The intersection of the $d+1$ corresponding spheres then consists of exactly one point, $\vx_t$. Thus the expected log-likelihood after $d+1$ such queries is maximized only at $\vx_t$. Now, by choosing a uniform prior over $\Omega$, in expectation over the outcomes of any set of queries $\tilde{\mathcal{Q}}$ that satisfies (\ref{spheres-rank}), the posterior is maximized only at $\vx_t$, and the claim follows immediately.
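
    The role of the rank condition can be seen by subtracting sphere equations. As a sketch, write $\tilde{\vz}_k$ and $r_k$ for the center and the squared radius of $\mathcal{S}_k$ (we assume here that these are the centers appearing in (\ref{spheres-rank})). Any $\vx$ in the intersection satisfies $\lVert \vx - \tilde{\vz}_k \rVert^2 = r_k$ for all $k$, and subtracting the equation for $k = d+1$ from each of the others removes the quadratic term:
    \begin{align*}
        2 (\tilde{\vz}_k - \tilde{\vz}_{d+1})^\top \vx = \lVert \tilde{\vz}_k \rVert^2 - \lVert \tilde{\vz}_{d+1} \rVert^2 - r_k + r_{d+1}, \qquad k = 1, 2, \dots, d.
    \end{align*}
    This is a linear system of $d$ equations in $\vx \in \R^d$ whose coefficient matrix has rows $(\tilde{\vz}_k - \tilde{\vz}_{d+1})^\top$. Under (\ref{spheres-rank}) this matrix is invertible, so the system has a unique solution, which must be $\vx_t$ because $\vx_t$ lies on every sphere.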
\end{proof}
