% 1) what are similarity choice models?
% 2) where are they used?
% 3) how metric-based similarity choice models follow IIA
% 4) how people usually test for IIA and build models ard them, and why they do not apply here
% 5) what Tversky did in Features of Similarity 1977 paper
% 6) what we do: define two tests, one classical and one Bayesian
% 7) the two datasets we have curated; the result we get on these datasets
% 8) we go one step forward and eliminate pop homogeneity as a cause of IIA violations
% 9) conclude the intro, highlighting scope for richer choice models

Discrete choice models provide a probabilistic framework for reasoning about how humans make choices when presented with a set of alternatives \citep{train2009discrete}. They are widely used in many domains, such as transportation \citep{mcfadden1974measurement} and recommender systems \citep{rendle2009bpr}. In this paper, we focus on a specific class of discrete choices: similarity judgements. The simplest example of this class is the triplet comparison: ``with respect to an apple, which is more similar: pear or orange?'' More generally, a \textit{similarity choice question} asks a user to select from a \textit{choice-set} the item that is most similar to a given \textit{target} item. Similarity choice data differs significantly from other choice data because of the dependency on the target. Indeed, in the above example, replacing the target apple by grapefruit might significantly change the choice distribution between pear and orange.

A key application of similarity choice data is \textit{ordinal embedding}, where the goal is to learn or refine item embeddings from ordinal comparisons \citep{vankadara2023insights}. A good embedding reflects human similarity judgements through inter-point distances. Many embedding methods fit a similarity choice model to datasets such as that of \citet{wilber2014cost}. Ordinal embedding is particularly valuable when item metadata fails to capture user-perceived similarity. For instance, \citet{magnolfi2025triplet} show that such embeddings help predict consumer demand for breakfast cereals. A second use-case arises in interactive search, where a user provides a rough textual description of a latent target and is iteratively shown item sets to refine their preferences \citep{Biswas2019, chumbalov2020}. While the target is implicit (in the user's mind), each selection is still a similarity choice. In both settings, the effectiveness of algorithms rests on the ability of the underlying similarity choice model to faithfully capture human judgements.

Similarity choice models assign a probability distribution over items in a choice-set given a target. Two popular models, Crowd Kernel Learning (CKL) \citep{tamuz2011adaptively} and t-Stochastic Triplet Embedding (t-STE) \citep{maaten2012stochastic}, represent items as points in $\mathbb{R}^d$ and define similarity as a decreasing function of Euclidean distance. Given a choice-set $C$ and target $t$, the probability that item $i \in C$ is selected is proportional to its similarity to $t$. This simple structure makes these models easy to learn and interpret, which explains their popularity. Yet it is precisely this structure that causes both models to satisfy the independence of irrelevant alternatives (IIA) property \citep{luce1959individual}. Informally, IIA asserts that the relative odds of choosing between any two items $i$ and $j$ remain unchanged regardless of the presence of other items in the choice-set. The IIA property is equivalent to assuming that choices are dictated purely by item-specific scores; in the case of similarity choice models, this score is a measure of the item-target similarity (see Section~\ref{sec:models_methods} for more details).
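
To make the link between this structure and IIA concrete, the following sketch uses notation that is assumed here purely for illustration: $x_i \in \mathbb{R}^d$ denotes the embedding of item $i$, and $s(\cdot,\cdot) > 0$ stands for any similarity that decreases with Euclidean distance (the specific kernels used by CKL and t-STE are not spelled out here). Under the proportionality rule described above, the choice probability takes the form
\[
  P(i \mid t, C) \;=\; \frac{s(x_i, x_t)}{\sum_{k \in C} s(x_k, x_t)}, \qquad i \in C,
\]
so that, for any two items $i, j \in C$,
\[
  \frac{P(i \mid t, C)}{P(j \mid t, C)} \;=\; \frac{s(x_i, x_t)}{s(x_j, x_t)}.
\]
The normalizing sum cancels, and the ratio depends only on $i$, $j$ and the target $t$, not on the remaining items of $C$. This is the IIA property in the similarity setting.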

In this work, we are motivated by the broad question of whether it is possible to design newer similarity choice models that are better than the current state-of-the-art models \citep{tamuz2011adaptively, maaten2012stochastic}. Such a model, while continuing to be easy to learn, should better reflect human judgements of similarity than current models. It should ultimately lead to better outcomes for tasks such as ordinal embedding and interactive search. Broadly, there are two main directions to generalize existing models. The first is to keep the property that choice probabilities are proportional to some similarity measure (and consequently IIA is obeyed), but work with more flexible distance/similarity metrics than Euclidean spaces allow. The second is to consider models that include \textit{context effects}, where the choice set of items influences the perception of similarity; such a model would not obey IIA. An important step, therefore, is to test whether the IIA property indeed holds in real similarity choice data.

In the literature, testing for IIA is a well-studied topic \citep{Cheng2007, seshadri2019fundamental}. Nearly all such studies frame the problem as a hypothesis test with the null hypothesis being that the data satisfies IIA, \emph{i.e.}, it is plausibly generated from a model that satisfies IIA. This hypothesis is rejected only if there is sufficient evidence to the contrary. In addition to these tests, many choice models that violate IIA have been proposed, both in the psychology literature \citep{tversky1972elimination, tversky1993context} and in the machine learning literature \citep{seshadri2019discovering, tomlinson2021learning}. A particularly popular model that violates IIA is the mixed MNL model \citep{train2009discrete}.

Measuring IIA violations in similarity choice data poses some challenges that do not arise in the corresponding task with preference choice data. First, unlike for preference choices, we do not (yet) have any candidate similarity choice models that account for context effects. Thus, we cannot perform a likelihood ratio test of the form used in \citet{seshadri2019discovering}. Second, taking existing hypothesis testing methods off-the-shelf would require splitting the data into different buckets according to the targets and testing for IIA separately on each bucket. Not only would this yield a large number of test statistics, but the statistical power of each test would also be greatly diminished by partitioning the dataset.

The only known work critiquing the IIA assumption in the context of similarity choice data is by \citet{tversky1977features}. In this seminal work, Tversky gathers responses to a survey of handcrafted similarity choice question pairs, where the two questions in a pair differ only in one item in the choice set. \citet{tversky1977features} shows that the survey answers indicate statistically significant deviation from IIA. Moreover, these deviations can be explained in terms of `context effects', \emph{i.e.}, the changing influence of item features based on their prevalence in the context set. However, \citet{tversky1977features} does not propose a probabilistic similarity choice model, let alone a learnable one. Moreover, the experiments on handcrafted queries shed no light on the prevalence of context effects in questions composed of random items. Indeed, learning similarity choice models would typically take place through such random data \citep{wilber2014cost}. Finally, his tests are not suitable for measuring the prevalence of IIA violations on such a dataset. Our work aims to address these gaps in the literature. To this end, we make two significant contributions: a new method for testing for IIA, and a dataset suitable for applying such a test.

Our proposed tests for IIA in similarity choice models can be viewed as \textit{goodness of fit} tests~\citep{Lehmann2022}, where the null hypothesis is that the data obeys IIA. Within this framework, we first design a classical $\chi^2$ test, which is commonly used for categorical data. We then adapt this to a Bayesian setting, using the well-established Posterior Predictive Check (PPC) framework~\citep{gelman2013philosophy}. Both tests yield a single $p$-value that indicates how confidently we can reject the null hypothesis (that IIA holds) for any given dataset. We provide more details of these methods in Section~\ref{sec:models_methods}. We test both methods on synthetic data in Section~\ref{sec:synthetic}, where we find that both tests have similar power. The main advantage of the Bayesian setting is the added flexibility and interpretability it provides, which we highlight below.

We apply these tests to two datasets, both collected through surveys designed by us on the \href{https://www.prolific.com}{\texttt{Prolific}} website. Both surveys work with a set of one hundred food items chosen from the CROCUFID dataset \citep{CROCUFID}. The two surveys differ primarily in the manner in which the questions were crafted. While one dataset had questions formed by choosing targets and choice set items at random, the other was carefully crafted to highlight context effects, similar to \citet{tversky1977features}. Notably, both datasets have the same universe of items. Each survey question was answered by multiple participants, allowing us to calculate the response statistics of each option. Applying both of the aforementioned tests, we show that there is strong evidence to suggest that \textit{IIA does not hold in these similarity choice datasets}. Similar experiments on synthetic data improve the interpretability of our results. See Section~\ref{sec:experiments} for more details.

Beyond establishing that IIA is violated in similarity choice data, we extend our analysis in two directions, both of which rest on the Bayesian model we develop for the PPC test. First, we estimate a parameter that quantifies the extent to which a dataset deviates from IIA. We find that the deviation in the random dataset is nearly as strong as in the handcrafted dataset. Second, we design a test to check whether the survey respondents in our dataset can be viewed as a single homogeneous population. A mixture of populations, each satisfying IIA, can lead to data that does not obey IIA (see example in Appendix~\ref{app:heterogeneity}). By showing that our survey respondents are indeed homogeneous, we eliminate a potential confounding factor for IIA violations. Put together, our results strongly suggest that a similarity choice model expressing context effects can outperform current baselines when trained on such data. This remains an important direction of future work. In this work, we show the flexibility of Bayesian models in the context of testing for IIA in similarity choice models. The code and data are hosted on GitHub\footnote{\url{https://github.com/correahs/similarity-uai-2025}}.
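
As a small numerical illustration of why population heterogeneity must be ruled out, consider the following hypothetical example (the scores are chosen for arithmetic convenience and are not taken from our surveys or from the appendix). Fix a target and three items $a$, $b$, $c$, and suppose the respondents consist of two equally sized sub-populations, each satisfying IIA, with scores $(s_a, s_b, s_c) = (2, 1, 1)$ and $(1, 2, 5)$ respectively. On the choice-set $\{a, b\}$, the pooled choice probabilities are
\[
  P(a) = \frac{1}{2}\left(\frac{2}{3} + \frac{1}{3}\right) = \frac{1}{2},
  \qquad
  P(b) = \frac{1}{2}\left(\frac{1}{3} + \frac{2}{3}\right) = \frac{1}{2},
\]
so the pooled odds of $a$ against $b$ are $1$. On the choice-set $\{a, b, c\}$, they become
\[
  P(a) = \frac{1}{2}\left(\frac{2}{4} + \frac{1}{8}\right) = \frac{5}{16},
  \qquad
  P(b) = \frac{1}{2}\left(\frac{1}{4} + \frac{2}{8}\right) = \frac{4}{16},
\]
so the pooled odds of $a$ against $b$ are $5/4$. Each sub-population obeys IIA, yet adding $c$ changes the odds in the pooled data, which is why a homogeneity check is needed before attributing IIA violations to context effects.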

% Discrete choice models are probabilistic models of how humans make choices when presented with a set of alternatives \citep{train2009discrete}. These models are useful to make predictions in a variety of settings, ranging from transportation \citep{mcfadden1974measurement} to recommender systems \citep{rendle2009bpr}. The most commonly discrete choice model is the Bradley-Terry-Luce (BTL) choice model \citep{luce1959individual} which posits that each item $i$ in the universe of items has a latent scalar utility $u_i$. When presented with a choice-set of items $C$, item $i \in C$ is chosen with probability proportional to $\exp(u_i)$. 

% BTL has been widely adopted because its parameters are easy to interpret and to learn from data \citep{maystre2015fast}. This simplicity stems from the independence of irrelevant alternatives (IIA) property, which is baked into the model: for any two choice-sets, the odds between choosing two items $i$ and $j$ do not depend on the other (irrelevant) items in the choice-sets \citep{luce1959individual}. However, in several scenarios, it has been demonstrated that real data does not obey the IIA property \citep{tversky1972elimination, tversky1993context,Cheng2007}. In such scenarios, the BTL model yields poor predictions. These observations have prompted researchers to develop richer choice models that better explain real choice data~\citep{seshadri2019discovering}.

% In this paper, we are interested in the discrete choice setting pertaining to similarity judgements. The simplest example of a similarity choice question is the following triplet comparison: ``with respect to an apple, what is more similar: pear or orange?'' More generally, a \textit{similarity choice question} involves choosing an item from a \textit{choice-set} that is most similar to a certain \textit{target} item. The study of similarity choice data is fundamentally different from classic choice data because of the influence of the target. Indeed, in the above example, the odds between the items in the choice-set would change drastically if the target \textit{apple} is replaced by \textit{grapefruit}.  

% The prototypical use-case of similarity choice data is in the task of ordinal embedding, where the goal is to learn (or refine) embeddings of items from purely ordinal data \citep{vankadara2023insights}. A good embedding is one where the distance between points reflects human similarity judgments. Most ordinal embedding algorithms fit a similarity choice model to a corpus of similarity choice data, such as the dataset of \citet{wilber2014cost}. Ordinal embedding is a useful technique in settings where item metadata does not correspond very well with its user-perceived notion of similarity. For example, \citet{magnolfi2025triplet} present a study where embeddings learned via ordinal data was useful in predicting demand for breakfast cereals. A second use-case of similarity choice data arises in the context of recommender systems and information retrieval; specifically, in the context of interactive search \cite{Biswas2019, chumbalov2020}. Here, a user provides a rough textual description of its latent target and is shown a small set of items; selecting the item closest to their latent target refines the next choice set presented by the system. While the target is implicit, the user-system interaction is still a similarity choice task.

% A similarity choice model specifies a probability distribution for the choice for any target and choice-set combination. Two popular similarity choice models are crowd kernel learning (CKL) \citep{tamuz2011adaptively} and t-stochastic triplet embedding (t-STE) \citep{maaten2012stochastic}. They both represent items as points in $\mathbb{R}^d$ and define a similarity metric between two items as a decreasing function of their Euclidean distance. Given a choice-set $C$ and a target $t$, the probability that an item $i \in C$ is chosen as most similar to the target is taken to be proportional to its similarity metric. Thus, both CKL and t-STE are similarity choice models that satisfy the IIA property. 

% This work provides a systematic study to explore whether the IIA holds in similarity choice data. The study is posed as a goodness of fit test \citep{Lehmann2022}: can similarity choice models with IIA fit empirical data well? We use two tests for this task: a classical $\chi^2$ test and a closely related Bayesian test based on Posterior Predictive Checks~\citep{gelman2013philosophy} (Section \ref{sec:models_methods}). First, we demonstrate the soundness of these methods by applying them to simulated data; we observe that both tests improve their rejection of the IIA hypothesis (i.e., $p$-value) as the strength of IIA violations in the data increases (Section \ref{sec:synthetic}). These methods are then applied to two human similarity comparisons datasets collected in this work through carefully designed surveys. While both datasets have the same universe of items, similarity questions in the first were handcrafted by us and completely randomized in the second. Results show statistically significant evidence of IIA being violated in both datasets, as well as data being better represented by a simple perturbation model that violates IIA (Section \ref{sec:experiments}). 

% It is known that population mixtures can lead to IIA violations in models and data ~\citep{train2009discrete}. As a final step, population homogeneity is statistically tested in the surveys. Apart from one participant, all other are within statistical error of a homogeneous population (Section \ref{sec:pop_homogeneity}). This gives stronger evidence that the IIA violations stem from \textit{context effects}: the influence that the choice-set has on the relative odds of its items.

% The strongest and seminal evidence of IIA violation in similarity choice data in the literature is provided by \cite{tversky1977features}. His empirical work consists of carefully crafted similarity questions that no only violated IIA, but also allowed him to explain the outcome (at a qualitative level). Our work demonstrates IIA violations occur not just for carefully crafted similarity questions, but also for randomly generated questions. Given the applications of similarity choice models highlighted above, many of which do not (yet) account for context effects, we believe our work will motivate the development of richer similarity choice models that can accommodate context effects (and IIA violations), a clear and important line for future work (see discussion in Section~\ref{sec:summary}).
