% \documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
 % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{amsfonts}
\usepackage{amsmath}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{float}
\usepackage{bm}
\usepackage{graphicx}
\usepackage{tabularx}
\usepackage{wrapfig}
\usepackage{enumitem}

% math defns
% 	vector notation
\newcommand{\vct}[1]{\boldsymbol{#1}}
%   matrices
\newcommand{\mtx}[1]{\boldsymbol{#1}}

\newcommand{\vx}{\vct{x}}
\newcommand{\vz}{\vct{z}}

\newcommand{\mG}{\mtx{G}}
\newcommand{\mZ}{\mtx{Z}}

\newcommand{\E}{\operatorname{\mathbb{E}}}


%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams


\author{%
  Namrata Nadagouda \qquad Austin Xu \qquad Mark A. Davenport
  ~\\
  School of Electrical and Computer Engineering\\
  Georgia Institute of Technology\\
  Atlanta, Georgia, USA 
}

\title{Active Metric Learning and Classification using Similarity Queries}

\begin{document}

\maketitle

%abstract
\begin{abstract}
Active learning is commonly used to train label-efficient models by adaptively selecting the most informative queries. However, most active learning strategies are designed either to learn a representation of the data (e.g., embedding or metric learning) or to perform well on a task (e.g., classification) on the data. Yet many machine learning tasks involve a combination of both representation learning and a task-specific goal. Motivated by this, we propose a novel unified query framework that can be applied to any problem in which a key component is learning a representation of the data that reflects similarity. Our approach builds on similarity or nearest neighbor (NN) queries, which seek to select samples that result in improved embeddings. The queries consist of a reference and a set of objects, with an oracle selecting the object most similar (i.e., nearest) to the reference. In order to reduce the number of solicited queries, they are chosen adaptively according to an information-theoretic criterion. We demonstrate the effectiveness of the proposed strategy on two tasks -- active metric learning and active classification -- using a variety of synthetic and real-world datasets. In particular, we demonstrate that actively selected NN queries outperform recently developed active triplet selection methods in a deep metric learning setting. Further, we show that in classification, actively selecting class labels can be reformulated as a process of selecting the most informative NN query, allowing direct application of our method.
\end{abstract}

\section{Introduction}\label{sec:intro}

\begin{figure*}[t]
\centering
\begin{minipage}{\textwidth}
    \centering
    \includegraphics[scale=0.12]{figures/Info-NN-Figure.png}
    \caption{Visualization of the unified query solicitation framework with an example query. Candidate NN queries to be evaluated by the active NN query selection method are formed based on the setting (metric learning or classification). The oracle then responds to the most informative of these queries. In the case of metric learning, the response is utilized to place the reference closer to the similar item while for classification, the response is equivalent to a label corresponding to the reference (e.g., cake in this case). }
    \label{fig:unified_framework}
\end{minipage}
\end{figure*}

A defining feature of modern machine learning is a reliance on large volumes of human-labeled data. Perhaps the most prominent example is the existence of massive hand-labeled image datasets, but the task of acquiring large amounts of human-provided data is nearly ubiquitous in machine learning. However, such data is not free; it is often tedious and expensive to gather a sufficient number of query responses to satisfy data-hungry machine learning models.

Active learning (AL) \citep{settles2009active} seeks to mitigate this issue by carefully selecting only the most informative samples to be labeled. More generally, AL attempts to identify the most informative queries to pose to an oracle. These queries can include asking for a class label or rating, or more general relational queries such as the similarity (or dissimilarity) of different items. In this paper, we focus on metric learning from perceptual similarity queries and classification, two prominent application areas for AL, and show that despite the different queries being posed to the oracle (labels in classification vs.\ similarity judgments for metric learning), there is a fundamental connection between the two problems.

Learning an embedding or representation of the data that accurately reflects similarity between items is the goal of metric learning. Many approaches in metric learning aim to make intra-class item distances small and inter-class item distances large by using triplets of items consisting of an anchor point, a positive sample of the same class as the anchor, and a negative point of a different class \citep{hoffer2015deep}. Class labels are used as a proxy for item similarity/dissimilarity, which is only feasible if class labels are widely available. However, when given a new (unlabeled) dataset, we cannot apply this approach without manually labeling large amounts of data, and it is far from clear that class labels are the most effective mechanism for learning about similarity. We focus on one way to avoid this issue, which is to directly query an oracle for perceptual similarity information, as is done in \citet{kumari2020batch}, where triplets of the form ``Is item B or item C more similar to item A?'' are actively selected for learning an embedding of items. Active deep metric learning (DML) builds on this idea by finding the most informative queries to ask the oracle.

While seemingly dissimilar from metric learning, contemporary classification relies on models (e.g., neural networks) with the ability to learn good representations of the data from training data. Active classification focuses on how to best solicit labels for unlabeled data points, with many modern approaches either implicitly or explicitly relying on representations learned by the model to determine the most informative label. Methods that use metrics based on the predicted class probabilities, such as uncertainty \citep{gal2017deep} or consistency \citep{gao2019consistency}, implicitly rely on such representations, whereas core-set based approaches \citep{sener2017active, pinsler2019bayesian} directly use learned representations to select diverse samples. Thus, if we seek the most informative labels with respect to improving the learned representation of our classification model, the goals of active classification and active metric learning are aligned. Despite these commonalities and virtually identical learning frameworks for the two problems, to the best of our knowledge, there is no approach for query selection that is problem agnostic. In this paper, we present a unified framework, which is made feasible by a novel type of similarity query that applies to both DML and classification.

Specifically, we consider the \emph{nearest neighbor} (NN) query, which, given a reference data point $r$, asks an \emph{oracle} (e.g., a human expert) to select the most similar point from among a set of $M$ alternatives $t^1, t^2, \dots, t^M$. We call this a length-$M$ NN query. With the goal of minimizing the required number of queries, we adapt an \emph{active} query selection strategy to this query type. We take an information-theoretic approach and estimate the gain in \emph{mutual information} (conditioned on previous query responses) as the criterion for selecting the most informative query, an approach that we dub \emph{Info-NN}.

To the best of our knowledge, we are the first to study this query type. Similar ideas have been explored before, such as using UI configurations to collect multiple triplets at once \citep{wilber2014cost}, enforcing a class-similarity based quadruplet loss (one anchor, one positive point, two negative points) \citep{chen2017beyond}, and soliciting ranking queries \citep{canal2020active}. Of these approaches, \citet{chen2017beyond} is the most similar, but NN queries are 1) not confined to a particular fixed length, and 2) not restricted to using class information. The first difference allows us to generalize to any classification problem and the second allows us to collect similarity information of items of the same class, or in cases where class labels are not available.

\paragraph{Contributions.} Our main contributions are as follows. 
\begin{enumerate}[leftmargin=*, noitemsep, topsep=0pt]
    \item We propose a novel type of similarity query, called the NN query (Sec.~\ref{sec:methods}).
    \item We re-cast active classification as finding the most informative NN query, which allows us to unify active classification and active DML under one framework. This framework is flexible enough to accommodate \textit{any} active NN query selection method (Secs.~\ref{sec:classification_as_NN} and~\ref{sec:unified_framework}). 
    \item We empirically validate DML and classification performance using our unified framework and novel NN query selection method\footnote{Code available at \url{https://github.com/nnadagouda95/InfoNN}.} (Secs.~\ref{sec:metric_learning} and~\ref{sec:classification}).
\end{enumerate}


\section{Background and Related Work}\label{sec:related}

\paragraph{Metric learning.}
Learning embeddings from similarity-based comparisons has been previously studied in a variety of scenarios \citep{agarwal2007generalized, van2012stochastic, terada2014local, amid2015multiview, kleindessner2017kernel,karaletsos2015bayesian, Veit_2017_CVPR, ma2019robust, ghosh2019landmark}, spanning everything from utilizing non-metric multidimensional scaling (MDS) to accommodating noisy/corrupted triplets to examining deeper connections to kernels. The importance of learning meaningful embeddings is shown in various applications such as face verification \citep{sankaranarayanan2016triplet}, fine-grained classification \citep{Wah_2014_CVPR}, extracting usable information from crowd-sourcing \citep{kajino2012convex}, and even fashion recommendations \citep{vasileva2018learning}. To complement these techniques, active query selection methods have been developed which examine uncertainty \citep{tamuz2011adaptively}, exploit low dimensionality \citep{jamieson2011low}, incorporate auxiliary features \citep{heim2015active}, and utilize Bayesian techniques \citep{lohaus2019uncertainty}. However, all of these methods are designed for non-parametric embedding techniques (e.g., MDS), which cannot easily generate a corresponding embedding given new items.

More recently, deep metric learning (DML) has aimed to overcome these limitations~\citep{kaya2019deep}. DML trains a neural network to learn an embedded representation that respects similarity information. In particular, many triplet-based DML methods assume knowledge of class labels for items, and attempt to minimize intra-class distances while maximizing inter-class distances \citep{hoffer2015deep, ge2018deep, chen2017beyond}. Although class labels may not always be available, very few works consider the case of DML with perceptual similarity queries, especially in an active manner. Recently, active similarity query selection methods for DML that focus on finding batches of non-redundant \textit{triplets} have been proposed \citep{kumari2020batch}; these encourage both informativeness (measured by entropy) and diversity (through a variety of heuristic approaches) within the selected batch. Our method adopts a framework similar to that of \citet{kumari2020batch}, but we utilize mutual information to find informative NN queries.

\paragraph{Classification.}\label{sec:related_active}
Traditionally, active learning has been used with support vector machines and Gaussian processes for image classification \citep{joshi2009multi, tuia2009active, kapoor2007active, houlsby2011bayesian}. More recently, a variety of active methods based on uncertainty \citep{gal2017deep, wang2016cost, kirsch2019batchbald,  song2019combining}, diversity \citep{sener2017active, pinsler2019bayesian, kirsch2019batchbald}, and consistency \citep{gao2019consistency} have been used for training deep neural networks in the supervised and semi-supervised classification settings. In these settings, the goal is to learn a model for predicting the class probabilities on a dataset consisting of points belonging to $C$ classes. We assume access to an initial set of labeled samples and a pool of unlabeled samples.
The samples from the unlabeled pool are iteratively evaluated for informativeness and labeled accordingly. Based on feedback from the oracle, we can learn a model in either the supervised setting (using only the labeled data) or the semi-supervised setting (using all data).

Some active classification approaches \citep{houlsby2011bayesian, kirsch2019batchbald} consider mutual information between the model parameters and the predicted class probabilities to select the most informative samples, while others \citep{sener2017active, pinsler2019bayesian} follow a core-set based approach to select a subset of diverse samples such that the model learned with these samples best approximates the one learned on the entire data. In \citet{sener2017active}, the authors use the features learned by the model to select the samples such that the maximum distance between an unlabeled sample and its nearest labeled sample is minimized. The method in \citet{pinsler2019bayesian} chooses samples such that the model posterior with the selected samples best approximates the posterior with the complete data.

Our method derives inspiration from \citet{houlsby2011bayesian, gal2017deep, kirsch2019batchbald} in using mutual information to evaluate informativeness, but we consider mutual information between the features and the predicted class probabilities computed based on the inter-sample distances in the feature space. Our approach is similar to the work of \citet{sener2017active} in that both use the Euclidean distances of the features learned by the neural network. However, their focus is only on coverage of the entire feature space, whereas we select samples with the goal of improving the learned embedding. Apart from these, there are a few works that focus on active discriminative representation learning. In \citet{zhang2017active}, the authors propose an AL approach for text classification that selects instances containing words which are likely to most affect the embeddings by computing the \textit{expected gradient length} with respect to the embeddings. A multi-armed bandit based method that uses networking data and learned representations for adaptively labeling informative nodes is suggested in \citet{gao2018active} to learn network representations. However, to the best of our knowledge, no framework of active representation learning has been applied to image classification before and none of the above methods propose a generalized querying strategy.


\section{Unified Framework and Active Query Selection}\label{sec:methods}

\begin{figure}
    \centering
    \includegraphics[width=0.35\textwidth]{figures/nnquery.png}
    \caption{Example of an unlabeled $\vz_i$ and the nearest labeled neighbors to $\vz_i$ from each class: $\vz_i^{(1)}, \vz_i^{(2)}, \vz_i^{(3)}, \vz_i^{(4)}$. In this example, we might expect that the most likely label would be $y_i = 4$, which could be interpreted as a nearest neighbor query response (that $\vz_i^{(4)}$ is the nearest neighbor).} 
    \label{fig:nnquery}
\end{figure}

In this section, we provide an overview of our proposed generalized query framework. Specifically, we show that in any classification setting where a latent representation of the data is learned, querying an oracle for a class label can be reformulated as soliciting the oracle's feedback for an NN query, allowing us to draw the connection to metric learning. We also present Info-NN, an active method of selecting NN queries using an information-theoretic criterion.

Formally, an NN query $Q_i = \{r_i\} \cup T_i$ of length $M$ consists of a reference data point $r_i$ and a set of data points $T_i = \{ t_i^{(1)}, t_i^{(2)}, \dots, t_i^{(M)} \}$, from which the oracle picks the point most similar to the reference $r_i$. Let $Y_i \in \{1, 2, \dots, M\}$ be the random variable indicating the oracle's response to the $i$th query. The event $Y_i = m$ indicates that the oracle selected $t_i^{(m)} \in T_i$ as the most similar to the reference $r_i$. A visual example of the NN query can be found in Fig.~\ref{fig:unified_framework}.

\subsection{Classification as an NN Query Selection Problem}\label{sec:classification_as_NN}

We approach AL for classification chiefly as a problem of selecting labels that will improve the feature representation, as most modern classification techniques (e.g., neural networks) can be interpreted as learning an embedding that enables simple linear classifiers to be effective. We do this via an analogy in which obtaining the class label for an unlabeled sample is equivalent to a particular NN query response.
   
Consider a dataset $\mathcal{X} = \{\vx_i\}_{i=1}^N$ consisting of points belonging to $C$ classes, with labels $y_i \in \{1, 2, \dots, C\}$ for $i = 1, \dots, N$. We assume access to an initial labeled set $\mathcal{L} = \{(\vx_i, y_i)\}_{i=1}^j$ and an unlabeled set $\mathcal{U} = \{\vx_i\}_{i=j+1}^N$. Let $\mZ = \{\vz_i\}_{i=1}^N$ represent initial estimates of the embeddings for the dataset according to a model learned on an initial set of labeled samples. Now suppose we want to choose a new point $\vx_{j+1}$ from $\mathcal{U}$ whose label $y_{j+1}$ we will obtain. For any $\vx_i$ in $\mathcal{U}$, consider its embedding $\vz_i$ in the feature space and the nearest neighbor to $\vz_i$ from $\mathcal{L}$ for each class, i.e., 
\[
\vz_i^{(c)} = \mathop{\arg \min}_{\vz_\ell \in \mathcal{L}_c} \|\vz_{\ell} -  \vz_i \|_2,
\]
where $\mathcal{L}_c = \{ \vz_{\ell} : (\vx_{\ell},y_{\ell}) \in \mathcal{L}, y_{\ell} = c\}$.  An example of an unlabeled $\vz_i$ and the nearest labeled neighbors to $\vz_i$ from each class is illustrated in Fig.~\ref{fig:nnquery}. 

Note that if the embedding that we have learned does a reasonable job of representing similarity (as it pertains to the task of classification), then we would expect that the most likely label for $\vz_i$ would correspond to the class $c$ for which $\vz_i^{(c)}$ is closest to $\vz_i$. Thus, we can interpret the label $y_i$ as a response to a length $C$ nearest neighbor query in which $\vz_i$ is the reference to which $\vz_i^{(1)}, \vz_i^{(2)}, \ldots, \vz_i^{(C)}$ are compared. (For computational reasons, one may choose to not use all $C$ nearest neighbors in practice.) Because this NN query response reveals information about the relative locations of items in the learned representation, retraining the classification model with the new oracle response should improve the representation. \textbf{This is the key idea behind our approach: select NN queries (or equivalently, points to label) that result in the best improvement of the embedding.} 
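\paragraph{Illustrative example.} As a hypothetical instance of this correspondence (the numbers below are chosen purely for illustration and are not taken from the experiments), take $C = 3$ and an unlabeled point $\vx_i$ whose embedding $\vz_i$ lies at Euclidean distances
\[
\|\vz_i^{(1)} - \vz_i\|_2 = 0.4, \qquad \|\vz_i^{(2)} - \vz_i\|_2 = 1.1, \qquad \|\vz_i^{(3)} - \vz_i\|_2 = 0.9
\]
from its nearest labeled neighbors in each class. The candidate query uses $\vz_i$ as the reference and $\vz_i^{(1)}, \vz_i^{(2)}, \vz_i^{(3)}$ as the items to compare. An oracle label $y_i = 1$ is then read as the response ``$\vz_i^{(1)}$ is nearest,'' which agrees with the current embedding. A label $y_i = 3$ would instead be read as ``$\vz_i^{(3)}$ is nearest,'' signaling that the current embedding places $\vz_i$ too far from class $3$ relative to class $1$.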

\subsection{Unified Framework for Active Classification and Metric Learning}\label{sec:unified_framework}
This view of active classification gives rise to a unified framework which can be used in either active classification or active DML: from a pool of candidate NN queries, choose the most informative query to ask the oracle, then re-train the model to incorporate the newly acquired query response. \textbf{Despite each problem seemingly requiring fundamentally different oracle responses (similarity information vs.\ labels), both problems can be tackled utilizing NN queries, and thus, the same active query selection strategy.} The main difference is the pool of candidate queries. In active DML, we can query the oracle for similarity information about any set of items, whereas in active classification, the pool of candidate NN queries is restricted to queries that contain one item from every class. This pool of candidate queries is formed by setting every $\vz_i$ corresponding to an $\vx_i \in \mathcal{U}$ as the reference point, and finding (up to) $C$ nearest neighbors of differing classes. \textbf{A critical feature of this unified framework is that it does not depend on which measure of ``informativeness'' is used. This allows a practitioner to plug in their desired active query selection criterion without making any modifications to the framework}, as depicted in Fig.~\ref{fig:unified_framework}. In our experiments, we select the queries that maximize \textit{mutual information} for both active DML and classification experiments. In particular, we utilize two methods for computing mutual information, including a novel approach dubbed Info-NN.

\subsection{Active Query Selection via Info-NN}\label{sec:info-nn}

\begin{algorithm}[t]
\caption{Info-NN-embedding}
\label{alg:info-nn-emb} 
\textbf{Input:} Embedding $\mZ$, candidate queries $Q$, num. samples $n_s$, variance $\sigma^2$
\begin{algorithmic}
    \STATE $I \leftarrow $ empty list of size $|Q|$ (Mutual information values for candidate queries)
    \STATE $p_{k}, H_{k} \leftarrow $ lists of size $|Q|$, initialized to zero
    \FOR{$n = 1$ \textbf{to} $n_s$}
    \STATE $\tilde{\mZ}_n \leftarrow \mZ + \mG$, elements of $\mG$ drawn i.i.d.\ from $\mathcal{N}(0, \sigma^2)$.
        \FOR{$Q_i \in Q$}
            \STATE $r_i \leftarrow $ first element of $Q_i$
            \STATE $T_i \leftarrow Q_i \backslash \{r_i\}$
            \STATE $D_{Q_i} \leftarrow$ distance of every item in $T_i$ to $r_i$ in $\tilde{\mZ}_n$
            \STATE $Y_i \leftarrow$ query response using $D_{Q_i}$
            \STATE $p_{k}[Q_i] \leftarrow p_{k}[Q_i] + p(Y_i | D_{Q_i})$ (cumulative probability)
            \STATE $H_{k}[Q_i] \leftarrow H_{k}[Q_i] + H[p(Y_i | D_{Q_i})]$ (cumulative entropy)
        \ENDFOR
    \ENDFOR\\
    \FOR {${Q_i} \in Q$}
        \STATE $I[Q_i] \leftarrow H \left[ \frac{p_{k}[Q_i]}{n_s} \right] - \frac{H_{k}[Q_i]}{n_s}$
    \ENDFOR
\end{algorithmic}
\textbf{Output:} $I$
\end{algorithm}


The main idea behind our selection strategy is to select queries that are maximally informative about the embedding while avoiding ones that do not provide new information. This goal is achieved by using mutual information between the embedding and a query as a measure of the informativeness of the query.
Let $y^{i-1} = \{ y_1, y_2, \dots, y_{i-1} \}$ denote the set of all responses (true labels for the selected samples) obtained after $i-1$ queries, and let $Y_i$ denote the random variable corresponding to the oracle's response to query $Q_i$. The Plackett--Luce (PL) model \citep{turner2018introduction}, which is an extension of the triplet model commonly used with similarity comparisons \citep{tamuz2011adaptively}, is used to model the response. For an NN query of length $M$, the probability of $t_i^{(m)}$ being the nearest or the most similar point to $r_i$ is modeled as 
\begin{equation}\label{eq:prob_model}
    p(Y_i = m ~|~ D_{Q_i}) = \frac{ (D^2_{im} + \mu)^{-1}}{\sum_{j=1}^M (D^2_{ij} + \mu)^{-1}},
\end{equation}
where $D_{Q_i} \coloneqq \{ D_{im} ~:~ t_i^{(m)} \in T_i \}$, $D_{im}$ denotes the distance between the embeddings of $r_i$ and $t_i^{(m)}$, and $\mu$ is a parameter set by the user.  This model captures uncertainty in the oracle responses as well as uncertainty in our current estimate of the embedding (and hence distances). The parameter $\mu$ is indicative of our confidence in the distances. Note that even though we use this model in our query selection strategy, we do \emph{not}  require that query responses are generated according to the PL model.
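\paragraph{Worked example.} To illustrate~\eqref{eq:prob_model}, consider a hypothetical length-$3$ query with distances $D_{i1} = 1$, $D_{i2} = 2$, $D_{i3} = 3$ and $\mu = 1$ (values chosen only for illustration). Then $D^2_{im} + \mu$ equals $2$, $5$, and $10$, respectively, so that
\begin{align*}
p(Y_i = 1 ~|~ D_{Q_i}) &= \frac{0.5}{0.5 + 0.2 + 0.1} = 0.625, \\
p(Y_i = 2 ~|~ D_{Q_i}) &= \frac{0.2}{0.8} = 0.25, \\
p(Y_i = 3 ~|~ D_{Q_i}) &= \frac{0.1}{0.8} = 0.125.
\end{align*}
Repeating the calculation with $\mu = 10$ gives approximately $(0.423, 0.332, 0.245)$: in this formula, a larger $\mu$ flattens the response distribution, which corresponds to placing less confidence in the estimated distances.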


Now consider the mutual information between the embedding $\mZ$ and the response $Y_i$:
\begin{equation}\label{eq:MI_1}
I(\mZ ; Y_i   \: | \:   y^{i-1}) = H[\mZ   \: | \:   y^{i-1}] - \underset{Y_i}{\E} (H[\mZ   \: | \:   Y_i, y^{i-1}]).
\end{equation}
This quantity measures how much information the response to query $Q_i$ would provide about the embedding, conditioned on the fact that we have already acquired the responses $y^{i-1}$ to the previous queries. This is exactly what we would like to use to select informative queries, but computing this quantity in the above form is computationally expensive. To compute this in a na\"{i}ve manner we would need to find the estimate of the embedding for every possible response to the query and compute the entropies of these estimates in the high dimensional embedding space. Fortunately, using an approach similar to~\citet{houlsby2011bayesian}, we can use the symmetry of mutual information to re-write~\eqref{eq:MI_1} as
\begin{equation}\label{eq:MI_2}
I(Y_i ;\mZ   \: | \:   y^{i-1} ) = H[Y_i    \: | \:   y^{i-1} ] - \underset{\mZ}{\E} (H[Y_i    \: | \:   \mZ, y^{i-1} ]).
\end{equation}
We can now compute entropies in the response space, which is usually much smaller than the embedding space. This second form of mutual information also provides an interesting insight about the selection strategy. The first term, which denotes the entropy of the predicted response, encourages the selection of queries which are highly uncertain for the current estimate of the embedding. The second term denotes the expected entropy of the responses predicted by the individual samples from the distribution over the embedding estimate and encourages queries for which the individual samples are fairly confident. This simultaneously avoids the acquisition of redundant queries and queries for which the oracle response is likely to be uncertain.
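\paragraph{Illustrative example.} To make the roles of the two terms in~\eqref{eq:MI_2} concrete, consider a hypothetical length-$2$ query and two equally weighted samples of the embedding (the numbers are for illustration only, and entropies are measured in nats, which is an assumption of this example). If the two samples predict the response distributions $(0.9, 0.1)$ and $(0.1, 0.9)$, the averaged prediction is $(0.5, 0.5)$ with entropy $\ln 2 \approx 0.693$, while each individual sample has entropy $-0.9 \ln 0.9 - 0.1 \ln 0.1 \approx 0.325$, so
\[
I(Y_i ; \mZ \: | \: y^{i-1}) \approx 0.693 - 0.325 = 0.368 .
\]
If instead both samples predict $(0.9, 0.1)$, both terms equal approximately $0.325$ and the mutual information is zero, since the response is already predictable and the query is redundant. If both samples predict $(0.5, 0.5)$, both terms equal $\ln 2$ and the mutual information is again zero, since the response is uncertain no matter which embedding is correct.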

\begin{algorithm}[t]
\caption{Info-NN-distances}
\label{alg:info-nn-dist} 
\textbf{Input:} Embedding $\mZ$, candidate queries $Q$, num. samples $n_s$, variance $\sigma^2$
\begin{algorithmic}
    \STATE $I \leftarrow $ empty list of size $|Q|$ (Mutual information values for candidate queries)
    \FOR{$Q_i \in Q$}
    \STATE $r_i \leftarrow $ first element of $Q_i$
    \STATE $T_i \leftarrow Q_i \backslash \{r_i\}$
    \STATE $D_{Q_i} \leftarrow$ distance of every item in $T_i$ to $r_i$ in $\mZ$
    \STATE $Y_i \leftarrow$ query response using $D_{Q_i}$
    \STATE $D_s \leftarrow n_s$ samples drawn from $\mathcal{N}(D_{Q_i}, \sigma^2)$
    \STATE $I_{Q_i} \leftarrow H \left[ \sum\limits_{D \in D_s} \frac{\left( p(Y_i    \: | \:   D) \right)}{n_s} \right] - \sum\limits_{D \in D_s} \frac{H \left[ p(Y_i    \: | \:   D) \right]}{n_s}$
    \STATE $I[Q_i] \leftarrow I_{Q_i}$
    \ENDFOR\\
\end{algorithmic}
\textbf{Output:} $I$
\end{algorithm}


\paragraph{Estimation of mutual information.} Computing the mutual information as in~\eqref{eq:MI_2} requires a probabilistic estimate of the embedding. However, in many learning scenarios, only point estimates are computed. We therefore assume normal distributions and utilize two Monte Carlo sampling-based methods for tractable computation of mutual information. The first method, which we refer to as \emph{Info-NN-embedding}, assumes that the embedding values are normally distributed, with mean equal to the previous estimate of the embedding. With this assumption, we have a tractable means of computing the mutual information. We can further increase computational efficiency by making the stronger assumption that inter-item distances in the embedding are normally distributed, with mean equal to the previous estimates of the distances. We refer to this approach as \emph{Info-NN-distances}. In general, we use \emph{Info-NN-distances} for experiments dealing with real-world data, and \emph{Info-NN-embedding} for synthetic experiments. The two methods are presented in Alg.~\ref{alg:info-nn-emb} and Alg.~\ref{alg:info-nn-dist}, respectively.

To enable efficient computation of mutual information in practice, we make a few more simplifying assumptions. We follow an approach similar to the one presented in \citet{canal2020active} and adapt it to NN queries. For Info-NN-embedding, the computation of $H[Y_i    \: | \:   y^{i-1} ]$ and the corresponding assumptions are described below.
 \begin{align*}
     H[Y_i    \: | \:   y^{i-1} ] &= H[\underset{\mZ}{\E}(p(Y_i | \mZ, y^{i-1})| y^{i-1})]\\
     &= H[\underset{\mZ}{\E}(p(Y_i | \mZ)| y^{i-1})]\tag{I}\\
     &= H[\underset{\mZ_{Q_i}}{\E}(p(Y_i | \mZ_{Q_i})| y^{i-1})]\tag{II}\\
     &= H[\underset{\mZ_{Q_i}}{\E}(p(Y_i | \mZ_{Q_i})| \mZ^{i-1})]\tag{III}\\
     &= H[\underset{\mZ_{Q_i} \sim \mathcal{N}(\mZ_{Q_i}^{i-1}, \sigma^2_{i-1})}{\E}(p(Y_i | \mZ_{Q_i}))]\tag{IV}
 \end{align*}

\begin{enumerate}[label=(\Roman*)]
    \item The response $Y_i$ is independent of past responses $y^{i-1}$ when conditioned on $\mZ$.
    \item The oracle's response, conditioned on $\mZ$, depends only on $\mZ_{Q_i}$ (the embeddings of the items involved in the query) and is independent of the embeddings $\mZ_{s \notin Q_i}$.
    \item $\mZ$ is independent of $y^{i-1}$ given the previous estimate of the embedding $\mZ^{i-1}$.
    \item Conditioned on $\mZ^{i-1}$, the $(a, b)$th entry of $\mZ$, $\mZ_{a,b}$, is distributed normally with mean $\mZ^{i-1}_{a,b}$ and variance $\sigma^2_{i-1}$. We will slightly abuse notation, and write $\mZ \sim \mathcal{N}(\mZ^{i-1}, ~\sigma^2_{i-1})$.
\end{enumerate}

Following a similar process, we have 
\begin{equation*}
    \underset{\mZ}{\E} (H[Y_i    \: | \:   \mZ, y^{i-1} ]) = \underset{\mZ_{Q_i} \sim \mathcal{N}(\mZ_{Q_i}^{i-1}, \sigma^2_{i-1})}{\E}(H[p(Y_i | \mZ_{Q_i})]).
\end{equation*}
We can now utilize Monte Carlo sampling methods for estimating the entropies, as presented in Alg.~\ref{alg:info-nn-emb}.
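\paragraph{Monte Carlo estimator (sketch).} Combining the two expressions above, and writing $\tilde{\mZ}_1, \dots, \tilde{\mZ}_{n_s}$ for the perturbed embeddings drawn in Alg.~\ref{alg:info-nn-emb}, the quantity returned by the algorithm for query $Q_i$ can be summarized as
\begin{equation*}
\hat{I}(Q_i) = H\left[ \frac{1}{n_s} \sum_{n=1}^{n_s} p\big(Y_i \: | \: D_{Q_i}(\tilde{\mZ}_n)\big) \right] - \frac{1}{n_s} \sum_{n=1}^{n_s} H\left[ p\big(Y_i \: | \: D_{Q_i}(\tilde{\mZ}_n)\big) \right],
\end{equation*}
where $p$ is the response model in~\eqref{eq:prob_model}. The symbols $\hat{I}(Q_i)$ and $D_{Q_i}(\tilde{\mZ}_n)$ are shorthand introduced only for this summary: the former denotes the estimate stored in $I[Q_i]$, and the latter the distances between the reference and the items of $T_i$ computed in the $n$th perturbed embedding. Because entropy is concave, the first term is never smaller than the second, so the estimate is nonnegative.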

For Info-NN-distances, we make the same assumptions as above, except for (IV). Instead, we assume that the distances between data points are distributed normally with the mean for each pair set equal to the distance computed from the estimated embedding matrix. This assumption enables an efficient method of estimating the posterior distribution over the distances and makes the computation of mutual information more efficient. Specifically, the entropies in~\eqref{eq:MI_2} can be computed as follows: 
\begin{align}
H[Y_i    \: | \:   y^{i-1} ] &= H \left[ \underset{\mZ}{\E} \left( p(Y_i    \: | \:   \mZ, y^{i-1} )   \: | \:   y^{i-1}  \right) \right] \notag \\ 
&= H \left[ \underset{D_{Q_i} \sim \mathcal{N}_{Q_i}^{i-1}}{\E} \left( p(Y_i    \: | \:   D_{Q_i}) \right) \right]\label{eq:ent_1}
\end{align}
and
\begin{align}
\underset{\mZ}{\E} (H[Y_i    \: | \:   \mZ, y^{i-1} ]) &= \underset{\mZ}{\E} \left( H \left[ p(Y_i    \: | \:   \mZ, y^{i-1} )   \: | \:   y^{i-1}  \right] \right) \notag \\
&= \underset{D_{Q_i} \sim \mathcal{N}_{Q_i}^{i-1}}{\E} \left( H \left[ p(Y_i    \: | \:   D_{Q_i}) \right] \right)\label{eq:ent_2},
\end{align}
where $\mathcal{N}_{Q_i}^{i-1} \coloneqq \mathcal{N}(D_{Q_i}, \sigma^2_{i-1})$, with the mean $D_{Q_i}$ given by the distances computed from the previous estimate of the embedding. Due to this normal distribution assumption, the entropy computations in~\eqref{eq:ent_1} and~\eqref{eq:ent_2} are straightforward calculations. The full procedure is shown in Alg.~\ref{alg:info-nn-dist}.


\section{Experiments}\label{sec:experiments}

\subsection{Deep Metric Learning} \label{sec:metric_learning}

\begin{figure*}[ht!]
    \hfill
    \begin{minipage}{0.27\linewidth}
        \centering
        \includegraphics[width=\textwidth]{figures/dml/synthetic_icml22_info_rand_euclid_cent_tga.png}
        \label{fig:dml_synthetic}
    \end{minipage}
    \hfill
    \begin{minipage}{0.27\linewidth}
        \centering
        \includegraphics[width=\textwidth]{figures/dml/food_info_rand_euclid_cent_tga.png}
        \label{fig:dml_food}
    \end{minipage}
    \hfill
    \begin{minipage}{0.27\linewidth}
        \centering
        \includegraphics[width=\textwidth]{figures/dml/adm_info_rand_euclid_cent_tga.png}
        \label{fig:dml_admissions}
    \end{minipage}
    \hfill
    \vskip 0pt
    \hfill
    \begin{minipage}{0.27\linewidth}
        \centering
        \includegraphics[width=\textwidth]{figures/dml/synthetic_icml22_info_rand_euclid_cent_tga_queries.png}
        \label{fig:dml_synthetic_queries}
    \end{minipage}
        \hfill
    \begin{minipage}{0.27\linewidth}
        \centering
        \includegraphics[width=\textwidth]{figures/dml/food_info_rand_euclid_cent_tga_queries.png}
        \label{fig:dml_food_queries}
    \end{minipage}
    \hfill
    \begin{minipage}{0.27\linewidth}
        \centering
        \includegraphics[width=\textwidth]{figures/dml/adm_info_rand_euclid_cent_tga_queries.png}
        \label{fig:dml_admissions_queries}
    \end{minipage}
    \hfill
    \vskip 0pt
\caption{Per-triplet (top) and per-query (bottom) TGA comparison of Info-NN against active batch triplet methods and random queries on the synthetic (left), food (center), and Graduate Admissions (right) datasets. Info-NN outperforms random and batch methods, and NN queries exhibit far superior per-query performance, requiring fewer interactions with the oracle.}
\label{fig:dml_experiments}
\end{figure*} 

\begin{figure*}[ht!]
\begin{minipage}{\linewidth}
    \centering
    \includegraphics[width=0.75\linewidth]{figures/dml/info_nn_2_vis.png}
\end{minipage}
\caption{Visualization of the food embedding learned using queries selected with Info-NN, generated using t-SNE \citep{maaten2008visualizing}. Similar-tasting foods are generally grouped together, such as vegetables (center) and fruits (top left).}
\label{fig:food_viz}
\end{figure*} 

In this section, we directly query an oracle with NN queries and learn a similarity embedding from query responses using a Deep Metric Learning (DML) framework.

\paragraph{Active embedding framework.}
We utilize a neural network to learn an embedding that matches the oracle's responses to similarity queries. Because a length-$M$ NN query can be decomposed into $M-1$ triplets, we utilize a triplet loss \citep{weinberger2006distance}. We initialize our network with a random batch of triplets, then repeatedly select a batch of $B$ queries, receive oracle responses to the selected queries, add the new queries to the pool of already answered queries, and re-train our network for $100$ epochs using all prior query responses. For each experiment, we select a pool of $20{,}000$ training length-$3$ NN queries and $20{,}000$ testing length-$3$ NN queries from the set of all possible queries (decomposing NN queries into triplets for triplet-based methods).
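
As an illustration of the decomposition, suppose that a length-$M$ query consists of a reference item $h$ and candidates $c_1, \dots, c_M$, and that the oracle returns the index $y$ of the candidate nearest to $h$ (this notation is introduced only for the illustration). The response yields the $M-1$ triplets
\begin{equation*}
\left\{ (h, c_y, c_m) : m \in \{1, \dots, M\}, \; m \neq y \right\},
\end{equation*}
each asserting that $c_y$ is closer to $h$ than $c_m$. A common hinge form of the triplet loss for an embedding $f$ with margin $\alpha > 0$, given here as a sketch rather than as the exact loss used in our experiments, is
\begin{equation*}
\ell(h, c_y, c_m) = \max\left( 0, \; \|f(h) - f(c_y)\|_2^2 - \|f(h) - f(c_m)\|_2^2 + \alpha \right).
\end{equation*}
For $M = 3$, a single response therefore contributes two such terms.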

In scenarios where re-training the network many times is computationally expensive, batch methods that select multiple queries at once are preferable. We compare the performance of Info-NN to recently developed triplet batch methods \citep{kumari2020batch}. While Info-NN can identify informative queries, a batch made up solely of the most informative queries at a given iteration may lack diversity, as the most informative queries often cover the same areas of the space. Therefore, we utilize a very simple batch extension for DML experiments. For a batch of $B$ queries, we select the top $B^\prime \leq B$ most informative queries, then select $B - B^\prime$ queries uniformly at random from the query pool. We show that simply augmenting randomly selected queries with a set of the most informative queries can outperform methods designed specifically for batch query selection.
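
Written out, with $I(Q)$ denoting the estimated information gain of query $Q$ and $\mathcal{P}$ the pool of unanswered queries (both symbols are introduced here for illustration), the batch is
\begin{equation*}
\mathcal{B} = \mathcal{B}_{\text{info}} \cup \mathcal{B}_{\text{rand}}, \qquad \mathcal{B}_{\text{info}} = \underset{\mathcal{S} \subseteq \mathcal{P}, \, |\mathcal{S}| = B^\prime}{\arg\max} \sum_{Q \in \mathcal{S}} I(Q),
\end{equation*}
where $\mathcal{B}_{\text{rand}}$ contains $B - B^\prime$ queries drawn uniformly at random; for this sketch we assume they are drawn without replacement from $\mathcal{P} \setminus \mathcal{B}_{\text{info}}$. For instance, a hypothetical split of $B = 30$ into $B^\prime = 10$ and $B - B^\prime = 20$ would pair the ten highest-scoring queries with twenty random ones, and setting $B^\prime = 0$ recovers purely random selection.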

In our experiments, \emph{Info-NN-M} means that the batch variant of Info-NN described above was used to select NN queries of length $M$, while \emph{Batch-Euclidean} and \emph{Batch-Centroid} denote the methods proposed in \citet{kumari2020batch}. Finally, \emph{Random} means that queries of the given type (NN or triplet) were selected uniformly at random from the training set. Precise experimental details can be found in the appendix.

\begin{figure*}[ht!]
    \hfill
    \begin{minipage}{0.27\linewidth}
        \centering
        \includegraphics[width=\textwidth]{figures/dml/adm_info_rand_euclid_cent_recall1.png}
        \label{fig:dml_adm_recall}
    \end{minipage}
    \hfill
    \begin{minipage}{0.27\linewidth}
        \centering
        \includegraphics[width=\textwidth]{figures/dml/adm_info_rand_euclid_cent_topk1nn.png}
        \label{fig:dml_adm_topk}
    \end{minipage}
    \hfill
    \begin{minipage}{0.38\linewidth}
        \centering
        \includegraphics[width=0.75\textwidth]{figures/dml/adm_info_nn_vis.png}
        \label{fig:dml_adm_info_nn_vis}
    \end{minipage}
    \hfill
    \vskip 0pt
    \hfill
    \begin{minipage}{0.27\linewidth}
        \centering
        \includegraphics[width=\textwidth]{figures/dml/adm_info_rand_euclid_cent_recall1_queries.png}
        \label{fig:dml_adm_recall_queries}
    \end{minipage}
    \hfill
    \begin{minipage}{0.27\linewidth}
        \centering
        \includegraphics[width=\textwidth]{figures/dml/adm_info_rand_euclid_cent_topk1nn_queries.png}
        \label{fig:dml_adm_topk_queries}
    \end{minipage}
    \hfill
    \begin{minipage}{0.38\linewidth}
        \centering
        \includegraphics[width=0.75\textwidth]{figures/dml/adm_centroid_vis2_nolegend.png}
        \label{fig:dml_adm_cent_vis}
    \end{minipage}
    \hfill
    \vskip 0pt
\caption{Per-triplet (top) and per-query (bottom) comparison of Info-NN against other methods in Recall@$1$ (left) and TopFraction@$21$ (center). NN queries place objects of the same class closer together and group admitted students together, with Info-NN exhibiting the best performance of all methods tested. Visualization of the embeddings learned using Info-NN (right-top) and Batch-Centroid (right-bottom), generated using t-SNE \citep{maaten2008visualizing}. Info-NN groups the more highly rated candidates closer together.}
\label{fig:dml_recall_topk_vis}
\end{figure*}

\paragraph{Datasets and evaluation metrics.}
We test our active embedding technique on a variety of datasets:
\begin{itemize}[leftmargin=*,noitemsep, topsep=0pt]
    \item \textbf{Synthetic Mahalanobis Dataset:} We generate $N = 100$ items of dimension $D = 10$ from a standard normal distribution. The oracle makes perceptual judgments based on a randomly generated Mahalanobis metric. We introduce artificial noise by corrupting 25\% of all training queries to assess the robustness of our embedding method. We collect batches of size $B = 10$. Info-NN-embedding is used in these experiments.
    \item \textbf{Food73 Dataset:} This dataset contains $72{,}148$ crowdsourced triplets gathered for $73$ different food items~\citep{wilber2014cost}. We utilize $6$ L1-normalized features (bitterness, saltiness, savoriness, spiciness, sourness, and sweetness) for each food item and form $1{,}047{,}251$ length-$3$ NN queries from the collected triplets. The collected triplets and, as a result, the formed NN queries contain inconsistencies. We collect batches of size $B = 30$. Info-NN-distances is used in these experiments.
    \item \textbf{Ranked Graduate Admissions Dataset:} We obtained partially ranked lists of 133 Ph.D.\ applicants to the Georgia Tech School of Electrical and Computer Engineering. The top 22 candidates were accepted for admission, with the top 18 candidates individually ranked and the rest of the candidates sorted into 5 tiers of varying sizes. Candidates fall into one of $7$ classes: admitted with fellowship 1, admitted with fellowship 2, admitted without fellowship, and rejected (sorted into 4 tiers). For each candidate, we have access to $25$ features, including GPA, letter-of-recommendation scores, and external fellowship application status. We form triplets among the ranked candidates and between candidates of different tiers, resulting in $434{,}470$ triplets and $21{,}634{,}487$ length-$3$ NN queries, and randomly corrupt 25\% of all queries to assess robustness. We collect batches of size $B = 30$. Info-NN-distances is used in these experiments.
\end{itemize}
To measure the performance of our embedding learning algorithm, we use \textit{triplet generalization accuracy} (TGA), which records the fraction of test triplets whose ordering is consistent with the learned embedding. For the Graduate Admissions Dataset, because we have access to class labels, we also record Recall@$K$. In addition, to get a sense of how the algorithms group the admitted students, we compute TopFraction@$K$, which denotes the fraction of the $K$ nearest neighbors of the top $22$ (admitted) students that are themselves admitted students. Because NN queries can be decomposed into triplets, we compare performance against triplet-based methods on both a \textit{per-triplet} basis and a \textit{per-query} basis (number of queries posed to the oracle). We report the median and the 25\% and 75\% quantiles over $20$ (synthetic), $10$ (food), and $10$ (admissions) trials.
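
For concreteness, two of these metrics can be written as follows; the symbols $f$, $\mathcal{T}$, $\mathcal{A}$, and $\mathrm{NN}_K$ are introduced here for illustration, and Euclidean distance in the embedding space is assumed. Let $f$ denote the learned embedding, $\mathcal{T}$ the set of test triplets $(a, b, c)$ in which $a$ is closer to $b$ than to $c$, and $\mathcal{A}$ the set of admitted students, with $|\mathcal{A}| = 22$. Then
\begin{align*}
\mathrm{TGA} &= \frac{1}{|\mathcal{T}|} \sum_{(a, b, c) \in \mathcal{T}} \mathbf{1}\left[ \|f(a) - f(b)\|_2 < \|f(a) - f(c)\|_2 \right], \\
\text{TopFraction@}K &= \frac{1}{|\mathcal{A}| \, K} \sum_{a \in \mathcal{A}} \left| \mathrm{NN}_K(a) \cap \mathcal{A} \right|,
\end{align*}
where $\mathrm{NN}_K(a)$ is the set of $K$ nearest neighbors of $a$ in the embedding, assumed here to exclude $a$ itself. Under that assumption, $K = 21$ is the largest value for which TopFraction@$K$ can equal $1$, since each admitted student has exactly $21$ admitted peers.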

\begin{figure*}[t]
    \hfill
    \begin{minipage}{0.27\linewidth}
        \centering
        \includegraphics[width=\textwidth]{figures/classification/mnist_baseline_comparison.png}
        \label{fig:mnist_baselines}
    \end{minipage}
    \hfill
    \begin{minipage}{0.27\linewidth}
        \centering
        \includegraphics[width=\textwidth]{figures/classification/cifar10_baseline_comparison.png}
        \label{fig:cifar10_baselines}
    \end{minipage}
    \hfill
    \begin{minipage}{0.27\linewidth}
        \centering
        \includegraphics[width=\textwidth]{figures/classification/svhn_baseline_comparison.png}
        \label{fig:svhn_baselines}
    \end{minipage}
    \hfill
    \vskip 0pt
\caption{Active classification performance comparison on MNIST (left), CIFAR-10 (center) and SVHN (right) datasets.}
\label{fig:classification_experiments}
\end{figure*}




\paragraph{Experiment results.}
As seen in Fig.~\ref{fig:dml_experiments}, both versions of Info-NN outperform recent methods developed specifically for batch query selection on both synthetic and challenging real-world datasets, on both a per-triplet and a per-query basis. \textbf{This also demonstrates the flexibility of the unified framework; multiple active query selection methods can be plugged into the framework with consistently strong performance.} Furthermore, there appears to be minimal performance difference between selecting random NN queries and random triplets on a per-triplet basis, but using NN queries requires far fewer interactions with the oracle. From these experiments, it appears that the methods in \citet{kumari2020batch} require more of a ``warm up'' to catch up to random query performance, whereas Info-NN consistently outperforms random. Inspecting the visualization of the learned food embedding in Fig.~\ref{fig:food_viz} also reveals that the embedding learned using Info-NN separates savory foods from sweet foods and even groups together similar foods, such as vegetables and fruits. Beyond triplet generalization accuracy, we see in Fig.~\ref{fig:dml_recall_topk_vis} that Info-NN outperforms the same methods on both a per-triplet and a per-query basis in Recall@$1$ and TopFraction@$21$, which suggests that Info-NN is more capable of grouping admitted students together. This is also visible in the visualizations in Fig.~\ref{fig:dml_recall_topk_vis}, where top-rated students are more clearly grouped together in the embedding learned using Info-NN than in the embedding learned with Batch-Centroid. Results for varying values of $K$ for Recall@$K$ and TopFraction@$K$ can be found in the appendix. We also note that Info-NN can be utilized in non-DML settings, such as with MDS \citep{tamuz2011adaptively}, and performs on par with more complex ranking queries (see appendix).



\subsection{Classification} \label{sec:classification}



We perform experiments on active image classification in a supervised setting, using NN queries to acquire labels iteratively. Info-NN-distances is used in these experiments.

\paragraph{Label selection and experimental framework.}
To select samples using Info-NN, for every unlabeled sample, we form the corresponding nearest neighbor query and compute an estimate of the information gain provided by that query. We then request a label for the unlabeled sample corresponding to the most informative query. In the experiments, we use a simple batch extension of this acquisition strategy, which performs a $k$-means clustering of the unlabeled samples in the embedding space and selects the most informative samples from every cluster. \emph{Info-NN-M} means that the batch variant of Info-NN was used to select NN queries of length $M$. We use Euclidean distances between the features learned by the last hidden layer to compute distances for the probability model. We experimented with the length of the queries and show plots for the best-performing values. We plot the median of the accuracy values along with the $25\%$ and $75\%$ quantiles over $3$ trials. More details can be found in the appendix.
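
One way to write this batch rule, as a sketch with notation introduced only for illustration, is the following. Let $\mathcal{U}$ be the set of unlabeled samples, $I(x)$ the estimated information gain of the NN query formed for $x \in \mathcal{U}$, and $\mathcal{C}_1, \dots, \mathcal{C}_k$ the clusters returned by $k$-means on the embedded samples. The batch collects the $n_j$ highest-scoring samples from each cluster,
\begin{equation*}
\mathcal{B} = \bigcup_{j=1}^{k} \underset{\mathcal{S} \subseteq \mathcal{C}_j, \, |\mathcal{S}| = n_j}{\arg\max} \sum_{x \in \mathcal{S}} I(x), \qquad \sum_{j=1}^{k} n_j = B,
\end{equation*}
where $B$ is the acquisition batch size and the number of clusters $k$ and the per-cluster counts $n_j$ are left unspecified in this sketch. The per-cluster restriction is what promotes diversity: no single region of the embedding space can supply the entire batch.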

We conduct experiments on the \textbf{MNIST} \citep{lecun1998gradient}, \textbf{CIFAR-10} \citep{krizhevsky2009learning}, and \textbf{SVHN} \citep{netzer2011reading} datasets using CNNs to demonstrate the performance of our active learning method with supervised classification. The experiments on MNIST start from an initial balanced labeled set of $30$ samples, $3$ from every class, chosen at random, and use an acquisition batch of size $10$. For CIFAR-10 and SVHN, we start with initial balanced labeled sets of $5{,}000$ and $3{,}000$ samples, respectively, and acquire batches of size $5{,}000$ and $3{,}000$.
We compare the performance of Info-NN with \emph{BatchBALD} \citep{kirsch2019batchbald}, \emph{K-Center} \citep{sener2017active}, \emph{MaxEntropy}, and \emph{Random} methods. 

\paragraph{Experiment results.}
While our method outperforms all the baselines on MNIST, it performs almost on par with MaxEntropy on CIFAR-10 and SVHN. A possible explanation is that the embeddings learned on these datasets are of lower quality. The embeddings directly affect the informativeness and diversity measures, thereby impacting the quality of the labels chosen. Using pre-trained networks instead could result in better-quality embeddings. We do not explore this direction here, but it is an interesting avenue for future work.

 
\section{Conclusion} \label{sec:conclusion}

In this paper, we introduce a generalized similarity-based active learning framework for selecting informative queries for both metric learning and classification. In a deep metric learning setting, we demonstrated that our framework is capable of outperforming recently developed methods for selecting batches of triplets on both a per-triplet and a per-query basis. For classification, our framework for active label selection outperformed the baselines on MNIST and performed almost on par with the strongest baseline on CIFAR-10 and SVHN. As shown by strong empirical performance, this framework marks a first step in developing generalized active learning methods capable of performing well in multiple problem areas. Avenues for future work include studying the proposed active label selection method for regression and improving the generalizability of the framework.

\begin{acknowledgements}
This work was supported, in part, by DARPA grant FA8750-19-C020 and NSF grant CCF-2107455.
\end{acknowledgements}

\bibliography{nadagouda_566}

\end{document}
