\documentclass{uai2023} % for initial submission
% \documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{graphicx}
\usepackage{subfig}
\usepackage{subcaption}
\usepackage{url}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}%\usepackage{natbib}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
\title{On the Informativeness of Supervision Signals}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2023 paper}{Jane~J.~von~O'L\'opez}{}}
\author[1]{Harry~Q.~Bovik}
\author[1,2]{Further~Coauthor}
\author[3]{Further~Coauthor}
\author[1]{Further~Coauthor}
\author[3]{Further~Coauthor}
\author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    Computer Science Dept.\\
    Cranberry University\\
    Pittsburgh, Pennsylvania, USA
}
\affil[2]{%
    Second Affiliation\\
    Address\\
    …
}
\affil[3]{%
    Another Affiliation\\
    Address\\
    …
  }
  
  \begin{document}
\maketitle

\begin{abstract}
  Supervised learning typically focuses on learning transferable representations from training examples annotated by humans. 
  % like hard labels, soft labels, and similarity judgments, by pairing them with objectives like contrastive learning and classification that aim to maximally leverage the information contained within them. 
  While rich annotations (like soft labels) carry more information than sparse annotations (like hard labels), they are also more expensive to collect. 
  % Being able to quantify the informativeness of each type of label would enable researchers to optimize the cost of data annotation. 
  We use information theory to compare how a number of commonly used supervision signals contribute to representation-learning performance, as well as how their capacity is affected by factors such as noise and the number of labels, classes, and dimensions. 
  % In this paper, we conduct an information-theoretic analysis of several commonly-used supervision signals to determine how they contribute to representation learning performance and how the dynamics are affected by parameters such as the number of labels, classes, and dimensions.
  Our framework provides theoretical justification for using hard labels in the big-data regime, but richer supervision signals for few-shot learning and out-of-training-distribution generalization. We validate these results empirically in a series of experiments with over [TODO: NUMBER OF ANNOTATIONS] crowdsourced image annotations and conduct a cost-benefit analysis to establish a tradeoff curve that enables users to optimize the cost of supervising representation learning on their own datasets.
  % Our framework provides theoretical explanations for deep learning phenomena like why hard labels work well in the big data regime, but few-shot learning requires much richer supervision signals. We validate these results empirically in a series of experiments with over [TODO: NUMBER OF ANNOTATIONS] crowdsourced image annotations and conduct a cost-benefit analysis to establish a tradeoff curve that enables users to optimize the cost of supervising representation learning on their own datasets. 

\end{abstract}


\section{INTRODUCTION}
%Training a classifier on a large set of simple hard labels is a well-established technique in deep learning (e.g., ImageNet pretraining), but it remains an open theoretical question why this kind of task-specific pre-training should result in ``good'' representations that actually capture the underlying structure of the data. Meanwhile, recent work suggests that training models on a relatively small set of rich labels that capture human uncertainty can also lead to high performance on tasks like few-shot image classification. 

 % There is growing evidence that labels derived in this manner capture important statistical regularities of naturalistic stimuli \cite{}, and can be used to improve the robustness of downstream machine-learning applications that rely on having good representations \cite{}. 
 
% Learning transferable representations by training a classifier is a well-established technique in deep learning (e.g. ImageNet pretraining~\citep{huh2016makes, ridnik2021imagenet}), but there is a lack of theory to explain why this should result in `good' representations or how the dynamics are affected by training parameters like the number of labels, classes, and dimensions in the training dataset. 

%One of the factors contributing to the success of modern deep learning is effective representation learning. Representation learning aims to learn useful latent embeddings of input stimuli. Generally, training a neural network means learning successive layers of representations that, typically, will be used in the final layer to perform some sort of task (e.g., classification).  The key decision in implementing a representation-learning framework often revolves around designing a supervision signal for the model by quantifying the similarity between stimuli. Significant work has gone into the design of supervision signals for deep representation learning resulting in a plethora of tasks using contrastive objectives \citep{chen2020simple,khosla2020supervised}, classification objectives \citep{huh2016makes,ridnik2021imagenet}, reconstruction objectives \citep{devlin2018bert, kingma2013auto}, and many others \citep{guo2019deep}. In addition, deep representation learning techniques provide exciting new avenues for cognitive science. Specifically, uncovering the structure of human representations by directly training on behavioral data such as similarity judgments can provide a modern alternative to classic methods such as multi-dimensional scaling \citep{shepard1980multidimensional}.

Modern machine learning relies heavily on large amounts of labeled data, and those labels typically come from annotations generated by humans. This raises an important question: how can we most efficiently use human annotations to create objective functions for machine learning systems? This is not just a matter of designing good interfaces and algorithms for collecting annotations---it involves a subtle interplay between the choices we make about the supervision signals we use to train our models and the difficulty of collecting the relevant annotations. For example, soft labels for images (indicating uncertainty via a distribution over classes) are more expensive to collect than hard labels (indicating a single class), but are also potentially more informative to the learner~\citep{peterson2019human,sucholutsky2021soft, Sucholutsky_Schonlau_2021,collins2022eliciting}. Making good choices about what questions to ask humans about our data requires understanding the informativeness of different supervision signals. 

In this paper, we explore this question for the case of \emph{representation learning}, where the aim is to learn useful latent embeddings of input stimuli. Generally, training a neural network means learning successive layers of representations that will be used to perform some sort of task (e.g., classification). The key decision in implementing a representation-learning framework often revolves around designing a supervision signal by quantifying the similarity between stimuli. Significant work has gone into the design of supervision signals for deep representation learning, resulting in contrastive objectives \citep{chen2020simple,khosla2020supervised}, classification objectives \citep{huh2016makes,ridnik2021imagenet}, reconstruction objectives \citep{devlin2018bert, kingma2013auto}, and many others \citep{guo2019deep}.
%Contrastive learning has been a particularly productive framework for representation learning, relying upon a loss function that rewards pushing together  embeddings of similar stimuli and pushing apart  embeddings of dissimilar ones \citep{chen2020simple,khosla2020supervised}. However, something akin to contrastive learning also implicitly occurs with a classification objective, with embeddings from the same class pushed together  and embeddings from different classes pushed apart. 
For example, recent work has shown that models trained with hard labels on classification tasks can approximate the structure of human latent representations 
at a fraction of the cost of exhaustively collecting the pairwise-similarity judgments required for soft contrastive learning~\citep{peterson2018evaluating,marjieh2022words}. %In addition, there is recent evidence that soft labels which capture human uncertainty over the distribution of classes associated with an image may improve performance, particularly when learning from very small datasets, for both models~\cite{peterson2019human, sucholutsky2021soft, Sucholutsky_Schonlau_2021, collins2022eliciting} and humans~\cite{malaviya_sucholutsky_oktar_griffiths_2022} performing classification tasks. 
Thus, in many cases, different representation learning objectives are possible, and it is not clear when one objective should be preferred. 

% [TODO: add sentence about human uncertainty being useful but typical binary contrastive learning and hard-label classification ignoring it, citing all our various previous papers on this]

\begin{figure}[htb!]
    \centering
    \includegraphics[width=0.85\linewidth]{RISS.pdf}
    % \includegraphics[width=0.49\textwidth]{RISS2.jpg}
    % \hfill
    % \includegraphics[width=0.49\textwidth]{RISS3.jpg}
    \caption{An example workflow that a user may go through when using our framework to decide what label type(s) to collect when annotating a dataset. \textbf{Top}: User specifies the task. \textbf{Middle}: Information-theoretic characterization of the problem setting (e.g., many classes, but few examples). Heatmaps show Spearman rank correlation ($\rho$) of pairwise similarities recovered from running Generalized Nonmetric Multi-Dimensional Scaling on simulated labels when varying latent dimensionality ($d$), number of points ($n$), and number of classes ($k$). \textbf{Bottom}: Cost-benefit analysis of signal type based on subjective utility ($u(\rho)$), cost weighting ($\beta$), number of points ($n$), and number of classes ($k$).}
    \label{fig:sims}
\end{figure}
% \subsection{Contributions}

% Our goal in this paper is to compare the informativeness of various supervision signals % (i.e. types of annotations) in order 
% and enable researchers to optimize data annotation for supervised representation learning:
% \begin{itemize}
%     \item We develop an information-theoretic framework for analyzing supervision signals and use it to compare hard labels and soft labels
%     \item We quantify their relative (representational) information content by comparing them to similarity triplets (i.e., queries of the form ``Is $x$ more similar to $y$ %than to
%     or $z$?'')
%     \item We relate this information to three key features of machine-learning datasets: number of labels, number of classes, and dimensionality. 
%     \item We find that while both hard and soft labels provide information about hidden representations, their robustness to these three variables differs greatly (Table~\ref{tab:tab1})
%     \item We show how the marginal information provided by each label translates into better representation learning performance using simulations
%     \item We conduct a cost-benefit analysis that shows soft labels are expensive to collect but provide more representational information while hard labels are cheap but fairly uninformative (Figure~\ref{fig:sims})
%     \item We propose several types of sparse representations that enable selective interpolation between the soft and hard label regimes (Figures~\ref{fig:fig2},~\ref{fig:fig3})
%     \item We combine these analyses to establish a tradeoff curve that enables users to optimize the cost of labeling their datasets for supervised representation learning
%     \item We corroborate our theory by conducting experiments with crowdsourced human similarity judgments and various types of hard and soft and hard labels collected for \texttt{CIFAR-10}~\citep{krizhevsky2009learning}
%     \item We extend our analysis to image classification, and show that particular types of softness confer more robust performance under increasing distributional shift and in the few-shot regime
% \end{itemize}     
%  We release all code, data, and results, including our novel \texttt{CIFAR-10DS} soft-labels, in the Supplementary Materials. 
 % We release all code, data, and results---including our novel \texttt{CIFAR-10DS} and \texttt{CIFAR-10T} soft-label variants---in the Supplementary Materials. 

 % Our goal in this paper is to compare the informativeness of various supervision signals (i.e. types of annotations) in order and enable researchers to optimize data annotation for supervised representation learning.
 
Our goal in this paper is to compare the informativeness of various supervision signals (i.e., types of annotations) so that researchers can optimize data annotation for their supervised representation learning tasks (see Figure~\ref{fig:sims}). We develop an information-theoretic framework for analyzing supervision signals and use it to compare two popular supervision signals from the classification literature---hard labels and soft labels. We quantify their relative (representational) information content by comparing them to similarity triplets (i.e., queries of the form ``Is $x$ more similar to $y$ than to $z$?''), a supervision signal used in contrastive learning and cognitive science~\citep{jamieson2011low,hoffer2015deep}. We relate the information each signal provides to three common features of machine learning datasets: number of labels, number of classes, and dimensionality. We find that while both hard and soft labels provide information about hidden representations, their responses to these three variables are very different. Simulations confirm these results, showing how the marginal information provided by each label translates into better representation learning performance. A cost-benefit analysis comparing soft and hard labels shows there are meaningful differences between them: soft labels are expensive to collect but provide a considerable amount of representational information, while hard labels are cheap but fairly uninformative (Figure~\ref{fig:sims}). To close this gap, we consider several types of sparse representations that enable selective interpolation between the soft and hard label regimes (Figures~\ref{fig:fig2},~\ref{fig:fig3}). We extend our analysis with these sparse supervision signals to establish a tradeoff curve allowing users to optimize the cost of labeling their datasets for supervised representation learning. %(Figure~\ref{fig:fig4}). 
Finally, we confirm the theoretical and simulation results by conducting experiments with crowdsourced human similarity judgments and various types of soft and hard labels collected for \texttt{CIFAR-10}~\citep{krizhevsky2009learning}. We release all code, data, and results---including our novel \texttt{CIFAR-10DS} (details in Section~\ref{sec:experiments})---in the Supplementary Materials. 


% \begin{table*}[htb!]
%     \centering
%     \caption{Properties of Supervision Signals}
%     \begin{tabular}{lccl}
%     \hline
%     Signal& $T(n,k)$ (\# of constraints)& Information ratio & Asymptotic behavior\\
%     \hline
%     Hard labels    &   $n(k-1)+n^2(1-1/k)$ & $\frac{2n(n+k-1 -n/k)}{(n+k)(n+k-1)(n+k-2)}$& ${\small \begin{cases}
% O(\frac{1}{n}) & n>>k\\
%  O(\frac{1}{n}) & n=k \\
% O(\frac{1}{k^2}) & n<<k
% \end{cases}}$\\
%     Soft labels    & $nk(k-1)/2+kn(n-1)/2$ & $\frac{kn}{(n+k)(n+k-1)}$  & $ {\small \begin{cases}
%     O(\frac{1}{n}) & n>>k\\
%     O(1) & n=k \\
%     O(\frac{1}{k}) & n<<k
%     \end{cases}}$\\
%     \hline
%     % Pairwise similarities &&&
%     \end{tabular}
%     \label{tab:tab1}
% \end{table*}

\section{RELATED WORK}
% \subsection{
% %Supervision Signals and 
% Related Work}
\label{sec:signal_types}
Throughout this paper, we discuss a variety of supervision signals. In this section, we identify the practical settings these signals correspond to and summarize related work. 

\textbf{Pairwise similarity judgments} 
For every pair of points in the dataset, labelers are asked to rate the pair's similarity on a fixed, bounded scale (e.g., a Likert scale). This signal has a long history of use by cognitive scientists to learn about (hidden) human representations of stimuli, typically via an embedding method like multi-dimensional scaling (MDS)~\citep{shepard1980multidimensional}, and more recently can be found in state-of-the-art contrastive learning objectives~\citep{chen2020simple, khosla2020supervised}. 

\textbf{Triplet similarity judgments}
For every set of three points $x,y,z$ in the dataset, labelers are asked to respond to queries of the form ``Is $x$ closer to $y$ than to $z$?'' This non-metric signal has also been used  as a method for learning human representations, often by applying embedding techniques like non-metric MDS~\citep{jamieson2011low,pmlr-v2-agarwal07a,6713995}; it is considered to be a more accurate alternative to pairwise similarity judgments as it is easier to compare two lengths than to provide consistent judgments over un-normalized scales. This signal is used in machine learning (typically with the triplet loss) and cognitive science~\citep{hoffer2015deep,hermans2017defense,hebart2020revealing,roads2021enriching}.

% , a  supervision signal used in contrastive learning and cognitive science~\citep{jamieson2011low,hoffer2015deep} 

\textbf{Hard labels} For each point, labelers  pick a single, most relevant class out of the fixed set of all classes in the dataset. This signal is typically used for classification, though it can also be used for representation learning via pre-training~\citep{huh2016makes,ridnik2021imagenet}.

\textbf{Soft labels} For each point, labelers assign proximity or probability to each of the fixed set of all classes in the dataset. This signal is also used for classification and representation learning via pre-training, and can be more effective than hard labels, particularly in settings with small data~\citep{Xie_2020_CVPR, Sucholutsky_Schonlau_2021, sucholutskv2021one, Liu2021LAST, malaviya_sucholutsky_oktar_griffiths_2022}.

\textbf{Top-class soft labels} The researcher picks a subset of size $\hat k$ of the classes in the dataset that maximizes mutual information (estimated from the initial subset of labels already collected). For each point, labelers assign proximity or probability to each class in this subset. The number of classes can be reduced by simply taking an arbitrary subset, or more systematically via methods like label coarsening~\citep{hanneman2011automatic}. 

\textbf{Sparse soft labels} For each point, labelers assign proximity or probability to each of exactly the $\hat k$ most relevant classes out of the fixed set of all classes in the dataset~\citep{collins2022eliciting}. In addition to the connections discussed above for the other soft label variants, this particular variant is also connected to top-$k$ classification~\citep{lapin2017analysis} and soft vector quantization~\citep{seo2003soft}. 

% \textbf{Principal component analysis (PCA) labels} The experimenter picks $\hat k$ new classes---which are in fact mixtures of the original classes from the dataset---that maximize a PCA objective over the initially collected subset~\citep{wold1987principal}. For each point, labelers assign proximity or probability to each of the classes in this new fixed set of classes. While technically not a supervision signal that could be collected using conventional methods, we apply PCA to the true hidden representations of the simulated datasets we consider in this paper as a way of estimating the upper bound of what performance a (theoretical) label with $\hat k$ components could achieve. % 

% In this section, we formalize representation learning, define a measure of representational information in terms of similarity triplets, derive the amount of information provided by hard and soft labels, and propose a metric that acts as a proxy for representation recovery performance.


\section{RELATIVE INFORMATIVENESS OF SUPERVISION SIGNALS}
\textbf{Motivation}
We consider the scenario where a researcher collects an initial set of labels for a small subset of their dataset and wants to use it to optimize the labeling of their entire (large) dataset (Figure~\ref{fig:sims}, top). The researcher has a number of options for what kind of labels to collect and wishes to maximize representation learning performance while minimizing labeling cost.

We formalize representation learning as the process of recovering a hidden (low-dimensional) latent structure from a set of (high-dimensional) stimuli (e.g., images). In particular, we focus on the non-metric setting where we want to recover the correct rank order of pairwise distances between all hidden latent vectors. Our goal is to determine which supervision signal is most effective (in terms of both performance and cost) for representation learning.

\textbf{Problem Definition}
Consider a set of stimuli $\{x_i\}_{i=1}^n \subset \mathbb{R}^d$ with associated latent representations $\{z_i\}_{i=1}^n \subset \mathbb{R}^h$. The distances between pairs of latent vectors induce a relational order over latent pairs, and our goal is to find a function $f:\mathbb{R}^d \rightarrow \mathbb{R}^h$ (where typically $h \ll d$) that preserves this relational order, that is, $||f(x_i)-f(x_j)||\leq||f(x_i)-f(x_l)||$ iff $||z_i-z_j||\leq||z_i-z_l||$. Crucially, the latent vectors are accessible only implicitly via different supervision signals (or queries) such as hard and soft labels. We operationalize the informativeness of a supervision signal as the number of relational constraints that a naive learner can recover from it (i.e., a learner that follows the signals as is, without applying other geometric constraints such as triangle inequalities).  

\textbf{Triplet Constraints}
Conceptually, when training a neural network for classification, providing a label for a point roughly corresponds to requiring that the network weights should be updated such that the embedding of this point will be \textbf{closer to one class than to other classes}. For this analysis, we assume that each class can be represented by its centroid (e.g., each class is unimodal), and so classification labels provide information about proximity of latent vectors to these centroids. When training with batches, providing labels for a batch additionally corresponds to requiring that the centroid of each class be \textbf{closer to its associated set of embeddings than to the other embeddings in the batch}. In both cases, neural networks are optimizing constraints of the form ``$x$ is closer to $y$ than to $z$,'' which we call ``triplet constraints''. We now formalize this concept to use it as a measure of information content in labels.

Suppose we have a system with $n$ labels, $k$ classes with centroids $C_1,...,C_k$, and  stimuli $\{x_i\}_{i=1}^n$ with latent representations $\{z_i\}_{i=1}^n$. 
A triplet constraint is an inequality of the form $||z_a-z_b||<||z_a-z_c||$. This can be rewritten as the query $r_{a, b, c}=\left\{x \in \mathbb{R}^{d}:\left\|f(x_{b})-f(x_{a})\right\|<\left\|f(x_{c})-f(x_{a})\right\|\right\}$, and each such query provides at most one bit of information \citep{jamieson2011low}. 
For any set of three stimuli, there are three unique queries, one per choice of anchor: $r_{a,b,c}, r_{b,a,c}, r_{c,a,b}$. Thus, the total number of unique queries for $n$ stimuli is $3{n \choose 3}$. 
However, in the case of hard and soft labels, we make queries not only in terms of the $n$ objects but also in terms of the $k$ class centroids. In other words, we are seeking to recover embeddings not only for the $n$ points of interest, but also $k$ additional reference embeddings. As a result, the total number of unique queries in these cases is $3{n+k \choose 3}$. %To be consistent with our query notation above, we define $x_{n+i} = C_i$  and $z_{n+i} = f(C_i)$. 
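As a purely illustrative check of these counts (the values $n=4$ and $k=2$ are arbitrary and are not taken from our experiments), embedding four stimuli alone admits
\[
3{4 \choose 3} = 3 \cdot 4 = 12
\]
unique queries, whereas adding two class centroids as reference embeddings raises the total to
\[
3{4+2 \choose 3} = 3 \cdot 20 = 60 .
\]
Even a small number of reference embeddings therefore enlarges the query space considerably.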




\textbf{Hard Labels}
%In practice, collecting hard labels corresponds to posing the following question to a labeler: `Which of these classes does this point best correspond to?'
We define the hard label for stimulus $x$ as a vector $l$ of length $k$ such that $l_i=1$ if $i=\argmin_j ||f(x)-f(C_j)||$ and $l_i=0$ otherwise.
% \begin{equation}
%     l_i=\begin{cases}
% 1 & i=\argmin_j(||f(x)-f(C_j)||)\\
% 0 & otherwise
% \end{cases}
% \label{eqn:hard-label}
% \end{equation}
We can now extract two types of triplet queries. The first is a triplet consisting of a class centroid $C_i$, a stimulus $x_P$ that is a ``positive'' example of this class, and a negative example stimulus $x_N$. The query has the form $||f(x_P)-f(C_i)||<||f(x_N)-f(C_i)||$. The second is a triplet consisting of a stimulus $x_i$, the class centroid that is closest to it ($C_P$), and another class centroid further away ($C_N$). This query has the form $||f(C_P)-f(x_i)||<||f(C_N)-f(x_i)||$. If the hard labels are distributed evenly between $k$ classes, then on average there are $n/k$ stimuli per class. Then $n$ hard labels give us $n(k-1)$ constraints of the second type and $k(n/k)(n-n/k)=n^2(1-1/k)$ of the first type---a total of $T_H(n,k)=n(k-1)+n^2(1-1/k)$  constraints. 
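As a hedged illustration with hypothetical values, a stimulus whose nearest centroid is $C_1$ among $k=3$ classes receives the hard label $l=(1,0,0)$. For a balanced dataset with the arbitrary choices $n=100$ and $k=10$ (not values from our experiments), the two counts evaluate to
\begin{align*}
n(k-1) &= 100 \cdot 9 = 900,\\
n^2(1-1/k) &= 10^4 \cdot 0.9 = 9000,\\
T_H(100,10) &= 900 + 9000 = 9900,
\end{align*}
so in this example most of the constraints come from comparing stimuli against a shared centroid rather than from comparing centroids against a single stimulus.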

\textbf{Soft Labels}
To produce a probability distribution over classes, neural networks often have a softmax activation function after the output layer~\citep{bridle1989training,martins2016softmax,krizhevsky2017imagenet}.  Accordingly, we define the soft label for a stimulus $x$ as a vector $l$ of length $k$, such that $l_i=\frac{e^{-||f(x)-f(C_i)||}}{\sum_je^{-||f(x)-f(C_j)||}}$. %\begin{equation}
%     l_i=\frac{e^{-||f(x)-f(C_i)||}}{\sum_je^{-||f(x)-f(C_j)||}}
%     \label{eqn:soft-lab}
% \end{equation}
There are again two types of triplet queries that we can extract from soft labels. The first is a triplet of the form $||f(x_P)-f(C_i)||<||f(x_N)-f(C_i)||$ where $C_i$ is the centroid of class $i$ and $x_P, x_N$ are two training set points with corresponding soft labels $l^P, l^N$ such that $l^P_i > l^N_i$.
The second is a triplet of the form $||f(x)-f(C_i)||<||f(x)-f(C_j)||$ where $x$ is a training set point corresponding to label $l$ and $C_i, C_j$ are the centroids of classes $i,j$ such that $l_i>l_j$. Our $n$ soft labels thus give us $nk(k-1)/2$ constraints of the second type and $kn(n-1)/2$ of the first---a total of $T_S(n,k)=kn(k+n-2)/2$ triplet constraints. 
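As a hedged illustration, suppose (hypothetically) that a stimulus lies at distances $1$, $2$, and $3$ from the centroids of $k=3$ classes. Its soft label is then
\[
l = \frac{(e^{-1},\, e^{-2},\, e^{-3})}{e^{-1}+e^{-2}+e^{-3}} \approx (0.665,\, 0.245,\, 0.090),
\]
which orders all $k(k-1)/2=3$ pairs of centroids for this stimulus. Assuming no ties among label entries, the arbitrary choices $n=100$ and $k=10$ (not values from our experiments) give
\begin{align*}
nk(k-1)/2 &= 100 \cdot 10 \cdot 9 / 2 = 4500,\\
kn(n-1)/2 &= 10 \cdot 100 \cdot 99 / 2 = 49500,\\
T_S(100,10) &= 4500 + 49500 = 54000.
\end{align*}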

% However, we note that depending on choice of soft label function $l$, it may not always be the case that $||f(x_P)-f(C_i)||<||f(x_N)-f(C_i)||\iff l^P_i > l^N_i$. For our choice of $l$, which can be described as the softmax over negative distances to class centroids, we have the following results. 
% Suppose $a, b \in \mathbb{R}^d, C_1,...,C_k \in \mathbb{R}^d, D_{ai} = ||C_i-a||, D_{bi} = ||C_i -b||, l_{ai} = \frac{e^{-D_{ai}}}{\sum_{j=1}^ke^{-D_{aj}}}, l_{bi} = \frac{e^{-D_{bi}}}{\sum_{j=1}^ke^{-D_{bj}}}$.
% \begin{align}
%     l_{ai}>l_{bi} & \implies D_{bi}>D_{ai} + \ln{\sum_{j=1}^ke^{-D_{aj}} - \ln{\sum_{j=1}^ke^{-D_{bj}}}}
%     \label{eqn:label_imp}
% \end{align}
% Let $\epsilon_a=\min_i{D_{ai}}$ be the distance from point $a$ to the nearest class centroid, then $-\epsilon_a<\ln{\sum_{j=1}^ke^{-D_{aj}}}<\ln{k}-\epsilon_a$. We can then refine our bound from Equation \ref{eqn:label_imp}. 
% \begin{align}
%     l_{ai}>l_{bi} & \implies D_{bi}>D_{ai} + \epsilon_b -\epsilon_a - \ln{k}
% \end{align}

% \subsubsection{Pairwise similarities}
% [FILL]
\textbf{Information Ratio}
While we now have a measure of how much information each label provides, it is unclear how much information is actually needed to recover human-aligned representations for all the objects. Intuitively, we would expect that more information is required when more objects are being embedded (i.e., when $n+k$ increases). We can normalize our results from the previous section to account for this by taking the ratio of the number of constraints we can recover from a set of labels to the total number of possible queries, i.e., $IR(n,k)=\frac{T(n,k)}{3{n+k \choose 3}}$. This \textbf{``information ratio''} may be a proxy for how much information we are recovering about the latent representations. We present the information ratios for hard and soft labels in the ``Analysis'' portion of Figure~\ref{fig:sims} along with their asymptotic behavior in three regimes: the many-shot case (where there are many more points than classes), the one-shot case (where there is one point per class; \citealp{1597116}), and the less-than-one-shot case (where there are fewer points than classes; \citealp{Sucholutsky_Schonlau_2021}). 
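As a numerical sketch with arbitrary illustrative values ($n=100$, $k=10$, not taken from our experiments), the formulas for $T_H$ and $T_S$ give $T_H(100,10)=9900$ and $T_S(100,10)=54000$, while the total number of queries is
\[
3{110 \choose 3} = \frac{110 \cdot 109 \cdot 108}{2} = 647{,}460 .
\]
The corresponding information ratios are
\begin{align*}
\frac{9900}{647{,}460} &\approx 0.015 \quad \text{(hard labels)},\\
\frac{54000}{647{,}460} &\approx 0.083 \quad \text{(soft labels)},
\end{align*}
so in this example soft labels recover roughly $5.5$ times as many of the possible constraints as hard labels.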

Our results predict three scaling phases for soft labels and two scaling phases for hard labels. In particular, hard labels and soft labels are predicted to have similar asymptotic behavior in the many-shot case, which may explain why pre-training on very large datasets using hard-label classification is effective \citep[e.g.,][]{huh2016makes,ridnik2021imagenet}. However, soft labels are predicted to yield much better representation learning performance in the one-shot and less-than-one-shot cases. Notably, the results predict that in the one-shot case, the quality of representations learned from soft labels should not degrade (as it does in every other case) when the number of points and classes increases.
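These regimes can be read off from closed forms. The following is a sketch that keeps only leading-order terms; the subscripted names $IR_H$ and $IR_S$ are introduced here for convenience. Substituting $T_H$ and $T_S$ into the definition of the information ratio and using $3{n+k \choose 3}=(n+k)(n+k-1)(n+k-2)/2$ gives
\begin{align*}
IR_H(n,k) &= \frac{2n(n+k-1-n/k)}{(n+k)(n+k-1)(n+k-2)},\\
IR_S(n,k) &= \frac{kn}{(n+k)(n+k-1)}.
\end{align*}
For $n \gg k$ with $k$ fixed, $IR_H \approx 2(1-1/k)/n$ and $IR_S \approx k/n$, so both decay as $O(1/n)$. For $n=k$, $IR_H = 1/(2n-1)$ while $IR_S = n/(2(2n-1))$, which approaches $1/4$ as $n$ grows. For $n \ll k$ with $n$ fixed, $IR_H \approx 2n/k^2$ and $IR_S \approx n/k$.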

\textbf{Representation Learning as Communication} Our representation learning setup can be seen as a communication problem where we want to recover a fixed number of bits about hidden representations of a black-box model. The model can be queried via multiple channels, each of which has its own encoder and corresponds to a different choice of supervision signal. As established by~\citet{jamieson2011low}, each triplet query provides at most one bit. In information theory, the efficiency (also known as the normalized entropy) of each channel is defined as $\eta(X) = \frac{H(X)}{H_{\max}(X)}$, where $H_{\max}$ is the maximum entropy (i.e., the total number of bits) and $H$ is the entropy (i.e., the number of bits remaining with unknown states after transmission). We defined the information ratio as the ratio of the number of bits recovered to the total number of bits, and thus a stochastic version of the information ratio (with data sampled from a distribution instead of being fixed) can be defined as $IR(X)=\frac{H_{\max}(X)-H(X)}{H_{\max}(X)}=1-\eta(X)$.
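As a toy illustration with hypothetical numbers, consider four stimuli, for which there are $3{4 \choose 3}=12$ triplet queries and hence at most $H_{\max}=12$ bits to recover. If $H=9$ bits remain unknown after transmission over some channel, then
\[
\eta = \frac{9}{12} = 0.75, \qquad IR = \frac{12-9}{12} = 1-\eta = 0.25 .
\]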

\textbf{Signal-to-Noise Ratio} Our triplet analysis implicitly assumed that the noise distribution was the same between each type of label and could thus be ignored when comparing their relative informativeness. However, in practice, eliciting different kinds of annotations may be associated with different levels of noise due to changes in difficulty, user interface, participant pools, etc. Within our information-theoretic framework, noise can be viewed as the probability $\epsilon_s$ of a bit flip (i.e. that we get the incorrect response to a query of the form ``Is $x$ closer to $y$ than to $z$?'') during communication over channel $s$. If a set of $n$ labels (with $k$ classes) of type $s$ provides $T_s(n,k)$ triplet constraints under our framework in the noise-free case, then the expected number of correctly recovered constraints in the noisy case is just $(1-\epsilon_s)T_s(n,k)$. %The associated information ratio is updated accordingly. 
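
As an illustrative sketch with hypothetical values, suppose labels of type $a$ provide $T_a=40$ constraints with bit flip rate $\epsilon_a=0.3$, while labels of type $b$ provide $T_b=30$ constraints with $\epsilon_b=0.05$. Then
\[
(1-\epsilon_a)T_a = 0.7\cdot 40 = 28, \qquad (1-\epsilon_b)T_b = 0.95\cdot 30 = 28.5,
\]
so in this example the nominally less informative but cleaner signal ends up providing slightly more constraints.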

% \subsection{Optimizing sparsity}

% [TODO: Something about how there is a big jump between soft and hard labels and how we can vary the number of features/sparsity to try and optimize cost-benefit tradeoff]

% \paragraph{PCA labels}
% [What is the best we can do given a fixed number of bits per label? ]

% \paragraph{Top-class labels}
% [Measure mutual info to determine least informative classes, only collect similarity to remaining classes]

% \paragraph{Sparse labels}
% [Ask labelers to only rate similarity to closest classes]

\begin{figure*}[htb!]
    \centering
    \includegraphics[width=0.25\textwidth]{soft-label-dimensionality_A.png}
    \includegraphics[width=0.74\textwidth]{soft-label-dimensionality_B.png}
    \caption{\textbf{Left}: Effective dimensionality of soft labels at different combinations of $n$ and $k$. \textbf{Right}: Three examples of PCA curves used to compute effective dimensionality of soft labels for every combination of $n$ and $k$.}
    \vspace{-3mm}
    \label{fig:fig2}
\end{figure*}
\section{SIMULATIONS}
% \subsection{Hard Labels vs. Soft Labels}
% \paragraph{Simulations.} 
Assuming information ratios are a proxy for representation learning performance, our analysis predicts soft labels should lead to better performance than hard labels, particularly when there are few labels and many classes. However, we still need to understand how information ratios actually translate to representation learning performance. 

We conducted simulations to see the effect of four variables (and their interactions) on representation learning performance: label type (soft or hard), number of points ($n$), number of classes ($k$), and latent dimension ($d$).
We consider values of $n$ and $k$ in the range of $[3,90]$, and $d\in\{5,25,125\}$. For each combination of $n,k,d$ we sample a total of $n$ points from Gaussians centered at $k$ random locations $C_1,\dots,C_k \in \mathbb{R}^d$. 
We compute hard and soft labels for these points using the equations defined above and then mine all triplet constraints of both types from both sets of labels. We apply Generalized Nonmetric Multi-Dimensional Scaling (GNMDS; \citealp{6713995}) to both sets of queries to find embeddings that best fit the respective triplet constraints. The Gram matrix output by GNMDS can be interpreted as the predicted (unnormalized) pairwise cosine similarities between all $n+k$ points and centroids.

To understand how much information we recover from each of these two sets of queries, we construct a matrix of the true pairwise cosine similarities for the set of all $n+k$ points and class centroids and compute the Spearman rank correlation ($\rho$) between the upper triangle of each Gram matrix and that of the ground-truth matrix. Thus, a higher $\rho$ corresponds to better recovery of the underlying latent representations. 
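
For reference, when there are no ties, the Spearman rank correlation over $m$ compared entries can be written as
\[
\rho = 1 - \frac{6\sum_{i=1}^{m} d_i^2}{m(m^2-1)},
\]
where $d_i$ is the difference between the rank of entry $i$ among the predicted similarities and its rank among the true similarities. If the diagonal is excluded from the upper triangle (an assumption made here for illustration), then $m=\binom{n+k}{2}$, e.g. $m=15$ for $n+k=6$.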

We visualize the results of the simulations in Figure~\ref{fig:sims}. The results confirm the theoretical findings from the previous section. Specifically, the three phases for soft labels and two phases for hard labels match our analytical results, and a higher information ratio translates into better performance.

\section{COST-BENEFIT TRADEOFFS}%{Cost-benefit tradeoffs.}
We can now construct cost-benefit tradeoff curves to determine when a user would prefer to use one signal over the other. Suppose we define $\rho$ as above, and subjective utility as $U(\rho)$. This utility function can take many forms (e.g., $U(\rho)=b\rho$ or $U(\rho)=b\sigma(\rho)$, where $\sigma$ is the sigmoid function and $b>0$). If we assume that collecting a soft label over $k$ classes is about $k$ times as expensive as collecting a hard label, we can define the subjective loss function as $L_s=C(s) - U(\rho)$, where $C(s)=cn$ if $s\in S_{hard}$ and $C(s)=cnk$ if $s\in S_{soft}$, with $c>0$ the cost of a single hard label.
%$L_s=C(s) - U(\rho), C(s)=\begin{cases} cn & s\in S_{hard}\\ cnk &s\in S_{soft} \end{cases}$. 
This is equivalent to optimizing $\hat L = \frac{c}{b}\hat c(s) - \hat u(\rho)$, which we can re-parametrize to a form reminiscent of the information bottleneck~\citep{tishby2000information}: $\hat L = \beta\hat c(s) - \hat u(\rho)$. Since we have shown that the information ratio, which we denote by $\hat\rho$, can provide us with an estimate for $\rho$, we can also replace $U(\rho)$ by $U(\hat\rho)$. 
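
To make the equivalence explicit, one natural reading (the hatted quantities are not spelled out above, so this is an assumption) is $U(\rho)=b\,\hat u(\rho)$ and $C(s)=c\,\hat c(s)$, with $\hat c(s)=n$ for hard labels and $\hat c(s)=nk$ for soft labels. Since $b>0$, dividing the loss by $b$ does not change its minimizer:
\[
\frac{L_s}{b} = \frac{c}{b}\,\hat c(s) - \hat u(\rho) = \beta\,\hat c(s) - \hat u(\rho), \qquad \beta=\frac{c}{b}.
\]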

We investigate cost-benefit tradeoffs by varying $(\beta, \hat u)$, showing the results for several combinations in Figure~\ref{fig:sims}. While the results depend greatly on the choice of $\hat u$, a few regularities emerge. First, in all cases, regardless of $\beta$, when the number of classes ($k$) or the number of points ($n$) is low, soft labels are at least roughly as preferable as hard labels. Second, when there is an emphasis on cost (i.e. high $\beta$), hard labels become preferable as $n$ and $k$ both increase, but when the emphasis is on performance (i.e. low $\beta$), soft labels remain preferable as $n$ and $k$ increase.
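
The preference between the two signals can be written as a single inequality. Writing $\rho_{soft}$ and $\rho_{hard}$ for the performance obtained with each label type (notation introduced here for illustration) and taking $\hat c(s)=nk$ for soft labels and $\hat c(s)=n$ for hard labels, soft labels are preferred when
\[
\beta nk - \hat u(\rho_{soft}) < \beta n - \hat u(\rho_{hard})
\;\Longleftrightarrow\;
\beta\, n(k-1) < \hat u(\rho_{soft}) - \hat u(\rho_{hard}).
\]
The left-hand side grows with $\beta$, $n$, and $k$, while the right-hand side is bounded whenever $\hat u$ is bounded, which is consistent with hard labels becoming preferable at high $\beta$ as $n$ and $k$ increase.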
% \begin{figure}
%     \centering
%     \includegraphics[width=\textwidth]{RISS3.jpg}
%     \caption{Supervision signal preference based on subjective utility function, cost weighting parameter ($\beta$), number of points ($n$), and number of classes ($k$).}
%     \label{fig:tradeoffs}
% \end{figure}


\section{LABEL OPTIMIZATION}

\begin{figure}[t!]
    \centering
    \includegraphics[width=\linewidth]{RISS_fig_3.png}
    \caption{Label sparsity. \textbf{Left}: Spearman rank correlation ($\rho$) of pairwise similarities recovered from PCA labels, sparse labels, and top-class labels when varying the maximum number of features/classes in the labels ($\hat k$), number of points in the dataset ($n$), and number of classes in the dataset ($k$). \textbf{Right}: Comparison of sparsity curves for PCA labels (blue), sparse labels (purple), and top-class labels (red) for several combinations of $n$ and $k$. Straight lines correspond to soft label (solid green) and hard label (dashed green) performance. }
    \label{fig:fig3}
    \vspace{-4mm}
\end{figure}
% \begin{figure*}[htb!]
%     \centering
%     \includegraphics[width=0.24\textwidth]{fig4/fig4A.png}
%     \includegraphics[width=0.24\textwidth]{fig4/fig4E.png}
%     \includegraphics[width=0.24\textwidth]{fig4/fig4C.png}
%     \includegraphics[width=0.24\textwidth]{fig4/fig4H.png}
%     \includegraphics[width=0.24\textwidth]{fig4/fig4B.png}
%     \includegraphics[width=0.24\textwidth]{fig4/fig4F.png}
%     \includegraphics[width=0.24\textwidth]{fig4/fig4D.png}
%     \includegraphics[width=0.24\textwidth]{fig4/fig4G.png}
%     \caption{Loss curves for soft labels (solid green), sparse labels (purple), top-class labels (red), and hard labels (dashed green) based on subjective utility function ($u(\rho)$), cost weighting parameter ($\beta$), sparsity ($\hat k$), number of points ($n$), and number of classes ($k$). Global minima for sparse labels and top-class labels are marked with a point.}
%     \label{fig:fig4}
% \end{figure*}
\textbf{Effective Dimensionality}
While soft labels appear to be more informative than hard labels, they are not necessarily an optimal encoding for efficiently communicating information about representations. In order to study how an optimal encoding might perform, we run principal component analysis (PCA) on our simulated datasets and observe the effects of varying the number of PCs that are retained ($\hat k$). The resulting performance curves provide an approximate upper bound on how well any set of vectors of length $\hat k$ (e.g. a set of soft labels) can communicate information about representations. We can also use these curves to understand how efficient soft labels are at this task. Specifically, we define the ``effective dimensionality'' of a set of soft labels as the minimum number of PCs ($\hat k$) necessary to achieve the same representation learning performance as when using the soft labels in the way we described in the previous section. In Figure~\ref{fig:fig2}, we show the effective dimensionality of soft labels for different combinations of $n$ and $k$ along with several examples of the PCA curves. We found a strong positive correlation between information ratio and effective dimensionality ($r=0.734$, $p<10^{-15}$), providing further evidence that the information ratio is a useful metric for evaluating representation learning signals and predicting performance.
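
In symbols, writing $\rho_{PCA}(\hat k)$ for the performance obtained when retaining $\hat k$ PCs and $\rho_{soft}$ for the performance obtained with soft labels (both names are introduced here for illustration), the definition above can be sketched as
\[
k_{\mathrm{eff}} = \min\bigl\{\hat k : \rho_{PCA}(\hat k) \geq \rho_{soft}\bigr\}.
\]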

\textbf{Label Sparsity}
The effective dimensionality of soft labels appears to generally be much lower than the number of components of each soft label ($k$), which suggests that soft labels are an inefficient encoding for representational information, potentially due to redundancy. We now investigate two methods for remedying this inefficiency by introducing sparsity into soft labels in a disciplined manner. 

The first method, which we call ``top-class soft labels'', introduces sparsity globally by ignoring classes that are the least informative across the entire dataset. Formally, we construct a matrix $X$ where the $i$-th row corresponds to the $i$-th soft label and the $j$-th column corresponds to the $j$-th class, and estimate the mutual information between each column and the ground-truth similarity matrix. We then keep only the top $\hat k$ most informative columns and set the rest to $0$. 

The second method, which we call ``sparse soft labels'', is to introduce sparsity locally by ignoring classes that are least informative for each point. Formally, we again construct a matrix $X$ as above, but now for each row, we individually keep only the $\hat k$ largest components and set the rest to $0$. These two methods provide a way to reduce the cost of collecting soft labels, while retaining much of the information. 
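
As a small worked example (the values are chosen purely for illustration), consider a row of $X$ over $k=4$ classes with $\hat k=2$:
\[
x_i = (0.50,\ 0.30,\ 0.15,\ 0.05).
\]
The sparse soft label keeps the two largest components of this particular row, giving $(0.50,\ 0.30,\ 0,\ 0)$. A top-class soft label instead keeps the same columns for every row; if the two most informative columns happened to be classes 1 and 3, the same row would become $(0.50,\ 0,\ 0.15,\ 0)$. Following the description above, the remaining entries are simply set to $0$ and no renormalization is applied in this sketch.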

In Figure~\ref{fig:fig3}, we visualize representation learning performance when using PCA, sparse labels, or top-class labels at different levels of sparsity. While sparse labels consistently outperform top-class labels (which is to be expected since sparse labels provide finer control over how sparsity is induced), we note that there is also a gap between PCA and sparse labels. This again suggests that it may be possible to design more effective labeling methods than soft labels, potentially by defining a search space over possible classes and applying a procedure like PCA to optimize over it. We leave this as a promising direction for future work.


\textbf{Optimizing Label Collection}
Since top-class and sparse labels provide a way to selectively interpolate between the soft label and hard label regime, we can now use sparsity to optimize the cost-benefit tradeoff curves discussed above beyond the binary preference optimization shown at the bottom of Figure~\ref{fig:sims}. We use the same loss functions as above with several combinations of cost parameter $\beta$ and utility function $\hat u$, but we now apply them to top-class and sparse labels at various levels of sparsity ($\hat k$). We visualize a number of these cost-benefit tradeoff curves in the Supplementary Materials. By picking the $\hat k$ that corresponds to minimal loss, we can now optimize our label collection to minimize cost while maximizing performance. The results suggest that using sparse labels and picking the right level of sparsity can often provide substantial gains over using either hard labels or regular (dense) soft labels.
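
One way to write this selection, assuming for illustration that the cost of a label with $\hat k$ retained entries scales as $\hat c(\hat k)=n\hat k$ (so that $\hat k=1$ and $\hat k=k$ match the hard and soft label costs used earlier) and writing $\rho(\hat k)$ for the performance at sparsity level $\hat k$, is
\[
\hat k^{*} = \argmin_{1\leq \hat k\leq k}\ \bigl[\beta\, n\hat k - \hat u\bigl(\rho(\hat k)\bigr)\bigr].
\]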

% \section{DISCUSSION}
% The results suggest that there is generally no optimal heuristic for collecting labels (e.g. ``always collect soft labels'', ``always use 20\% sparsity'', etc.) without .

% \begin{figure}[t]
%     \centering
%     \includegraphics[width=0.44\linewidth]{sparse-soft-labels_A.png}
%     \includegraphics[width=0.49\linewidth]{sparse-soft-labels_B.png}
%     \caption{\textbf{Left}: Spearman rank correlation ($\rho$) of pairwise similarities recovered from running GNMDS on sparsified soft labels when varying the maximum number of features ($\hat k$), number of points ($n$), and number of classes ($k$). \textbf{Right}: Seven examples of soft label sparsity curves for different combinations of $n$ and $k$. }
%     \label{fig:my_label}
% \end{figure}

% \begin{figure}[t]
%     \centering
%     \includegraphics[width=0.44\linewidth]{top-class-soft-labels_A.png}
%     \includegraphics[width=0.49\linewidth]{top-class-soft-labels_B.png}
%     \caption{\textbf{Left}: Spearman rank correlation ($\rho$) of pairwise similarities recovered from running GNMDS on top-class soft labels when varying the maximum number of classes ($\hat k$), number of points ($n$), and number of classes ($k$). \textbf{Right}: Seven examples of soft label class-sparsity curves for different combinations of $n$ and $k$. }
%     \label{fig:my_label}
% \end{figure}
\section{EXPERIMENTS}\label{sec:experiments}
In this section, we investigate how our theory applies to real soft labels crowdsourced from human annotators. First, we present a number of different methods for collecting soft labels (including several new sets of soft labels that we crowdsource for this study) and place them into the framework described above. Second, we use a large dataset of similarity judgments to assess the representation-learning potential of the different label types. Finally, we assess how the inductive biases conferred by different label types affect the classification performance of a range of convolutional neural networks (CNNs).

% Here, we expect improved representation structure to make classification performance more robust---particularly in the out-of-training-distribution and few-shot regimes. 

% We next investigate the fidelity of our theory on \textit{real human-derived labels}. We focus on the CIFAR-10 dataset [cite], investigating a compendium of soft labels. We train .... models over these labels and study ... 

\subsection{Experimental Setup}

% \subsubsection{Supervision signals} 
We consider a range of supervision signals over the \textit{testing} subset of \texttt{CIFAR-10} images \citep{krizhevsky2009learning}, including hard labels, smoothed hard labels, soft labels, and similarity judgments. Each label type represents a different supervision signal presented in Section~\ref{sec:signal_types}. Each type of soft label was collected using a different experimental interface, details of which are in the Supplementary Materials.

%, as it has been used to collect the greatest range of soft labels to date

% spanning a variety 
\textbf{CIFAR-10H} The \texttt{CIFAR-10H} labels, originally collected by \cite{peterson2019human, battleday2020capturing}, are derived by averaging crowdsourced hard labels (roughly 50 per image). These are then normalized at the image level to return probability distributions. 
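
Concretely, if $c_{ij}$ denotes the number of annotators who assigned image $i$ to class $j$ (notation introduced here for illustration), the normalization can be written as
\[
p_{ij} = \frac{c_{ij}}{\sum_{j'} c_{ij'}}.
\]
For a hypothetical image with 50 votes split as 40, 8, and 2 across three classes, this gives probabilities $0.80$, $0.16$, and $0.04$ for those classes and $0$ for the rest.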

\textbf{CIFAR-10DS} The \texttt{CIFAR-10DS} labels, a novel set of soft labels we crowdsourced for this study, are \textit{dense} soft labels (i.e., assigned over all classes in \texttt{CIFAR-10}). Annotators provided numerical judgments on a 0 (not at all) to 1 (completely) scale using sliders depending on how well each category described a certain image. 

\textbf{CIFAR-10S} The \texttt{CIFAR-10S} labels from \citeauthor{collins2022eliciting} are akin to \textit{sparse soft labels}. Annotators provided judgments about the likelihood of the top two categories, and specified any categories which they believed were definitely not possible (referred to as a ``clamp'', and treated as ``zero-probability'' classes). As the authors only collected such labels over 1{,}000 examples from the test set, we in-fill the remaining 9{,}000 with either: 1) hard labels (\texttt{CIFAR-10S+hard}) or 2) simulated top-2 soft labels (\texttt{CIFAR-10S+dense}). 

\textbf{CIFAR-10LS} We derive two further sets of soft labels by applying label smoothing (LS) to the \texttt{CIFAR-10} hard labels. Unlike human-derived soft labels, the ``softness'' here is applied uniformly and independently of the associated image. This allows us to control for the previously observed regularization effects of label smoothing~\citep{muller2019labelSmoothHelp}. We pick the smoothing rate, $\epsilon$, to roughly match the distributions of our crowdsourced labels. The ``low'' level, $\epsilon\approx0.05$, is the average probability mass per soft label for the 9 non-maximal categories in the \texttt{CIFAR-10H} dataset. The ``high'' level, $\epsilon\approx0.2$, matches the \texttt{CIFAR-10DS} dataset.
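
For concreteness, one way to write the smoothed label for an image with hard label $c$ over $K=10$ classes is given below. It assumes that $\epsilon$ is the total mass moved to the $K-1$ non-target classes and that this mass is split uniformly; the description of $\epsilon$ above suggests this reading, but it is not stated explicitly.
\[
y^{LS}_j =
\begin{cases}
1-\epsilon & \text{if } j=c,\\
\dfrac{\epsilon}{K-1} & \text{otherwise.}
\end{cases}
\]
Under this reading, $\epsilon=0.2$ places $0.8$ on the target class and $0.2/9\approx 0.022$ on each of the other nine classes.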

%U nlike human-derived soft labels, the ``softness'' of these labels is not linked to their associated images---they are smooth rather than soft. 

\textbf{Similarity Judgments} We elicited similarity judgments over two subsets of 100 \texttt{CIFAR-10} test set images each by having human annotators rate the similarity between unlabeled image pairs on a Likert scale ranging from 0 (completely dissimilar) to 6 (completely similar). The images were deliberately chosen to be ambiguous about class. Additional details are in the Supplementary Materials.

% [TODO: Results of ung GNMDS to get spearman correlations between labels and similarity judgments]


% Make table; have lr in it?

% \subsection{Cross-label performance...}\\

\subsection{Label Informativeness}

\textbf{GNMDS Results}
We first repeat our GNMDS analyses from the theory and simulation sections on all the CIFAR-10 label variants. 
We use GNMDS to get triplet-respecting embeddings for each label type and then compute several metrics. As before, we compute Spearman correlations ($\rho$) between the similarities recovered by GNMDS from each \texttt{CIFAR-10} label variant and the elicited similarity judgments. Our previous analyses assumed a consistent signal-to-noise ratio, or error rate, across all the label types. However, with our set of CIFAR-10 label variants coming from different participant pools and elicitation pipelines, it is likely that error rates will vary between label types. We approximate the error rate of each label type by the proportion of triplet queries derived from the GNMDS embedding whose binary responses do not match the responses from the corresponding triplets computed from the human similarity judgments (i.e. the bit flip rate). Additionally, we compute the entropy and the variance of the first order statistic (i.e. the variance of the probability mass assigned to the class with the highest assigned probability) for each label variant. 
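
These quantities can be sketched as follows, using notation introduced here for illustration: $Q$ is the set of triplet queries, $r_q^{s}$ and $r_q^{h}$ are the binary responses to query $q$ derived from the GNMDS embedding for label type $s$ and from the human similarity judgments, and $p_i$ is the soft label of image $i$ over $k$ classes.
\[
\hat\epsilon_s = \frac{1}{|Q|}\sum_{q\in Q} \mathbf{1}\bigl[r_q^{s} \neq r_q^{h}\bigr], \qquad
H(p_i) = -\sum_{j=1}^{k} p_{ij}\log p_{ij},
\]
\[
V_s = \operatorname{Var}_i\Bigl(\max_{j} p_{ij}\Bigr).
\]
How the per-image entropies are aggregated for each label variant (e.g. by averaging over images) is not specified above and is left open in this sketch.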

We visualize these results in Figure~\ref{fig:gnmdsHuman}. We find that representation learning performance (measured by Spearman correlation) generally increases as softness (measured by label entropy and order statistic variance) increases, but does not increase when smoothing hard labels. However, we find there is a ``sweet spot'' for softness after which performance begins to decrease. We also find the expected linear relationship between error rate and performance that our framework predicts. We hypothesize that the non-monotonic relationship between softness and performance may partially be caused by increasing error rates, and partially by a resonance effect (see Supplementary Materials for details).


\begin{figure}[t!]
    \centering
    \includegraphics[width=0.45\textwidth]{gnmds_label_comp.png}
    \caption{Correspondence between various metrics and Spearman rank correlation for each label variant.}
    \vspace{-4mm}
    \label{fig:gnmdsHuman}
\end{figure}
% \begin{table}[!h]
%     \centering
%     \caption{Correspondence between human similarity judgments and various label varieties, sorted by label entropy.} \label{tab:gnmdsHuman}
% \begin{tabular}{lrrrr}
% \toprule
% {} &  $\rho$ &  Entropy &  Variance &  Error Rate \\
% \midrule
% \textbf{CIFAR-10} &      0.15 &           0.00 &            0.00 &        0.44 \\
% \textbf{-10H } &      0.40 &           0.15 &            0.01 &        0.36 \\
% \textbf{-10S+hard} &      0.53 &           0.07 &            0.01 &        0.32 \\
% \textbf{-10S+dense} &      0.53 &           0.28 &            0.01 &        0.32 \\
% \textbf{-10DS} &      0.36 &           0.69 &            0.04 &        0.37 \\
% \textbf{-LS (Low)} &      0.15 &           0.45 &            0.00 &        0.44 \\
% \textbf{-LS (High)} &      0.15 &           1.36 &            0.00 &        0.44 \\
% \bottomrule
% \end{tabular}
% \end{table}

% \begin{table}[!h]
%     \centering
%     \caption{Correspondence between human similarity judgments and the label types. Metrics are computed over the 1k subset annotated in \citeauthor{collins2022eliciting}.} \label{tab:gnmdsHuman}
% \begin{tabular}{lrrrr}
% \toprule
% {} &  $\rho$ &  Entropy &  Variance &  Error Rate \\
% \midrule
% \textbf{Hard Labels} &      0.15 &           0.00 &            0.00 &        0.44 \\
% \textbf{-10H} &      0.40 &           0.46 &            0.02 &        0.36 \\
% \textbf{-10S} &      0.53 &           0.69 &            0.03 &        0.32 \\
% \textbf{-10DS} &      0.36 &           0.93 &            0.05 &        0.37 \\
% \textbf{-LS (Low)} &      0.15 &           0.31 &            0.00 &        0.44 \\
% \textbf{-LS (High)} &      0.15 &           0.94 &            0.00 &        0.44 \\
% \bottomrule
% \end{tabular}

% \end{table}

\begin{figure*}[t!]
    \centering
    \includegraphics[width=0.9\textwidth]{crosslabel_eval.png}
    \includegraphics[width=0.9\textwidth]{generalization_checks.png}
    \caption{\textbf{Top:} Model performance on different label types at test time. \textbf{Bottom:} Generalization performance under increasing distributional shift. Each point represents the average score for a single model architecture (specified by color), trained on a particular label type (indicated via shape). Vertical lines represent points for a given label type.}
    \vspace{-3mm}
    \label{fig:genCheck}
\end{figure*}
\textbf{Models} We next train a range of natural-image classifiers on each of our supervision sets. These models use a number of distinct architectural features and reflect seminal developments in the progression of natural-image classification (see Supplementary Materials for further details).

\textbf{Cross-Label Performance}
We first assess how training image classifiers on one set of soft labels impacts their validation-set performance when testing on other label types. Our primary measure of performance is cross-entropy between the models' predictive distributions and soft labels. 
% Optional sentence about inductive biases and entropy matching;
Consistent with the GNMDS experiments, we observe a U-shaped relationship between label entropy and model performance for nearly all the soft-label types (Figure~\ref{fig:genCheck}, top row). The one exception comes from testing on \texttt{CIFAR-10DS} labels, where training on higher levels of softness is preferred. 
% Optional sentence about raja labels being hard to learn.
These results suggest that, for image classification, \textit{sparse soft labels}---of the kind collected in \cite{collins2022eliciting} or formed via averaging \citep{peterson2019human}---best capture the representational information expressed across different soft label sets. We also find that this relationship is preserved across different model architectures, with the \texttt{Shake Shake} architecture performing best across all label sets.
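
The cross-entropy measure used above can be written, for $N$ evaluation images with soft labels $y_i$ and model predictive distributions $q_i$ over $k$ classes (notation introduced here for illustration), as
\[
\mathrm{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{k} y_{ij}\log q_{ij},
\]
where averaging over images is an assumption of this sketch. Lower values indicate predictive distributions that more closely match the soft labels.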

\subsection{Generalization}
%Effective representation learning should support good generalization \cite{sucholutsky2023alignment}. 
To further test how well image classifiers extract representational information from soft labels, we assess their generalization performance on increasingly out-of-distribution image sets (Figure~\ref{fig:genCheck}, bottom row; see Supplementary Materials for dataset details). We find that for near-distribution datasets (\texttt{CIFAR-10 50K} \citep{krizhevsky2009learning}; \texttt{CIFAR10.1v6,v4} \citep{recht2018cifar}), classifier performance follows the U-shaped relationship described above. However, for far-distribution datasets (\texttt{CINIC} \citep{cinic}, \texttt{ImageNet-Far} \citep{peterson2019human}) a different pattern emerges. As found in \cite{peterson2019human}, soft labels perform increasingly well relative to hard labels as distribution shift increases. %This also applies to the smoothed version of hard labels, suggesting that in this setting the nature of softness---and their representational informativeness---counts. 
The much improved relative performance of the \texttt{CIFAR-10DS}-trained networks supports this view: although these labels are the noisiest, they also contain the richest representational supervision signal. It may be that more expressive classifiers are needed to fully capitalize on this richness \citep{battleday2020capturing, singh2020end}.


% \begin{figure*}[htb!]
%     \centering
%     \begin{subfigure}[b]{0.55\textwidth}
%        \includegraphics[width=1\linewidth]{crosslabel_eval.png}
%        \caption{}
%        \label{fig:crosslabelEval} 
%     \end{subfigure}
    
%     \begin{subfigure}[b]{0.55\textwidth}
%        \includegraphics[width=1\linewidth]{generalization_checks.png}
%        \caption{}
%        \label{fig:genCheck}
%     \end{subfigure}
% \end{figure*}



% \subsection{Few-Shot Learning} We next empirically investigate the theoretical .... that softness is beneficial in sparse data regimes. 

% [TODO: We need the following sets of results: comparison to similarity judgments (GNMDS and NNs), generalization (including to CINIC dataset), few-shot learning]


\section{DISCUSSION}
In this paper, we offer a principled set of theoretical and empirical findings aimed at helping researchers to determine which form of supervisory signal they ought to collect for the task at hand. We have provided theoretical grounding for how hidden representations can be recovered through supervised classification and have related the quality of these recovered representations to training parameters such as number of labels, classes, and dimensions. We found that while hard labels and soft labels provide comparable amounts of information in the many-examples-but-few-classes regime, soft labels become increasingly preferable when the number of classes increases or the number of labels decreases.  Our findings explain why, for example, pre-training a classifier on \texttt{ImageNet1K} (1,000 classes) or \texttt{ImageNet21k} (21,000 classes) using hard labels may lead to decent transfer learning performance~\citep{huh2016makes, ridnik2021imagenet} but pre-training with (a form of) soft labels may lead to even better transfer learning performance~\citep{Xie_2020_CVPR}. We support our theoretical contributions with empirical results on a suite of human-derived soft labels.

We note that, in our analysis, we made no assumptions about the data distribution in stimulus space, nor the function $f(x_i)=z_i$ that maps from stimulus space to hidden representations, but when training neural networks we often assume some level of stability or invariance (i.e., a small perturbation in pixel space does not lead to drastically different perception of the image). When satisfied, assumptions about stability or invariance, often called ``inductive biases,'' allow learners to extract additional information from training examples, sometimes even in an unsupervised way when no labels are present. As a result, our analysis here can be considered as a sort of lower-bound on how much information about hidden representations a labeled training dataset can provide. We also examined each supervision signal in isolation, assuming that only labels of one type are collected. A promising future direction would be to analyze additional sources of information (inductive biases, other supervision signals, etc.) as well as the interactions between them; already, we see promising indications of mixing label types in the case of \texttt{CIFAR-10S+hard} and \texttt{CIFAR-10S+dense}. 

Notwithstanding these limitations, our analysis of hard labels, soft labels, and sparse labels that interpolate between them already enables researchers to develop cost-benefit tradeoff curves in order to optimize the cost of labeling their datasets for supervised learning, and supports the development of data-efficient, generalizable ML systems. %In particular, our research affirms the promise of eliciting softness from human annotators.   

% \clearpage

% References
\bibliography{uai2023-template}
\end{document}
