% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

% Recommended, but optional, packages for figures and better typesetting:
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{subfigure}
\newsavebox{\imagebox}
\usepackage{setspace}
\usepackage{outlines}
\usepackage{algorithm}
\usepackage{algorithmic}

\usepackage{thmtools} 
\usepackage{thm-restate}
%\declaretheorem[name=Proposition]{prop}

\usepackage{xcolor, colortbl}
\definecolor{tableHeader}{RGB}{55,126,184}
\definecolor{tableLineOne}{RGB}{245, 245, 245}
\definecolor{tableLineTwo}{RGB}{255, 255, 255}
\definecolor{specialgrey}{RGB}{90, 90, 90}

\definecolor{darkred}{rgb}{0.7,0,0}

\newcount\Comments
\Comments=1
\newcommand{\kibitz}[2]{\ifnum\Comments=1{\textcolor{#1}{\textsf{\footnotesize #2}}}\fi}
\newcommand{\ben}[1]{\kibitz{magenta}{[BVR: #1]}}
\newcommand{\ian}[1]{\kibitz{red}{[IO: #1]}}
\newcommand{\lucy}[1]{\kibitz{blue}{[LUCY: #1]}}
\newcommand{\zheng}[1]{\kibitz{purple}{[ZW: #1]}}
\newcommand{\vikranth}[1]{\kibitz{darkred}{[VD: #1]}}
% \newcommand{\morteza}[1]{\kibitz{orange}{[Morteza: #1]}}
% \newcommand{\mohammad}[1]{\kibitz{purple}{[Mohammad: #1]}}

\newcommand{\gitpublic}{\href{https://github.com/deepmind/neural_testbed}{\texttt{github.com/deepmind/neural\_testbed}}}

% \newcommand{\github}{\href{https://anonymous.4open.science/r/neural_testbed-680E/README.md}{\texttt{anonymous.4open.science/r/neural\_testbed/}}}
\newcommand{\github}{\href{https://github.com/deepmind/neural_testbed}{\texttt{github.com/deepmind/neural\_testbed}}}



\input{macros}

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Evaluating High-Order Predictive Distributions in Deep Learning}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is authomatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<iosband@deepmind.com>?Subject=Your UAI 2022 paper}{Ian Osband}}
\author[1]{Zheng Wen}
\author[1]{Seyed Mohammad Asghari}
\author[1]{Vikranth Dwaracherla}
\author[1]{Xiuyuan Lu}
\author[1]{\mbox{Benjamin Van Roy}}
% Add affiliations after the authors
\affil[1]{%
    DeepMind
}

\begin{document}
\maketitle

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% ABSTRACT
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{abstract}
\vspace{-3mm}
Most work on supervised learning has focused on marginal predictions.
In decision problems, joint predictive distributions are essential for good performance.
Previous work has developed methods for assessing low-order predictive distributions with inputs sampled i.i.d. from the testing distribution.
With low-dimensional inputs, these methods distinguish agents that effectively estimate uncertainty from those that do not.
We establish that the predictive distribution order required for such differentiation increases greatly with input dimension, rendering these methods impractical.
To accommodate high-dimensional inputs, we introduce \textit{dyadic sampling}, which focuses on predictive distributions associated with random \textit{pairs} of inputs.
We demonstrate that this approach efficiently distinguishes agents in high-dimensional examples involving simple logistic regression as well as complex synthetic and empirical data.
\end{abstract}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% INTRODUCTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vspace{-2mm}
\section{Introduction}
\label{sec:intro}
\vspace{-2mm}

% \begin{outline}
% \1 We consider a learning agent that makes predictions.
%     \2 You don't just make marginal predictions, but also joint.
%     \2 Joint predictions are crucial to decision problems.
{
\medmuskip=2mu
\thinmuskip=1mu
\thickmuskip=2mu
We consider learning agents that are trained on data pairs $((X_t,Y_{t+1}): t=0,1,\ldots,T-1)$. % and subsequently generate predictive distributions given new inputs.
At a new input $X_T$, such an agent can generate a predictive distribution of the outcome $Y_{T+1}$ that is yet to be observed.
This distribution characterizes the agent's uncertainty about $Y_{T+1}$.
We refer to such a prediction as \textit{marginal} to distinguish it from a \textit{joint} predictive distribution over a sequence of prospective outcomes $(Y_{T+1},\ldots,Y_{T+\tau})$ with inputs $(X_T,\ldots,X_{T+\tau-1})$.

Predictive distributions express uncertainty about future observations.
The importance of such uncertainty estimation has motivated a great deal of research over recent years, much of it within the Bayesian deep learning community \citep{neal2012bayesian}.
% This research has produced a variety of agents that generate predictive distributions.
With the proliferation of agents that generate predictive distributions, it is increasingly important to systematically study and improve their performance.
}

% \1 People have highlighted the importance of joint predictions.
%     \2 Most of the practical work has been with $\tau=10$.
%     \2 Real-world problems require a massive $\tau$!
Recent theoretical work has highlighted the importance of joint predictive distributions in driving effective decisions \citep{wen2022predictions}.
This theory is supported by experiments that assess and compare agents using synthetic data generated by a random neural network and 2D inputs \citep{osband2022neural}.
That work evaluates the quality of joint predictive distributions over ten inputs sampled i.i.d. from the training distribution.
The results clearly distinguish agents that effectively estimate uncertainty from those that do not.
This evaluation predicts agent performance when used to guide decisions in high-dimensional `neural bandits'.

However, as the input dimension increases, the aforementioned approach to evaluating agents becomes uninformative.
As we will later discuss, the reason lies in the order of the predictive distributions being evaluated.
With a two-dimensional input, the tenth-order distribution suffices, but the predictive distribution order required to produce meaningful assessments increases rapidly with the input dimension.
We could consider scaling the predictive distribution order as needed, but the evaluation algorithms of \citet{osband2022neural} become computationally intractable.

To accommodate high-dimensional inputs, we introduce \textit{dyadic sampling}, which focuses on predictive distributions associated with random \textit{pairs} of inputs rather than those scattered according to the training input distribution.
We demonstrate that this approach efficiently distinguishes agents in high-dimensional examples involving simple logistic regression as well as complex synthetic and empirical data.
For example, agent assessments based on dyadic sampling are predictive of performance in the high-dimensional neural bandits presented by \citet{osband2022neural}.

% this sort of correlated sampling may be beneficial, and support this intuition through the rest of the paper.
% In a $D$-dimensional space, the predictions at any $\tau \ll 2^D$ points sampled i.i.d. over the space are likely to be close to independent.
% This can be true even when the correlation structure in predictions may play a crucial role in driving decision making, or evaluating the quality of predictions as $\tau \rightarrow \infty$.
% By restricting test samples to random pairs, we target an agent's ability to make predictions over some outcomes that are highly correlated and others that are not.
% Surprisingly, we will show that this simple heuristic offers several important benefits in practical problems.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Related work
\vspace{-2mm}
\subsection{Related work}
\label{sec:related_work}
\vspace{-2mm}

% \begin{outline}
% \1 Some of the robustness/OOD work maybe?
% \end{outline}

% % \1 Theory paper
% Theoretical results of \citet{wen2022predictions} suggest that an agent must be able to produce accurate predictive distributions in order to perform well in decision tasks and that the required order of these distribution grows with complexity of the environment and task.  These observations are motivate the need to assess high-order predictive distributions.  Dyadic sampling offers a means to efficiently assess high-order predictive distributions that are restricted but nevertheless informative about an agent's ability to estimate uncertainty.

% % \1 Marginal Wang paper
% % \1 Neural Testbed
% Recently, the Bayesian deep learning community has begun to consider the importance of predictive distributions beyond marginals.
% \cite{wang2021beyond} examined a notion of normalized cross-correlation in regression problems and related decision problems.
% \cite{osband2022neural} studied KL-divergence in classification and released \textit{The Neural Testbed} as an opensource library to automate this analysis.
% However, these tools and techniques are not suitable for challenging contexts addressed by deep learning, where inputs are high-dimensional.

% % \1 Some of the robustness/OOD work maybe?
% Our work shares some of the empirical orientation of \citet{hendrycks2019benchmarking} and other work on ``out of distribution robustness.''  In particular, 
% we aim to develop a practical heuristic for evaluating the quality of agent predictions.
% Unlike previous work, we examine this problem through the lens of joint predictive distributions.


% Our main motivation is from theory + neural testbed paper
We are motivated by the importance of joint predictions in driving effective decisions \citep{wen2022predictions}.
Empirical analysis of joint predictions for deep learning in 2D supports this theory \citep{osband2022neural}.
We provide a practical heuristic to scale these insights to high dimensions.


% Related Bayesian deep learning
Our research is closely related to topics in Bayesian deep learning \citep{mackay1992practical, wilson2020bayesian}, and robustness \citep{hendrycks2019benchmarking}.
For the most part, these communities have focused on the problem of marginal prediction \citep{nado2021uncertainty, wilson2021evaluating}.
Recent work has also highlighted a notion of cross-correlation in regression and related decision problems \citep{wang2021beyond}.
Our paper provides a related perspective that scales to classification and high-dimensional data.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Key contributions
\vspace{-3mm}
\subsection{Key contributions}
\label{sec:contributions}
\vspace{-2mm}


% \begin{outline}
% \1 Introduce this kappa 2 metric, provide some theoretical base.
We propose \textit{dyadic sampling}, which evaluates high-order joint predictions at random pairs of inputs.
Section~\ref{sec:theory} motivates the approach, and shows that it can mitigate some challenges in evaluating high-order predictive distributions.

% \1 Clear demonstration of this theory in the case of logistic regression.
Section~\ref{sec:logistic_regression} shows that dyadic sampling provides useful assessments in logistic regression.
As input dimension scales, i.i.d. sampling from the training distribution does not offer a feasible approach.
Dyadic sampling  offers a viable path where the evaluation of \citet{osband2022neural} is inadequate.

% \1 Incorporate this metric into Neural Testbed.
%     \2 Without this metric the testbed doesn't scale to high dimensions.
%     \2 With this metric the testbed DOES scale to high dimensions.
Section~\ref{sec:neural_testbed} extends these insights to \textit{The Neural Testbed}, an open-source package for the evaluation of joint predictions in deep learning.
As in logistic regression, the neural network generative process is not amenable to evaluation via i.i.d. sampling when the input dimension exceeds three.
In contrast, dyadic sampling scales gracefully as the input dimension grows large.
As part of this project, we submit all agent and evaluation code to \github.

% \1 First evaluations of these neural testbed agents on real data.
%     \2 We show clearly that there is a relationship between testbed and real data.
%     \2 Testbed said agents were the same tau=1... they are basically the same real data.
%     \2 Testbed performance tau=10 highly correlated with tau=10 on real data.
% \end{outline}
Section~\ref{sec:real_data} shows that our methodology can extend beyond synthetic data.
Dyadic sampling can feasibly evaluate joint predictions on high-dimensional real datasets.
We evaluate benchmark approaches to Bayesian deep learning and show that the insights from the Testbed carry over to real data.
We see that, after tuning, all agents perform similarly in terms of marginal predictions.
However, there are significant differences in the quality of \textit{joint} predictions per agent, evaluated via dyadic sampling.
Further, Testbed performance is highly predictive of performance on empirical data.




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% PREDICTIVE DISTRIBUTIONS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\vspace{-1mm}
\section{Evaluating predictives}
\label{sec:theory}
\vspace{-2mm}

This section introduces notation for the standard supervised learning framework we will consider as well as our evaluation metric: KL-loss.
We show that estimating KL-divergence in high-dimensional distributions can be challenging, and present dyadic sampling as an effective heuristic.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Formulation
\subsection{Environment and predictions}
\label{sec:theory_environment}
\vspace{-2mm}

Consider a sequence of pairs $((X_t, Y_{t+1}): t =0,1,2,\ldots)$, where each $X_t$ is a feature vector and each $Y_{t+1}$ is its target label.  Each target label $Y_{t+1}$ is produced by an {\it environment} $\environment$, which we formally take to be a conditional distribution $\environment(\cdot|X_t)$.  The environment $\environment$ is a random variable; this reflects the agent's uncertainty about how labels are generated.  Note that $\Prob(Y_{t+1} \in \cdot | \environment, X_t) = \environment(\cdot|X_t)$ and $\Prob(Y_{t+1} \in \cdot | X_t) = \E[\environment(\cdot|X_t) | X_t]$.

We consider an agent that learns about the environment from training data \mbox{$\data_T \equiv ((X_t, Y_{t+1}): t =0,1,\ldots, T-1)$}.
After training, the agent predicts testing class labels $Y_{T+1:T+\tau} \equiv (Y_{T+1}, \dots, Y_{T+\tau})$ from unlabeled feature vectors \mbox{$X_{T:T+\tau-1} \equiv (X_T, \dots, X_{T+\tau-1})$}.

% Let $\data_T \equiv ((X_t, Y_{t+1}): t =0,1,\ldots, T-1)$, be a training dataset
% where each  $X_t$ is a feature vector and each $Y_{t+1}$ is its target label.
% Feature vectors $X_t$ are i.i.d.
% Each target label $Y_{t+1}$ is independent of all other data, conditioned on $X_t$, and distributed according to $\environment (\cdot | X_t)$.
% The conditional distribution $\environment$ is referred to as the \emph{environment}.
% The environment $\environment$ is random; and this reflects the agent's uncertainty about how labels are generated given features. Note that $\Prob(Y_{t+1} \in \cdot | \environment, X_t) = \environment(\cdot|X_t)$ and $\Prob(Y_{t+1} \in \cdot | X_t) = \E[\environment(\cdot|X_t) | X_t]$.


% We consider an agent that learns about the environment from training data $\data_T$,
% and predicts class labels $Y_{T+1:T+\tau} \equiv (Y_{T+1}, \dots, Y_{T+\tau})$ 
% given test inputs $X_{T:T+\tau-1} \equiv (X_T, \dots, X_{T+\tau-1})$.

We describe the agent's predictions in terms of a generative model, parameterized by a vector $\theta_T$ that the agent learns from the training data $\data_T$. Specifically, $\theta_T$ parameterizes a distribution $\Prob(\hat{\environment} \in \cdot | \theta_T)$ over an imagined environment $\hat{\environment}$, which is also a conditional distribution. For any inputs $X_{T:T+\tau-1}$, to generate the imagined labels $\hat{Y}_{T+1:T+\tau}$, the agent first samples an imagined environment $\hat{\environment}$ from $\Prob(\hat{\environment} \in \cdot | \theta_T)$, then generates $\hat{Y}_{t+1} \sim \hat{\environment}(\cdot | X_t) $ conditionally independently for each $t=T,\ldots, T+\tau-1$.


The agent's $\tau^{\rm th}$-order predictive distribution is given by
$$\hat{P}_{T+1:T+\tau} \equiv \Prob(\hat{Y}_{T+1:T+\tau} \in \cdot | \theta_T, X_{T:T+\tau-1}),$$
which represents an approximation to what would be obtained by conditioning on the environment:
$$P^*_{T+1:T+\tau} \equiv \Prob \left(Y_{T+1:T+\tau} \in \cdot \middle | \environment , X_{T:T+\tau-1} \right).$$
If $\tau=1$, this represents a marginal prediction of a single label for a single feature vector.
For $\tau>1$, this is a joint prediction over $\tau$ labels for $\tau$ different feature vectors.
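
To make the generative description concrete, the joint predictive distribution can be written as a mixture over imagined environments.
The following is a sketch that assumes discrete labels; the fixed label sequence $y_{1:\tau}$ is notation introduced here only for illustration:
\begin{equation*}
\hat{P}_{T+1:T+\tau}(y_{1:\tau}) = \E \Big[ \textstyle \prod_{k=1}^{\tau} \hat{\environment}(y_k | X_{T+k-1}) \,\Big|\, \theta_T, X_{T:T+\tau-1} \Big].
\end{equation*}
For $\tau=1$ the product has a single factor and the expression reduces to the marginal prediction.
For $\tau>1$ the expectation sits outside the product, so the labels are coupled through the shared $\hat{\environment}$ and the joint distribution does not, in general, factorize into a product of marginals.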


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Joint predictions
\subsection{Evaluating joint predictions}
\label{sec:theory_joint}


% As you can make better and better predictions, then you really understand the world better.
A learning agent can be assessed through the quality of its predictive distribution $\hat{P}_{T+1:T+\tau}$.
A canonical approach is to evaluate the KL-divergence \citep{wen2022predictions},
\begin{eqnarray}
\label{eq:kl_delta}
\Delta_\tau &\equiv& \KL \big( P^*_{T+1:T+\tau} \big \| \hat{P}_{T+1:T+\tau} \big) \\
\label{eq:kl_tau}
\KL^\tau &\equiv& \E \big[ \Delta_\tau \big].
\end{eqnarray}
Note that the expectation is taken over all random variables, including the environment, the training data, and the test inputs.
The minimum of $\KL^\tau$ over all agents that depend on the environment only through $\data_T$ is attained by the posterior agent, whose predictive distribution is
%
% over predictive distributions $\hat{P}_{T+1:T+\tau}$ that depend on the environment only through $\data_T$ is attained by the posterior predictive
\begin{equation}
\label{eq:posterior}
\overline{P}_{T+1:T+\tau} \equiv \Prob(Y_{T+1:T+\tau} \in \cdot | \data_T, X_{T:T+\tau-1}).
\end{equation}
Let $\KLBAR^\tau$ denote the minimum achievable KL-divergence.

Algorithm~\ref{alg:kl-computation} provides a simple Monte-Carlo approach to evaluate $\KL^\tau$.
As the order of the predictive distribution $\tau$ grows, this provides a more nuanced evaluation of agent beliefs than just marginals.
% In the limit $\tau \rightarrow \infty$ $\KL^\tau$
However, even for simple problems, the magnitude of $\tau$ required to provide additional insight beyond marginals can become intractably large.
To anchor our thinking on this matter we consider a simple coin tossing example.

{\small
\begin{algorithm}[tb]
\caption{KL-Loss Estimation \citep{osband2022neural}.}
\label{alg:kl-computation}
\begin{algorithmic}
\FOR{$j=1,2,\ldots,J$}
\STATE sample environment and training data
\STATE train agent on training data
\FOR{$n=1,2,\ldots,N$}
\STATE sample $\tau$ test data pairs % $X_{T:T+\tau-1}, Y_{T+1:T+\tau}$
\STATE compute environment likelihood $p_{j,n}$
\STATE compute agent likelihood $\hat{p}_{j,n}$
\ENDFOR
\ENDFOR
\RETURN $\frac{1}{JN} \sum_{j=1}^J \sum_{n=1}^N \log \left(p_{j,n} / \hat{p}_{j,n} \right)$
\end{algorithmic}
\end{algorithm}
}
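
One way to relate the estimator returned by Algorithm~\ref{alg:kl-computation} to \eqref{eq:kl_delta} is sketched below.
It assumes discrete labels, that the test labels are drawn from the environment, and that $p_{j,n}$ and $\hat{p}_{j,n}$ are the probabilities that $P^*_{T+1:T+\tau}$ and $\hat{P}_{T+1:T+\tau}$ assign to the sampled labels; these are assumptions made for illustration rather than details of the implementation.
Under these assumptions, the definition of KL-divergence gives
\begin{equation*}
\E \Big[ \log \frac{P^*_{T+1:T+\tau}(Y_{T+1:T+\tau})}{\hat{P}_{T+1:T+\tau}(Y_{T+1:T+\tau})} \,\Big|\, \environment, \theta_T, X_{T:T+\tau-1} \Big] = \Delta_\tau .
\end{equation*}
Averaging the log-ratio $\log(p_{j,n}/\hat{p}_{j,n})$ over the $J \times N$ samples then yields a Monte-Carlo estimate of $\KL^\tau = \E[\Delta_\tau]$, which is unbiased provided both likelihoods are computed exactly.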

\vspace{-2mm}
\begin{example}[Bag of coins]
\label{ex:beta_coins}
\vspace{-2mm}
Let each $X_t$ be a sample from coins $\{1,\ldots,M\}$.  Let the probability of heads $p_x \sim {\rm Unif}(0,1)$ i.i.d. for each coin $x$.
Each observation $Y_{t+1}$ is the outcome from tossing coin $X_t$, so that $\Ec(1 | X_t) = p_{X_t}$.
\vspace{-2mm}
\end{example}

% \ian{Need to somehow introduce notation for the `best reasonable KL'... I'll go with $\KLBAR^\tau$... this is the one that is achieved by the Bayes posterior.}

% {
% \medmuskip=2mu
% \thinmuskip=1mu
% \thickmuskip=2mu
% Consider an agent that makes predictions $\Prob(\hat{Y}_{1:\tau} |X_{0:\tau-1}) = \prod_{t=0}^{\tau-1} \Prob(\hat{Y}_{t+1} |X_t) = 1/2^\tau$.
% % These predictions assume toss outcomes are independent and unbiased, but are still minimize $\KL^1$.
% Suppose an agent uses this predictive distribution to selects coins sequentially that maximize the expected number of heads.
% While $\hat{Y}_{1:\tau}$ accurately minimizes $\KL^1$, the agent assumes that toss outcomes are independent and therefore does not learn from history to improve successive choices.
% Accounting for dependencies arising in the joint distribution, as would be captured by $\KL^\tau$, is essential to maximizing performance.
% }

{
\medmuskip=2mu
\thinmuskip=1mu
\thickmuskip=2mu
Let us consider a predictive distribution for which $\Prob(\hat{Y}_{1:\tau} |X_{0:\tau-1}) = \prod_{t=0}^{\tau-1} \Prob(\hat{Y}_{t+1} |X_t) = 1/2^\tau$.
% This agent assumes that toss outcomes are independent and unbiased.
Suppose an agent uses this to select coins sequentially with the aim of maximizing the expected number of heads.
While this predictive distribution minimizes $\KL^1$, the agent assumes that toss outcomes are independent and therefore does not learn from history to improve successive choices.
Accounting for dependencies arising in the joint distribution, as would be captured by $\KL^\tau$, is essential to maximizing performance.
}

\begin{restatable}[Small $\tau$ approximately marginal]{proposition}{smalltau}
\label{prop:small_tau}
If the agent defined above is applied to Example~\ref{ex:beta_coins} with $\tau \ll M$,
\vspace{-1mm}
\begin{equation}
\KL^\tau = \KLBAR^\tau + O\left( \tau^3 /M  \right). \nonumber
\end{equation}
\end{restatable}
\begin{proof}
\vspace{-3mm}
Note that under the event that there are no repeated inputs in $X_{0:\tau-1}$, the posterior agent is equivalent to the agent defined above. For $\tau \ll M$, this event occurs with high probability. The detailed proof is in Appendix~\ref{app:proof_small_tau}.
\end{proof}
\vspace{-2mm}
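
To get a sense of scale for the event used in the proof sketch, suppose the inputs are drawn independently and uniformly from the $M$ coins with $\tau \le M$.
Uniform sampling is an assumption made here for illustration; Example~\ref{ex:beta_coins} only states that each $X_t$ is a sample from the coins.
Then
\begin{equation*}
\Prob(\text{no repeated inputs in } X_{0:\tau-1}) = \prod_{k=0}^{\tau-1} \left( 1 - \frac{k}{M} \right) \geq 1 - \frac{\tau(\tau-1)}{2M}.
\end{equation*}
For instance, with $\tau=10$ and $M=10^4$ the lower bound is $1 - 45/10^4 = 0.9955$, so in nearly every batch of ten tosses no coin is tossed twice and the joint prediction carries no more information than the marginals.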


% Give some intuition for this result, and then show how you could remedy this.
Proposition~\ref{prop:small_tau} shows that if $\tau \ll M$, then $\KL^\tau$ is unable to distinguish agents that only match marginals from those that are useful for decision making.
When the cardinality of the input space $M$ is much larger than the test distribution order $\tau$, it is unlikely that any correlated inputs will be sampled.
The metric $\KL^\tau$ punishes agents that impose an erroneous correlation, but is unlikely to reward agents that correctly capture this dependence until $\tau$ is sufficiently large.


% Remedy this approach with polyadic sampling.
In Example~\ref{ex:beta_coins}, it may suffice to use a value of $\tau$ that grows cubically in $M$.
However, due to the curse of dimensionality,
% when inputs are sampled from a Euclidean space and the relation between input and output is flexible,
the required magnitude of $\tau$ can grow exponentially in problem dimension.
To handle such cases, we introduce a practical evaluation metric that correctly identifies high-quality predictive distributions with modest values of $\tau$.

For this purpose, we introduce notation for assignment: for random variables $A, B, C$ and a function $f(c) \equiv \E[A|B = c]$, let $\E[A|B\leftarrow C] = f(C)$.  Note that, in general, if $C$ is a random variable then $\E[A|B\leftarrow C] \neq \E[A|B = C]$.

\begin{definition}[Polyadic test sampling (of order $\kappa$)]
\label{def:polyadic}
For any $\kappa \in \Nat$, let `anchor points' $\overline{X}_{1:\kappa}$ be drawn i.i.d. from $\Prob(X_t \in \cdot)$, and let $\tilde{X}^\kappa_{T:T+\tau-1} \sim {\rm Unif}\{\overline{X}_{1:\kappa}\}$.
We define,
\begin{eqnarray}
\label{eq:polyadic}
% \KL^{\tau, \kappa} &\equiv& \Exp\left[ \Exp\left[ \Delta_\tau  \mid X_{T:T+\tau-1} = \tilde{X}^\kappa_{T:T+\tau-1} \right] \right]. \\
\KL^{\tau, \kappa} &\equiv& \Exp\left[ \Exp\left[ \Delta_\tau  \mid X_{T:T+\tau-1} \leftarrow \tilde{X}^\kappa_{T:T+\tau-1} \right] \right].
\end{eqnarray}
\end{definition}

% % Restate the definition so people can understand it.
% Polyadic sampling draws inputs by first sampling $\kappa$ anchor points from the training input distribution, and then repeatedly sampling from these anchor points.

% Some motivation of polyadic sampling
Polyadic sampling is motivated by a desire to investigate an agent's predictions in situations where correlation between predictions is more likely to play an important role.
Under reasonable regularity assumptions, $\lim_{\kappa \rightarrow \infty} \KL^{\tau, \kappa} = \KL^\tau$ for all $\tau$.
In the case of $\kappa < \tau$, we can ensure that at least one input will be sampled multiple times.
In the special case of $\kappa=1$, we call this \textit{monadic} sampling.

\begin{restatable}[Monadic sampling cannot spot bad agents]{proposition}{monadic}
\label{prop:monadic}
Consider an agent that ignores the inputs and predicts
$$\Prob(\hat{Y}_{1:\tau} | X_{0:\tau-1}) = \Prob(\hat{Y}_{1:\tau}),$$
where $\hat{Y}_{1}, \dots, \hat{Y}_{\tau}$ are sampled independently from ${\rm Ber}(\hat{p})$ with a shared parameter $\hat{p} \sim {\rm Unif}(0, 1)$.
Then, for any $\tau \in \Nat$ in Example~\ref{ex:beta_coins} this agent achieves the minimum $\KL^{\tau,1}$ over all agents.
\end{restatable}

\begin{proof}
\vspace{-3mm}
This agent is constructed so that, for any repeated inputs $X_{0:\tau-1}=\tilde{X}^1_{0:\tau-1}$, its predictive distribution matches that of the posterior agent.
\end{proof}
\vspace{-2mm}
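
The match claimed in the proof can be checked directly when there is no training data, which is how we read the indices $X_{0:\tau-1}$; this reading is an assumption.
Let $y_{1:\tau} \in \{0,1\}^\tau$ be a label sequence with $h$ heads (notation introduced here for illustration).
When all $\tau$ inputs equal a single coin $x$, the bias $p_x$ and the agent's shared parameter $\hat{p}$ are both ${\rm Unif}(0,1)$, so both the posterior agent and the agent of Proposition~\ref{prop:monadic} assign $y_{1:\tau}$ the probability
\begin{equation*}
\int_0^1 p^{h} (1-p)^{\tau-h} \, dp = \frac{h! \, (\tau-h)!}{(\tau+1)!}.
\end{equation*}
For example, with $\tau=2$ two heads have probability $1/3$, rather than the $1/4$ implied by independent fair tosses.
The two agents differ only when the inputs are distinct, which monadic sampling never tests.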

% \begin{lemma}[Monadic sampling cannot spot bad agents]
% \label{lemma:monadic}
% Consider an agent that predicts
% $$\Prob(\hat{Y}_{T+1:T+\tau} \in \cdot | X_{T:T+\tau-1}) = \Prob(\hat{Y}_{T+1:T+\tau} \in \cdot),$$
% where $\hat{Y}_{T+1}, \dots, \hat{Y}_{T+\tau}$ are sampled independently from ${\rm Ber}(\hat{p})$ with a shared parameter $\hat{p} \sim {\rm Unif}(0, 1)$.
% Then, for any $\tau \in \Nat$ in Example~\ref{ex:beta_coins} this agent would achieve the minimum $\KL^{\tau,1}$ over all agents.
% \begin{proof}
% This agent is constructed so that for any repeated inputs $X_{T:T+\tau-1}=\tilde{X}^1_{T:T+\tau-1}$ the agent matches the posterior distribution.
% \end{proof}
% \end{lemma}


% Monadic sampling is not very good
Monadic sampling can examine whether an agent understands the correlation structure at a single input.
However, Proposition~\ref{prop:monadic} shows that it does not punish agents that erroneously ascribe correlation to independent input-output pairs.
The agent of Proposition~\ref{prop:monadic} achieves the best possible score in $\KL^{\tau, 1}$ but is useless for driving decisions.
In order to weed out these agents it is crucial to sample more than one input point.
% Lemma~\ref{lemma:monadic} shows that this approach can be woefully shortsighted when it comes to evaluating an agent for decision making.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Joint predictions
\subsection{Dyadic sampling $(\kappa=2)$}
\label{sec:theory_dyadic}

% \begin{outline}

% \1 We want to introduce dyadic sampling as a method of sampling \textit{pairs} of points.
This paper introduces \textit{dyadic} test input sampling $(\kappa=2)$ as a practical heuristic for assessing the quality of joint predictions in high dimensions.
This sampling scheme samples two random anchor points from the input space, and then randomly resamples the $\tau$ inputs from these anchor points.
Even with moderate $\tau=10$, we can be sure that most batches will contain a mix of points that are highly correlated with each other, as well as some others that may be quite different.
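
The claim about mixed batches can be quantified under a simple assumption: each of the $\tau$ inputs is drawn independently and uniformly from two distinct anchor points, as in Definition~\ref{def:polyadic} with $\kappa=2$.
A batch fails to contain both anchors only if all $\tau$ draws land on the same one, so
\begin{equation*}
\Prob(\text{all } \tau \text{ inputs coincide}) = 2 \cdot 2^{-\tau} = 2^{1-\tau}.
\end{equation*}
For $\tau=10$ this is $1/512$, or about $0.002$.
Moreover, by the pigeonhole principle, at least $\lceil \tau/2 \rceil$ of the inputs share an anchor, so every batch contains repeated inputs.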


% Dyadic sampling is a heuristic that fixes some of the issues, but doesn't fix everything.
Dyadic sampling is a heuristic approach designed to work well in practical problems.
The choice of $\kappa=2$ addresses the extreme shortcomings of $\KL^{\tau, \kappa}$ identified by Propositions~\ref{prop:small_tau} and~\ref{prop:monadic} in the settings $\kappa \rightarrow \infty$ and $\kappa=1$, respectively.
However, it is not a perfect substitute for evaluating $\KL^\tau$ with very large $\tau$.
Depending on the setting, it is possible to design agents that fare very well according to $\KL^{\tau, 2}$ but very poorly according to $\KL^\tau$.


% Should we be picking a different kappa=3?
One might ask, `Does some other $1 < \kappa < \infty$ provide a better candidate for practical evaluation of posterior predictives?'
Could there be a result analogous to Proposition~\ref{prop:monadic} when considering $\kappa=2$, but evaluating posterior predictions at \textit{three} anchor points?
Note that, since $\KL^{\tau, 2}$ already evaluates the quality of the joint predictions at any pair of inputs, for most problems the distribution over any three inputs will also be estimated well.
In particular,
for any Gaussian process, the first two moments are enough to determine the entire distribution of $\Ec$.
We push details to Appendix~\ref{app:gaussian_process}.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Joint predictions
\subsection{Joint predictions and information}
\label{sec:theory_information}

% \ian{Think some of the ideas Zheng/Ben were talking about could be useful here.}

% \begin{outline}

% \1 This is a subsection where we look at the meaning of these joint predictions through information gain.
% \fillpara

% \1 Derive the decomposition for the posterior via chain rule.
% \fillpara

% \1 Give some argument for why we expect realistic problems in high dimensions to behave a bit like the problems we outline in our claims.
% \fillpara

% \1 Give enough to sketch an argument but push details to Appendix~\ref{app:theory}

% \end{outline}


% Expand on the intuition of dyadic sampling with an information theory perspective
So far, we have motivated dyadic sampling mostly through appeal to Example~\ref{ex:beta_coins}, together with some heuristic arguments.
In this subsection we expand on this intuition through the lens of information theory.

% The posterior agent is optimal and we can breakdown the difference in \KL^tau, KL^1
To illustrate this, let's consider the posterior agent, which is optimal for generating predictive distributions. Note that under the posterior agent,
% If we consider the posterior agent \eqref{eq:posterior}, which is optimal for generating predictive distributions,
\begin{align}
    \KL^\tau =& \, \I \big (Y_{T+1:T+\tau}; \environment \big | \data_T, X_{T:T+\tau-1} \big) \nonumber \\
    =& \, \textstyle \sum_{t=T}^{T+\tau-1} 
    \I \big (Y_{t+1}; \environment \big | \data_T, \data_{T:t}, X_t \big),
\end{align}
where $\I(\cdot)$ denotes the (conditional) mutual information \citep{cover1999elements} and
$\data_{T:t} \equiv (X_{T:t-1}, Y_{T+1:t})$.
Note that the second equality follows from the chain rule of mutual information. On the other hand, 
\begin{align}
    \tau \KL^1 =& \, \tau \I \big (Y_{T+1} ; \environment \big | \data_T, X_{T} \big) \nonumber \\
    =& \, \textstyle \sum_{t=T}^{T+\tau-1} 
    \I \big (Y_{t+1} ; \environment \big | \data_T, X_{t} \big),
\end{align}
where the second equality follows because the inputs $X_{T:T+\tau-1}$ are i.i.d.

% KL^tau only going to give benefits when the points are very informative... dyadic sampling is a way to do this.
For $\KL^\tau$ to differ significantly from $\tau \KL^1$, the dataset $\data_{T:t}$ must be informative about the target label $Y_{t+1}$ at $X_t$ for at least one $t$.
For practical problems with $\tau$ small relative to the input space, $\data_{T:t}$ is typically not informative about $Y_{t+1}$.
In such cases, we have $\KL^\tau \approx \tau \KL^1$.
One way to think about dyadic sampling is as a heuristic approach to sample $X_{T:T+\tau-1}$ so that $\data_{T:t}$ is particularly informative about $Y_{t+1}$, and so to evaluate the quality of the posterior approximation.
Depending on the problem settings, other input sampling schemes may also be appropriate to accomplish this goal.
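
The gap between the two decompositions above can be made explicit.
As a sketch, assume (an assumption stated here for illustration) that, conditioned on $\environment$ and $X_t$, the label $Y_{t+1}$ is independent of all other inputs and labels.
Then, for each $t$,
\begin{align*}
    & \I \big(Y_{t+1}; \environment \big | \data_T, \data_{T:t}, X_t \big) - \I \big(Y_{t+1}; \environment \big | \data_T, X_t \big) \\
    &\quad = \I \big(Y_{t+1}; \data_{T:t} \big | \data_T, X_t, \environment \big) - \I \big(Y_{t+1}; \data_{T:t} \big | \data_T, X_t \big) \\
    &\quad = - \I \big(Y_{t+1}; \data_{T:t} \big | \data_T, X_t \big),
\end{align*}
where the first equality expands each mutual information as a difference of conditional entropies, and the second uses the conditional independence assumption.
Summing over $t$ gives
\begin{align*}
    \KL^\tau = \tau \KL^1 - \textstyle \sum_{t=T}^{T+\tau-1} \I \big(Y_{t+1}; \data_{T:t} \big | \data_T, X_t \big),
\end{align*}
so, under the posterior agent, the difference between $\tau \KL^1$ and $\KL^\tau$ is exactly the total information that earlier test pairs carry about later test labels.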








%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LOGISTIC REGRESSION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Logistic regression}
\label{sec:logistic_regression}

% This is a section to outline the logistic regression problem.
% This is meant to be a simple sanity check that illustrates the points we made in Section \ref{sec:beyond_marginals}.

The results of Section \ref{sec:theory} provide a motivation for dyadic sampling where it can sidestep the curse of dimensionality in higher-order predictive distributions.
In this section, we show that this effect can occur in practical settings, not just in contrived problems constructed for theory.
In fact, even for the canonical problem of logistic regression, the benefits of dyadic sampling can be significant.

% \fillpara

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Logistic problem
\vspace{-1mm}
\subsection{Problem formulation}
\label{sec:logistic_problem}
\vspace{-1mm}

% \begin{outline}
% \1 Describe the logistic regression problem formulation.
%     \2 We are choosing this because it's really simple.
We consider the familiar problem of $D$-dimensional logistic regression.
Inputs are sampled i.i.d. $X_t \sim N(0,I_D)$ and the environment $\Ec$ is determined by parameter $\phi \sim N(0, I_D)$.
Outputs $Y_{t+1} \in \{0, 1\}$ are then sampled according to
$$\Prob(Y_{t+1} = 1 | \Ec, X_t) = \frac{\exp(\rho \phi^T X_t)}{\exp(\rho \phi^T X_t) + 1}.$$
Here $\rho > 0$ is the temperature controlling signal to noise ratio (SNR).
We set $\rho\hspace{-0.5mm}=\hspace{-0.5mm}0.01$ for a high SNR setting.

% \1 Describe our three `canary' agents:
%     \2 Uniform
%     \2 Marginal
%     \2 Prior
In this simple setting, we can compare three agents that make predictions $\hat{Y}_{1:\tau}$ given inputs $X_{0:\tau-1}$.
{
\medmuskip=2mu
\thinmuskip=1mu
\thickmuskip=2mu
\begin{enumerate}[noitemsep, nolistsep, leftmargin=*]
    \item \textbf{\texttt{uniform}}: $\Prob(\hat{Y}_{t+1} = 1 | X_t) = \frac{1}{2}$ for $t=0,1,\ldots$.
    \item \textbf{\texttt{marginal}}: Samples $\lambda \sim N(0, 1)$, and then predicts $\Prob(\hat{Y}_{t+1} = 1 | \lambda, X_t) = \frac{\exp(\rho \lambda \|X_t\|_2)}{ \exp(\rho \lambda \|X_t\|_2) + 1}$ for $t=0,1,\ldots$.
    \item \textbf{\texttt{prior}}: Samples $\hat{\phi} \sim N(0, I_D)$, and then predicts $\Prob(\hat{Y}_{t+1} = 1 | \hat{\phi}, X_t) = \frac{\exp(\rho  \hat\phi^T X_t)}{\exp(\rho \hat\phi^T X_t) + 1}$ for $t=0,1,\ldots$.
\end{enumerate}
}
% \ian{This probably needs some adult notation supervision.}

% \1 Explain that we pick these agents because we know exactly what \textit{should} be going on in this case.
% \end{outline}
The agents are chosen to highlight specific properties of the logistic regression problem.
The \texttt{uniform} agent makes the correct marginal predictions at any input, but does not capture any correlation among $Y_{1:\tau}$.
The \texttt{marginal} agent makes the correct marginal predictions, and it also gets the correct joint distribution if inputs $X_{0:\tau-1}$ are all sampled at a \textit{single} point (monadic sampling).
However, it introduces spurious correlation among the predicted outputs if the inputs are not all equal.
The \texttt{prior} agent samples from the true prior, and so is optimal for all $\KL^{\tau, \kappa}$.
We would like to have a practical evaluation metric that can separate this \textit{optimal} agent from these sub-optimal approximations.
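
These properties can be checked directly from the agent definitions; the following is a short sketch for two fixed inputs $x, x'$ (lowercase notation and the symbol $\sigma$ are used only for this illustration).
Write $\sigma(z) = e^z / (e^z + 1)$, which satisfies $\sigma(z) + \sigma(-z) = 1$.
Since $\phi \sim N(0, I_D)$ is symmetric about zero, averaging $\sigma(\rho \phi^T x)$ over $\phi$ gives $\frac{1}{2}$, which is the prediction of the \texttt{uniform} agent.
For the joint distribution, both $(\phi^T x, \phi^T x')$ and $(\lambda \|x\|_2, \lambda \|x'\|_2)$ are zero-mean jointly Gaussian with variances $\|x\|_2^2$ and $\|x'\|_2^2$, but their covariances differ:
\begin{align*}
    \mathrm{Cov}(\phi^T x, \phi^T x') &= x^T x', \\
    \mathrm{Cov}(\lambda \|x\|_2, \lambda \|x'\|_2) &= \|x\|_2 \|x'\|_2.
\end{align*}
By the Cauchy--Schwarz inequality these agree only when $x$ and $x'$ are positive multiples of one another, which includes the monadic case $x = x'$.
For independently sampled inputs in high dimension, $x^T x'$ is typically much smaller than $\|x\|_2 \|x'\|_2$, so the \texttt{marginal} agent overstates the correlation between the two labels.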

We consider metrics $\KL^\tau$ and $\KL^{\tau, \kappa}$ for $\kappa = 1, 2$, all of which are estimated through Monte Carlo sampling according to Algorithm \ref{alg:kl-computation}.
Despite the simplicity of this problem, only dyadic sampling ($\KL^{\tau, \kappa=2}$) can correctly separate the agents once the input dimension grows.




    


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Logistic high
\vspace{-2mm}
\subsection{Results}
\label{sec:logistic_results}
\vspace{-1mm}



\begin{figure}[!ht]
    \centering
    \includegraphics[width=0.95\columnwidth]{figures/logistic_tau_scaling.pdf}
    \vspace{-3mm}
    \caption{$\KL^\tau$ can separate the \texttt{prior} agent from \texttt{uniform}, but the required $\tau$ quickly becomes intractable as the input dimension grows. Dyadic sampling $\KL^{\tau, \kappa=2}$ can distinguish these agents with a small value of $\tau$.}
    \label{fig:logistic_tau_scaling}
    \vspace{2mm}
\end{figure}

\begin{figure}[!ht]
    \centering
    \includegraphics[width=0.95\columnwidth]{figures/logistic_agents.pdf}
    \vspace{-3mm}
    \caption{Comparing KL estimates under different input sampling schemes.
    Sampling test inputs i.i.d. cannot distinguish the \texttt{uniform} agent from the \texttt{prior} agent in high dimensions.
    Monadic sampling cannot distinguish the \texttt{prior} agent from \texttt{marginal}.
    Dyadic sampling correctly separates the \texttt{prior} agent from both \texttt{uniform} and \texttt{marginal}.}
    \label{fig:logistic_agents}
\end{figure}


% \begin{outline}
% \1 $\KL^\tau$ could separate agents, but you need exponential $\tau$ in dimension. $\KL^{\tau, 2}$ can do it in $\tau=10$ empirically (Figure \ref{fig:logistic_tau_scaling}).
As the \texttt{prior} agent makes optimal predictions in this problem, in principle, this agent will outperform all others according to $\KL^\tau$ as the order of the predictive distribution $\tau$ grows.
However, to separate the agents, the required $\tau$ can quickly become intractable in the input dimension. 

Figure~\ref{fig:logistic_tau_scaling} shows that, in logistic regression, for dimension $D \ge 5$, even $\tau=10{,}000$ is insufficient to give a factor of $2$ separation between the optimal \texttt{prior} agent and the uninformed \texttt{uniform} agent.
The computational cost of evaluating $\KL^\tau$ grows with $\tau$, so this quickly becomes impractical even for relatively small-scale problems.
By contrast, evaluation with $\KL^{\tau, \kappa=2}$ is able to identify this separation with only $\tau=10$ even as the input dimension grows.


% \1 If we fix $\tau=10$ then this does not scale up to high dimensions. $\kappa=1$ cannot distinguish the marginal agent from prior. $\kappa=2$ works as desired in this setting (Figure \ref{fig:logistic_agents}).
Figure~\ref{fig:logistic_agents} shows that this scaling carries over to high dimensions, fixing $\tau=10$.
Sampling test inputs i.i.d. cannot distinguish the \texttt{uniform} agent from the \texttt{prior} agent in dimensions greater than $100$.
At a high level, this result matches the spirit of Proposition~\ref{prop:small_tau}.
Figure~\ref{fig:logistic_agents} also shows that monadic sampling $\KL^{\tau, \kappa=1}$ cannot distinguish the \texttt{prior} agent from the \texttt{marginal} agent.
This mirrors Proposition~\ref{prop:monadic}, but in a setting with generalization.
Dyadic sampling $\KL^{\tau, \kappa=2}$ correctly identifies that the \texttt{prior} agent is superior across all input dimensions.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Trying to get the testbed table and results on the page that I want them!

\begin{table*}[!th]
\caption{Summary of benchmark agents, full details in Appendix \ref{app:testbed_agents}.}
\vspace{-4mm}
\begin{center}
{\footnotesize
\begin{tabular}{|l|l|l|}
\hline
\rowcolor{tableHeader}
\textcolor{white}{\textbf{agent}}          & \textcolor{white}{\textbf{description}}            & \textcolor{white}{\textbf{hyperparameters}} \\[0.5ex]  \hline
\textbf{\texttt{mlp}}            & Vanilla MLP        &  $L_2$ decay                        \\
\textbf{\texttt{ensemble}}       & `Deep Ensemble' \citep{lakshminarayanan2017simple}          & $L_2$ decay, ensemble size                        \\
\textbf{\texttt{dropout}}    & Dropout \citep{Gal2016Dropout}             &          $L_2$ decay, network, dropout rate                \\
\textbf{\texttt{bbb}}            & Bayes by Backprop  \citep{blundell2015weight}        &    prior mixture, network, early stopping                \\
\textbf{\texttt{hypermodel}}     & Hypermodel \citep{Dwaracherla2020Hypermodels} &                    $L_2$ decay, prior, bootstrap, index dimension \\
\textbf{\texttt{ensemble+}} & Ensemble + prior functions  \citep{osband2018rpf}  &      $L_2$ decay, ensemble size, prior scale, bootstrap                    \\
\textbf{\texttt{sgmcmc}}         & Stochastic Langevin  MCMC \citep{welling2011bayesian}  &               learning rate, prior, momentum           \\ \hline
\end{tabular}
}
\end{center}
\label{tab:agent_summary}
\end{table*}


\begin{figure*}[!th]
    \centering
    \includegraphics[width=0.99\textwidth]{figures/overall_performance_global_kappa_100D.pdf}
    \vspace{-4mm}
    \caption{Comparing different agents on the testbed problems with input dimension $D=100$.
    We see that the results for marginal $\KL^1$ and joint $\KL^{10}$ with i.i.d. test sampling do not show any significant difference in performance.
    By contrast, dyadic sampling $\KL^{10, 2}$ clearly separates agent performance in joint versus marginals.}
    \label{fig:testbed_global_kappa_100D}
\end{figure*}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


% \1 We push more discussion of robustness/sensitivity of these results to Appendix \ref{app:logistic}.
These results clearly demonstrate that the theoretical concerns raised in Section~\ref{sec:theory} actually occur in practical problems.
% We provide empirical evaluations that match the predictions of Propositions~\ref{prop:small_tau} and \ref{prop:monadic}.
Further, these concerns can occur even in the simple setting of logistic regression, rather than only in contrived scenarios.
We push the details on the robustness/sensitivity of these results to Appendix~\ref{app:logistic}.

% \end{outline}




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% NEURAL TESTBED
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\section{The Neural Testbed}
\label{sec:neural_testbed}

% \begin{outline}
% \1 Previously \citep{osband2021evaluating} offered a simple 2D testbed to evaluate predictive distributions. 
% \1 We extend this to higher dimensional problems, but the  $\KL^\tau$ metric is not practical as we go for higher dimensions due to the requirement of large $\tau$, as shown in Section \ref{sec:logistic_regression}.
% \1 Kappa2 metric for joint performance estimation.

% \end{outline}


In this section we show that the insights observed in the linear setting of Section~\ref{sec:logistic_regression} extend to nonlinear function approximation and neural networks.
\citet{osband2022neural} introduce the Neural Testbed as a simple synthetic 2D problem to evaluate posterior predictives in deep learning.
We show that, using the existing $\KL^\tau$ evaluation, this approach does not scale to higher dimensions.
However, using dyadic sampling we are able to extend these insights to practical scales.
As part of our work we contribute these changes to \github.


% Previously it had only been for 2D... we show a reasonable way to scale this up.

% \citep{osband2021evaluating} introduces neural testbed to evaluate agents on their predictive uncertainties. The main limitation of the testbed is that it only considered 2D problems. Extending this to higher dimensional problems is not trivial due to need of scaling $\tau$ to observe a meaningful separation between the agents. In this section, we scale the neural testbed by introducing problems with both $10$ and $100$ dimensions in addition to $2$ dimensional problems in the original testbed \citep{osband2021evaluating}. In the new testbed, we make use of kappa2 metric to provide us insights, about predictive distributions of agents, without the need to use large $\tau$.

% Show how we take these insights from that simple problem and apply them to the neural testbed.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Testbed problem
\subsection{Problem formulation}
\label{sec:testbed_problem}

% \begin{outline}
% \1 Same generative process as the one in \citep{osband2021evaluating}.
% \1 We sweep over
% \2 Input dimension in $\{2, 10, 100\}$
% \2 Temperature in $\{ 0.01, 0.1, 0.5 \}$
% \2 Data ratio in $\{1, 10, 100, 1000 \}$
% \2 5 random seeds per setting
% \2 Evaluation with $\tau=1$ and $\tau=10$. 
% \2 Results are aggregated over all input dimensions, temperatures and data ratio per each tau. 
% \end{outline}

% Introducing the neural testbed
The Neural Testbed works with a synthetic data generating process around random 2-layer MLPs \citep{osband2022neural}.
For each random seed, a random neural network is sampled according to standard Xavier initialization \citep{glorot2010understanding}.
Then, random train/test inputs are sampled $X_{1:T+T'} \sim N(0,I)$ and labels are assigned randomly according to the probabilities of the generative MLP.
We follow the exact settings in the existing opensource package except for two key changes.

% We add the new evaluation metrics and high dimensions
First, we supplement the existing evaluation by $\KL^1$ and $\KL^{10}$ with evaluation according to $\KL^{10, \kappa=2}$.
Second, we vary the input dimension of the problem (which is fixed at $D=2$ in the original Neural Testbed release).
To account for the different data requirements in higher dimensions, we scale the number of training pairs in each data regime with the input dimension.

% We then define the full sweep
The full testbed sweep is defined over input dimensions $D \in \{2, 10, 100\}$, 
number of training pairs $T = \lambda D$ for $\lambda \in \{1, 10, 100, 1000\}$,
temperature $\rho \in \{0.01, 0.1, 0.5\}$ with 5 random seeds in each setting.
We push full details, together with opensource implementation, to Appendix \ref{app:neural_testbed}.
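
As a quick count, and assuming the sweep is the full cross product of the settings listed above, it comprises
\[
    3 \times 4 \times 3 \times 5 = 180
\]
problem instances (input dimensions, data ratios, temperatures and random seeds respectively).
The training set sizes $T = \lambda D$ range from $T = 2$ (for $\lambda = 1$, $D = 2$) to $T = 100{,}000$ (for $\lambda = 1000$, $D = 100$).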



% \begin{outline}
% \1 Give some overview about what the neural testbed is.
    
% \1 Outline the problems and sweep.
    
% \1 Outline the benchmark agents that we sweep over.
% \end{outline}

% We use the same generative process as \citep{osband2021evaluating}. For a generative model, the Testbed samples a 2-hidden-layer 
% MLP with $2$ output units, which are scaled by $1/{\rm temperature}$ beforing passing through a softmax layer to produce class probabilities. The MLP is sampled according to standard Xavier initialization \citep{glorot2010understanding}, with the exception that biases in the first layer are drawn from $N(0, \frac{1}{2})$. The inputs $(X_t:t=0,1,\ldots)$ are drawn i.i.d. from $N(0, I_d)$. The agent is provided with the data generating process as prior knowledge. We use the kappa2 metric, described in Section \ref{}, as the evaluation metric.

% We sweep over the following parameters in the Testbed. We use input dimension $d \in \{2, 10, 100 \}$, temperature from $\{0.01, 0.1, 0.5\}$, data ratio which is the ratio between training data size and input dimension is chosen from $\{1, 10, 100, 1000\}$. We evaluate agents on both marginal $\tau=1$ and joint $\tau=10$ predictions, using kappa2 metric.

% In terms of agents we test {\tt mlp}, {\tt ensemble}, {\tt dropout},  {\tt bbb},  {\tt hypermodel},  {\tt ensemble+} and {\tt sgmcmc} agents. The more details about these agents can be found in the appendix. 



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Benchmark agents
\subsection{Benchmark agents}
\label{sec:testbed_agents}

% \begin{outline}
% \1 Discuss the agents which we use on the testbed.
% \end{outline}

To compare the performance of benchmark agents we make use of the opensource agents developed by \citet{osband2022neural}.
Table~\ref{tab:agent_summary} lists agents that we study and compare as well as hyperparameters that we tune.
In our experiments, we optimize these hyperparameters via grid search.
The choices start from the defaults released in \github, but extend and tweak some hyperparameter choices for high dimensional problems.
Further detail on these agents is provided in Appendix \ref{app:testbed_agents}.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Testbed high
\subsection{Overall results}
\label{sec:testbed_results}


% \begin{outline}

% \1 Motivate the kappa2 metric with ensemble and ensemble+ agents. Use Figure \ref{fig:testbed_input_dim}. We know from downstream tasks in \citep{osband2021evaluating} that ensemble+ should be better.
Figure~\ref{fig:testbed_global_kappa_100D} shows the KL estimates for these agents, normalized so that the baseline MLP has a score of 1.
In each case, these agents are tuned for performance on the Neural Testbed for input dimension 100.
We can see that in this setting evaluation in $\KL^{10}$ is statistically indistinguishable from that of marginal predictions.
We also see that, for the most part, the quality of these marginal predictions is not massively improved versus the MLP.
However, unlike the 2D testbed results, we do see that some of these more advanced approaches \textit{can} improve marginal predictions.


% \1 $\KL^\tau$ with global input distribution fails to offer insights beyond marginal with $\tau=10$ as shown in \ref{fig:testbed_global_kappa_100D}, unlike the kappa2 metric.
By contrast, we see that evaluating agents according to dyadic sampling leads to large distinctions in their evaluations.
Interestingly, these qualitative results match the ordering under $\KL^{10}$ with i.i.d. test sampling in the 2D setting.
\citet{osband2022neural} showed that this order was highly correlated with performance in sequential decision problems, even for high input dimension.
Our results provide a significant new finding:
in high dimensional problems dyadic sampling can provide a more targeted signal of suitability for downstream tasks.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Priors high dimension
\subsection{Priors in high dimensions}
\label{sec:testbed_priors}


% \1 Testbed offers a simple and clean platform to gain insights into predictive distributions of agents. For example, we observe that the number of training samples up to which ensemble+ agent performs better than an ensemble agent increases with input dimension. See Figure \ref{fig:testbed_data_benefits}.

% Ensemble vs Ensemble+ is a great comparison, know it's important for decisions but not sure if it is important at scale.
One of the clearest and most interesting pairs of agents to compare is \texttt{ensemble} and \texttt{ensemble+}.
These agents are identical except for the addition of randomized, fixed prior networks.
Prior work has shown that this difference can be crucial in high-dimensional decision problems \citep{osband2018rpf, burda2018rnd}.
Comparison of joint predictions $\KL^{10}$ in 2D problems also showed a significant difference, but only for very small training sets $T \le 30$.
The question remained: do these randomized priors provide value in large scale supervised learning?


% We show that it is important at dimension scale, and then that the amount of data it is important also scales with the dimension.
Figure~\ref{fig:testbed_input_dim} shows that, according to $\KL^{10}$, the benefits of \texttt{ensemble+} appear to evaporate for input dimensions $\ge 2$.
However, using dyadic sampling ($\kappa=2$) we can see there are huge differences in the quality of their posterior approximation that extend to high dimensional problems.
Figure~\ref{fig:testbed_data_benefits} shows that, as we increase the dimensionality of the problem, the size of the largest training sets where prior functions afford significant advantages also increases.
Rather than becoming irrelevant in large problems, the importance of good inductive bias actually \textit{increases} with input dimension.

\begin{figure}[!ht]
    \centering
    \includegraphics[width=0.99\columnwidth]{figures/testbed_input_dim.pdf}
    \vspace{-3mm}
    \caption{Global input sampling separates \texttt{ensemble} from \texttt{ensemble+} only for low input dimension. Local sampling $\kappa=2$ scales to high dimensions.}
    \label{fig:testbed_input_dim}
\end{figure}

\begin{figure}[!ht]
    \centering
    \includegraphics[width=0.75\columnwidth]{figures/testbed_data_benefits.pdf}
    \vspace{-3mm}
    \caption{The benefits of \texttt{ensemble+} over \texttt{ensemble} occur in the `low data regime'. However, the amount of data that counts as `low data' grows with input dimension.}
    \label{fig:testbed_data_benefits}
\end{figure}


% \end{outline}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% REAL DATA
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Real data}
\label{sec:real_data}


% Run the agents on real data now. Insights carry over from testbed.
In this section we show that the key insights gained from the synthetic neural testbed can carry over to real datasets.
We replace the neural network generative process of Section~\ref{sec:neural_testbed} with small challenge datasets drawn from the deep learning literature.
We then tune the agents of Table~\ref{tab:agent_summary} for each of these settings and analyse the results.
We find that all agents can be tuned to perform roughly equivalently in terms of marginal predictions.
However, their performance differs greatly in terms of joint predictions as measured by dyadic sampling.
Further, agent performance on the testbed is highly correlated with performance on real datasets.

\bgroup
\begin{table*}[!t]
\caption{Summary of benchmark datasets studied, full details in Appendix~\ref{app:real_data}.}
\vspace{-4mm}
\begin{center}
{\centering
\footnotesize
 \begin{tabular}{| l | c c c c|} 
 \hline
  \rowcolor{tableHeader}
 \textcolor{white}{\textbf{dataset name}} & \textcolor{white}{\textbf{type}} & \textcolor{white}{\textbf{\# classes}} & \textcolor{white}{\textbf{input dimension}} & \textcolor{white}{\textbf{\# training pairs}} \\ [0.5ex] 
 \hline
iris & structured & 3 & 4 & 120 \\
wine quality & structured & 11& 11 & 3,918 \\
german credit numeric & structured & 2& 24 & 800 \\
mnist & image & 10 & 784 & 60,000 \\
fashion-mnist &image & 10 & 784 & 60,000 \\
mnist-corrupted/shot-noise & image & 10 & 784 & 60,000 \\
emnist/letters & image & 37 & 784 & 88,800 \\
emnist/digits & image & 10 & 784 & 240,000 \\
cmaterdb & image & 10 & 3,072 & 5,000 \\
cifar10 & image & 10 & 3,072 & 50,000 \\
 \hline
 \end{tabular}
\label{tab:dataset_summary}
}
\end{center}
\end{table*}
\vspace{-3mm}
\egroup

\begin{figure*}
\centering

\begin{minipage}[b]{.49\textwidth}
\vspace{-2mm}
\includegraphics[width=0.99\columnwidth]{figures/tau_1_real.pdf}
\vspace{-2mm}
\caption{None of the agents perform significantly better than the MLP baseline in marginal likelihood.}
\label{fig:tau_1_real}
\end{minipage}
\hfill
\begin{minipage}[b]{.49\textwidth}
\vspace{-2mm}
\includegraphics[width=0.99\columnwidth]{figures/tau_10_real.pdf}
\vspace{-2mm}
\caption{The quality of joint predictions on the testbed is highly correlated with performance in real data.}
\label{fig:tau_10_real}
\end{minipage}
\end{figure*}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Real data and Testbed
\subsection{Problem formulation}
\label{sec:real_problem}

% \begin{outline}
% \1 Explain thought process of real data as generative model.
Progress in the field of deep learning has been driven in large part through evaluation on shared, fixed datasets \citep{krizhevsky2012imagenet}.
We repeat the analysis of Section~\ref{sec:neural_testbed} but replace the synthetic data generating process with a collection of datasets drawn from the literature \citep{TFDS}.


% Give a quick rundown of the datasets we include in our analysis
Table~\ref{tab:dataset_summary} outlines the ten datasets we include in our analysis.
We wanted to choose datasets that might provide an analogous challenge to the Neural Testbed and so selected them based on their popularity in the literature and their suitability for training with a 2-layer MLP.
For this reason, large scale challenges such as ImageNet or language modelling, which typically require different classes of models, were not included in our selection \citep{deng2009imagenet}.


% Explain the procedure of how we model these datasets.
To mirror our evaluation in the Neural Testbed we begin with datasets $D^n_{T_n} = ((X_t, Y_{t+1}): t=0,\ldots,T_n-1)$ for $n=1,\ldots,10$.
To evaluate different data regimes we create subsampled datasets $\tilde{D}^n_T$ for $T \in \{10, 100, 1000, T_n\}$.
We then evaluate $\KL^{\tau,\kappa}$ in the `low temperature' limit, taking the labels in the supplied test set as probability 1 or, equivalently, the negative log-likelihood \citep{wen2022predictions}.
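
To spell out the equivalence with the negative log-likelihood, here is a short sketch in which $\hat{P}$ denotes the agent's joint predictive distribution over the $\tau$ test labels and $\delta_{y}$ denotes the point mass on the observed labels $y = y_{1:\tau}$ (both symbols are introduced only for this illustration, and conditioning on the test inputs is suppressed).
Treating the observed labels as having probability 1 gives
\begin{align*}
    \sum_{y'} \delta_{y}(y') \log \frac{\delta_{y}(y')}{\hat{P}(y')} = - \log \hat{P}(y),
\end{align*}
using the convention $0 \log 0 = 0$.
The KL divergence from the point mass to the agent's prediction is therefore the negative log-likelihood of the observed labels under the agent's joint prediction.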

% \1 Take the big sweep of testbed agents and run them on the real data (small scale).
% \end{outline}
As in Section~\ref{sec:testbed_agents}, we evaluate the agents outlined in Table~\ref{tab:agent_summary} across each of these datasets in each data regime.
We then tune the hyperparameters per dataset, per data regime and aggregate the performance by taking the average over all evaluations.
This mirrors the procedure that we applied in Section~\ref{sec:neural_testbed}.
We push full details to Appendix~\ref{app:real_data}.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Joint predictions and prior functions
\subsection{Results}
\label{sec:real_results}

% \begin{outline}
% \1 For $\tau=1$ none of them really do much better after tuning.
% For $\tau=1$ none of them really do much better after tuning.
% See Figure~\ref{fig:tau_1_real}.
% \fillpara
We begin by assessing the quality of the agents' performance in marginal predictions, when averaged over all datasets, for all data regimes.
Figure~\ref{fig:tau_1_real} shows that, once agents are optimized for each setting, the differences between agents are not statistically significant.
This finding mirrors our observation in the case of synthetic data and Figure~\ref{fig:testbed_global_kappa_100D}.
These agents perform similarly at marginal prediction in the testbed, and overall they perform similarly in the real datasets as well.

% \begin{figure}[!ht]
%     \centering
%     \includegraphics[width=0.99\columnwidth]{figures/tau_1_real.pdf}
%     \vspace{-3mm}
%     \caption{After tuning, none of the agents perform significantly better than MLP baseline in marginal likelihood.}
%     \label{fig:tau_1_real}
% \end{figure}


% \1 For $\tau=10$ the best agents are the best on real data too.
Once we consider the quality of \textit{joint} predictions, however, there is a significant difference in the quality of predictive distributions evaluated on real data.
Further, Figure~\ref{fig:tau_10_real} shows that this difference is highly correlated with performance on the Neural Testbed.
Agents that perform better in the setting with synthetic data also tend to perform better when evaluated on real data.
This finding is particularly significant since the differences in $\KL^{10,2}$ are quite large even for these state-of-the-art agents.
These results provide strong indications that the issues observed in sequential decision problems \citep{osband2015bootstrapped} and synthetic data \citep{osband2022neural} can extend to real data.

% \begin{figure}[!ht]
%     \centering
%     \includegraphics[width=0.99\columnwidth]{figures/tau_10_real.pdf}
%     \vspace{-3mm}
%     \caption{After tuning, the quality of joint predictions on the testbed is highly correlated with performance in real data.}
%     \label{fig:tau_10_real}
% \end{figure}


% \1 Other details about hyperparameters and stuff can go to the appendix.
% \end{outline}
Now, in some sense the results we have presented are `non-standard' in that our evaluation includes averages over restricted-data versions of the canonical datasets in Table~\ref{tab:dataset_summary}.
We believe that this is a sensible approach if you are interested in designing learning agents that work in online decision making and are robust to different data regimes.
However, in some supervised learning settings it is more common for practitioners to care only about the `full' datasets with $T=T_n$.
In fact, the findings of Figure~\ref{fig:tau_1_real} and Figure~\ref{fig:tau_10_real} are essentially unchanged when restricting only to the `full data' regime.
That is, the differences in marginal predictions $\tau=1$ are quite minor, but the differences in $\tau=10, \kappa=2$ are extreme.
Further, these differences in joint performance remain highly correlated with agent performance in the testbed.
We push full details to Appendix~\ref{app:real_data}.




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% CONCLUSION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vspace{-1mm}
\section{Conclusion}
\label{sec:conclusion}
\vspace{-2mm}

% \ian{Will return to conclusion once rest of paper is polished.}

% \begin{itemize}
%     \item Joint predictions are really important for AI decisions.
%     \item Highlight the impracticality of evaluating large $\tau$.
%     \item Dyadic sampling as a practical heuristic to address this.
%     \item Show that this clearly matters even in logistic regression.
%     \item Extend the Neural Testbed to high dimensions, when it wouldn't before.
%     \item Extend these results to real data when they wouldn't before.
%     \item Extensive opensource effort (data, agents, ...) and emphasize that value.
% \end{itemize}

% \fillpara

% Good joint predictions are important, dyadic sampling lets you see what is good.
Good predictions are essential for good decisions.
Crucially, the quality of these decisions depends on the quality of \textit{joint} predictions and not just the marginals \citep{wen2022predictions}.
In this paper, we highlight the difficulties in evaluating high-order predictive distributions that are essential for decision making.
We introduce dyadic sampling as a practical heuristic to sidestep the curse of dimensionality.


% We show that this in important in logistic regression.
We motivate dyadic sampling through a simple discrete example, and show that the key insights extend to linear and then nonlinear systems.
We show that the Neural Testbed cannot effectively scale to high dimensions with i.i.d. sampling, but that it can with dyadic sampling.
Importantly, this approach also scales to challenge datasets, and we show that testbed performance is highly correlated with performance on real data.


% Talk about opensource and scaling up
A major contribution of our work is the opensource effort at \github.
This includes all the code used to generate the paper, and helps to provide clear and reproducible benchmarks for the community.
We believe that this paper can provide a stimulating base for future research into agents that make predictions in high-dimensional problems, and drive effective AI systems.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% BIBLIOGRAPHY
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% In the unusual situation where you want a paper to appear in the
% references without citing it in the main text, use \nocite
% \nocite{langley00}

\newpage
\bibliography{references}
% \bibliographystyle{icml2022}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% APPENDIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\appendix
\onecolumn

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% More Detailed Explanations
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Evaluating predictive distributions}
\label{app:theory}


This section contains supplementary material for Section~\ref{sec:theory}.
Importantly, we provide the proof for Proposition~\ref{prop:small_tau} and discuss why dyadic sampling is sufficient for Gaussian processes.

\subsection{Proof for Proposition~\ref{prop:small_tau}}
\label{app:proof_small_tau}

\setcounter{proposition}{0}
\smalltau

\begin{proof}
Note that by definition, $\KL^\tau \geq \KLBAR^\tau$. We now prove that $\KL^\tau \leq \KLBAR^\tau +  O\left( \tau^3/M \right)$. Note that
\begin{equation}
    \KL^\tau = \E \left[ \log \left (\Prob (Y_{1:\tau} | \environment, X_{0:\tau-1}) \right)  \right] - \tau \log \left( \frac{1}{2} \right), \nonumber
\end{equation}
where $\tau \log \left( \frac{1}{2} \right)$ is the log-likelihood under the uniform agent, and
\begin{equation}
\KLBAR^\tau = \E \left[ \log \left (\Prob (Y_{1:\tau} | \environment, X_{0:\tau-1}) \right)  \right] -
\E \left[ \log \left (\Prob (Y_{1:\tau} |  X_{0:\tau-1}) \right)  \right].
\nonumber
\end{equation}
Consequently, we have
\begin{equation}
    \KL^\tau - \KLBAR^\tau  = \tau \log(2) +
    \E \left[ \log \left (\Prob (Y_{1:\tau} | X_{0:\tau-1}) \right) \right].  \nonumber
\end{equation}
We define the event $\mathcal{G}$ as 
\[
\mathcal{G} = \{\textrm{there are no repeated inputs in $X_{0:\tau-1}$} \}.
\]
One key observation is that conditioning on $\mathcal{G}$, the posterior predictive distribution is i.i.d. across inputs, and
\[
\log \left (\Prob (Y_{1:\tau} | X_{0:\tau-1}) \right) = - \tau \log(2)
\]
conditioning on $\mathcal{G}$. Hence 
\begin{eqnarray}
    \E \left[ \log \left (\Prob (Y_{1:\tau} | X_{0:\tau-1}) \right) \right]
    &=& - \Prob(\mathcal{G}) \tau \log(2)
    + \Prob(\bar{\mathcal{G}}) \E \left[ \log \left (\Prob (Y_{1:\tau} | X_{0:\tau-1}) \right) \middle | \bar{\mathcal{G}} \right] \nonumber \\
    &\leq& - \Prob(\mathcal{G}) \tau \log(2) \nonumber
\end{eqnarray}
where $\bar{\mathcal{G}}$ is the complement of $\mathcal{G}$, and the inequality follows from 
$\log \left (\Prob (Y_{1:\tau} | X_{0:\tau-1}) \right) \leq 0$. Hence we have
\[
\KL^\tau - \KLBAR^\tau \leq (1 - \Prob(\mathcal{G})) \tau \log (2) = \Prob (\bar{\mathcal{G}}) \tau \log (2).
\]
Finally, note that 
\begin{align}
\Prob(\mathcal{G}) = \, \prod_{k=1}^{\tau-1} \left ( 1 - \frac{k}{M} \right) 
= \, 1 - \frac{1}{M} \sum_{k=1}^{\tau-1} k + O \left( \frac{1}{M^2} \right ) 
= \, 1 - \frac{\tau (\tau-1)}{2M} + O \left( \frac{1}{M^2} \right ) . \nonumber
\end{align}
Hence $\Prob(\bar{\mathcal{G}}) = O \left(\tau^2/M \right)$ and we have
\[
\KL^\tau - \KLBAR^\tau \leq O \left(\tau^3/M \right).
\]
The conclusion follows from
\[
\KLBAR^\tau \leq \KL^\tau \leq \KLBAR^\tau + O \left(\tau^3/M \right).
\]
\end{proof}
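
To give a sense of scale for this bound, consider a purely illustrative case with $\tau = 10$ and $M = 1000$ (values chosen here only for the arithmetic). Taking natural logarithms,
\[
\Prob(\bar{\mathcal{G}}) = 1 - \prod_{k=1}^{9} \left( 1 - \frac{k}{1000} \right) \approx 0.044 \leq \frac{\tau(\tau-1)}{2M} = 0.045,
\qquad \text{so} \qquad
\KL^\tau - \KLBAR^\tau \leq 0.045 \times 10 \log(2) \approx 0.31,
\]
which is small relative to the uniform agent's log-loss $\tau \log(2) \approx 6.93$. Increasing $\tau$ to $100$ with the same $M$ would make the first-order term $\tau(\tau-1)/(2M) = 4.95$ exceed one, so the approximation no longer gives a useful bound unless $M$ grows accordingly.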

\subsection{Dyadic sampling and Gaussian processes}
\label{app:gaussian_process}

In this section, we discuss why dyadic sampling is sufficient for Gaussian processes (GPs).
In particular, we show that when both the environment $\environment$ and the imagined environment $\hat{\environment}$ of an agent are GPs, then for sufficiently large $\tau$ and under suitable regularity conditions, performing well under $\KL^{\tau, \kappa=2}$ is sufficient to ensure that the posterior distribution of $\environment$ and the agent's belief over $\hat{\environment}$ are close.

Assume that both $\environment$ and $\hat{\environment}$ are GPs with the same finite domain $\mathcal{X}$ and that the training input distribution is uniform over $\mathcal{X}$. Specifically, under the environment $\environment$,
\[
Y_{t+1} = f(X_t) + W_{t+1},
\]
and under the imagined environment 
$\hat{\environment}$,
\[
\hat{Y}_{t+1} = \hat{f}(X_t) + \hat{W}_{t+1},
\]
where the $W_{t+1}$ and $\hat{W}_{t+1}$ are i.i.d. observation noise terms distributed according to $N(0, \sigma^2)$, and 
$f$ and $\hat{f}$ are functions over $\mathcal{X}$. We assume that
$\Prob(f \in \cdot | \data_T) = N(\mu, \Sigma)$ and $\Prob(\hat{f} \in \cdot | \theta_T) = N(\hat{\mu}, \hat{\Sigma})$. Note that by definition
\begin{small}
\begin{align}
\KL^{\tau, \kappa=2} =& \E \left[ \E \left[ \KL \left(P^*_{T+1:T+\tau} \middle \| \hat{P}_{T+1:T+\tau} \right) \middle | X_{T:T+\tau-1} = \tilde{X}_{T:T+\tau-1}^{\kappa=2}\right] \right] \nonumber \\
=& \underbrace{\E \left[ \I \left(\environment ; Y_{T+1:T+\tau} \middle | \data_T, X_{T:T+\tau-1} = \tilde{X}_{T:T+\tau-1}^{\kappa=2} \right) \right]}_{\text{irreducible}} + 
\underbrace{
\E \left[ \E \left[ \KL \left(\overline{P}_{T+1:T+\tau} \middle \| \hat{P}_{T+1:T+\tau} \right) \middle | X_{T:T+\tau-1} = \tilde{X}_{T:T+\tau-1}^{\kappa=2}\right] \right]}_{\KLTILDE^{\tau, \kappa=2}}. \nonumber
\end{align}
\end{small}
Note that the first term in the above equation is irreducible and independent of the agent, hence, performing well under $\KL^{\tau, \kappa=2}$ is equivalent to performing well under $\KLTILDE^{\tau, \kappa=2}$. Under suitable regularity conditions, for sufficiently large $\tau$, we have
\[
 \KLTILDE^{\tau, \kappa=2} \approx \E \left[ 
\KL \left( \Prob \big (f(\tilde{X}_{1:2} ) \in \cdot  | \data_T, \tilde{X}_{1:2}
\big ) \, \middle \| \,
\Prob \big (\hat{f}(\tilde{X}_{1:2} ) \in \cdot  | \theta_T, \tilde{X}_{1:2}
\big )
\right)
\right ],
\]
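where $\tilde{X}_{1:2} = \big ( \tilde{X}_1, \tilde{X}_2 \big)$ and $\tilde{X}_1$ and $\tilde{X}_2$ are sampled i.i.d. from $P_X$. Because $\mathcal{X}$ is finite and the input distribution is uniform, every pair $\tilde{X}_{1:2}$ occurs with positive probability. Thus, if the right-hand side of the above equation is small, then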
\begin{equation}
\KL \left( \Prob \big (f(\tilde{X}_{1:2} ) \in \cdot  | \data_T, \tilde{X}_{1:2}
\big ) \, \middle \| \,
\Prob \big (\hat{f}(\tilde{X}_{1:2} ) \in \cdot  | \theta_T, \tilde{X}_{1:2}
\big )
\right)
\label{app:eq:dyadic_kl}
\end{equation}
is small for all $\tilde{X}_{1:2}$. Let $\mu (\tilde{X}_{1:2}) \in \Re^2$ and $ \Sigma (\tilde{X}_{1:2}) \in \Re^{2 \times 2}$ respectively denote $\mu$ and $\Sigma$ restricted to $\tilde{X}_{1:2}$, and define $\hat{\mu} (\tilde{X}_{1:2})$ and $ \hat{\Sigma} (\tilde{X}_{1:2}) $ similarly. Then we have
\[
f(\tilde{X}_{1:2} ) \sim N \big(\mu (\tilde{X}_{1:2}), \Sigma (\tilde{X}_{1:2}) \big) \quad \text{and} \quad \hat{f} (\tilde{X}_{1:2} ) \sim N \big (\hat{\mu} (\tilde{X}_{1:2}), \hat{\Sigma} (\tilde{X}_{1:2}) \big). 
\]
Consequently, if the KL divergence in equation~\eqref{app:eq:dyadic_kl} is small, then $\mu (\tilde{X}_{1:2})$ is close to $\hat{\mu} (\tilde{X}_{1:2})$ and $\Sigma (\tilde{X}_{1:2})$ is close to $\hat{\Sigma} (\tilde{X}_{1:2})$. Since this holds for all $\tilde{X}_{1:2}$, this further implies that $\mu$ is close to $\hat{\mu}$ and $\Sigma$ is close to $\hat{\Sigma}$. In other words, the posterior distribution of $\environment$ and the agent's belief over $\hat{\environment}$ are close.
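
The last step can be made concrete with the standard closed form for the KL divergence between two Gaussians. As a sketch, write $\mu_2, \Sigma_2, \hat{\mu}_2, \hat{\Sigma}_2$ as shorthand (introduced here only for this illustration) for the means and covariances restricted to a pair $\tilde{X}_{1:2}$, and assume that both $\Sigma_2$ and $\hat{\Sigma}_2$ are invertible. Then
\[
\KL \left( N(\mu_2, \Sigma_2) \, \middle \| \, N(\hat{\mu}_2, \hat{\Sigma}_2) \right)
= \frac{1}{2} \left[ \mathrm{tr} \big( \hat{\Sigma}_2^{-1} \Sigma_2 \big) - 2 - \log \det \big( \hat{\Sigma}_2^{-1} \Sigma_2 \big)
+ (\hat{\mu}_2 - \mu_2)^\top \hat{\Sigma}_2^{-1} (\hat{\mu}_2 - \mu_2) \right].
\]
Both parts of this expression are nonnegative. The quadratic term vanishes only when $\hat{\mu}_2 = \mu_2$. The remaining term equals $\sum_{i=1}^{2} (\lambda_i - 1 - \log \lambda_i)$, where $\lambda_1, \lambda_2 > 0$ are the eigenvalues of $\hat{\Sigma}_2^{-1} \Sigma_2$, and it vanishes only when $\hat{\Sigma}_2 = \Sigma_2$. A small KL divergence therefore forces both the mean mismatch and the covariance mismatch to be small.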



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Logistic Regression
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Logistic regression}
\label{app:logistic}

This appendix provides supplementary details for Section~\ref{sec:logistic_regression}.
We include all of the code necessary to generate Figures~\ref{fig:logistic_tau_scaling} and \ref{fig:logistic_agents} as part of our opensource submission \github.
Results are averaged over 10 random seeds per problem setting.


Figure~\ref{fig:logistic_ratio_tau} provides further insight into the scaling observed in Figure~\ref{fig:logistic_tau_scaling}.
In these plots we show the KL ratio of a perfect \texttt{prior} agent when compared to \texttt{uniform}.
We can see that, for any input dimension, the empirical KL ratio decreases with $\tau$.
However, once the input dimension grows reasonably large ($D=10$), even $\tau = 10{,}000$ is not enough to bring this ratio below 0.5.
We know that, as $\tau \rightarrow \infty$, this ratio will tend to zero for these two agents.
By contrast, dyadic sampling is able to clearly distinguish these agents even for moderate values of $\tau$.

\begin{figure}[!ht]
    \centering
    \includegraphics[width=0.75\columnwidth]{figures/logistic_ratio_tau.pdf}
    \vspace{-3mm}
    \caption{Global input sampling can eventually separate prior samples from uniform, but the required $\tau$ grows exponentially with input dimension. Local $\kappa=2$ sampling can distinguish these agents without exponential $\tau$.}
    \label{fig:logistic_ratio_tau}
\end{figure}

Figure~\ref{fig:logistic_agent_samples} provides some insight into the robustness of Algorithm~\ref{alg:kl-computation} as the number of agent samples varies.
We make use of the \textit{epistemic neural network} notation introduced by \citet{osband2021epistemic}.
We can see that these Monte Carlo estimates converge empirically as we increase the number of samples.
Therefore, for the purposes of our experiments in this section, our choice of $10{,}000$ ENN samples is sufficient.

\begin{figure}[!ht]
    \centering
    \includegraphics[width=0.95\columnwidth]{figures/logistic_agents_samples.pdf}
    \vspace{-3mm}
    \caption{For the agents that we consider, 10,000 ENN samples are sufficient to get reasonable KL estimates across all input dimensions.}
    \label{fig:logistic_agent_samples}
\end{figure}








%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Neural Testbed
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Neural Testbed}
\label{app:neural_testbed}

This appendix provides supplementary details for Section~\ref{sec:neural_testbed}.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Problem formulation.
\subsection{Problem formulation}
\label{app:testbed_problem}

We build on the opensource code of the Neural Testbed \gitpublic.
Our testbed sweep is defined over input dimensions $D \in \{2, 10, 100\}$, 
number of training pairs $T = \lambda D$ for $\lambda \in \{1, 10, 100, 1000\}$,
and temperature $\rho \in \{0.01, 0.1, 0.5\}$, with 5 random seeds in each setting.
We replace the $\KL^{10}$ evaluation with dyadic sampling $\KL^{10, \kappa=2}$.
We release all of our code and implementation at \github.
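
For reference, the size of this sweep follows directly from the ranges above (a simple count, not an additional experimental detail):
\[
\underbrace{3}_{D} \times \underbrace{4}_{\lambda} \times \underbrace{3}_{\rho} = 36 \text{ problem settings}, \qquad 36 \times 5 \text{ seeds} = 180 \text{ problem instances},
\]
with the number of training pairs $T = \lambda D$ ranging from $2$ to $100{,}000$.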


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Agentss
\subsection{Benchmark agents}
\label{app:testbed_agents}

We make use of the benchmark agents introduced in \citet{osband2022neural} and opensourced at \gitpublic. Our testbed includes settings with as few as 2 training pairs (when $D=2$, $\lambda=1$) and as many as 100,000 (when $D=100$, $\lambda=1000$). To improve agent performance across all settings, we allow agents to adjust their number of training steps based on the problem setting. Agent implementations can be found in our open source code under the path \url{/agents/factories}.

We make small alterations to the tuning sweeps proposed in \citet{osband2022neural} in an effort to improve agent performance in high-dimensional problems.
This change strictly improved agent performance, as we only \textit{added} hyperparameter choices and did not restrict them.
Our sweeps can be found in our open source code under the path \url{/agents/factories/sweeps/testbed}, but we highlight the differences that helped to improve agent performance. For the \textbf{\texttt{mlp}}, \textbf{\texttt{ensemble}}, \textbf{\texttt{dropout}}, \textbf{\texttt{bbb}}, \textbf{\texttt{hypermodel}}, and \textbf{\texttt{ensemble+}} agents, performance improves when they are allowed to adjust their default number of training steps based on the problem setting: increasing it by $5\times$ when $\lambda=1000$ and decreasing it by $5\times$ when $\lambda=1$. For the \textbf{\texttt{sgmcmc}} agent, performance improves when it is allowed to increase its prior variance parameter by $2\times$ when $D=100$.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Overall results
\subsection{Overall results}
\label{app:testbed_overall}

Figure~\ref{fig:testbed_global_kappa_100D} provides an overview of the agent performance on the testbed in terms of $\KL$.
These numbers are normalized so that the baseline MLP has a value of 1.
In classification problems it is common to also consider the classification accuracy, or the percentage of inputs for which the agent correctly labels the input.
Figure~\ref{fig:final_acc} confirms that, after tuning, none of the agents perform significantly differently from baseline MLP.


\begin{figure}[!ht]
    \centering
    \includegraphics[width=0.8\columnwidth]{figures/final_acc.pdf}
    \vspace{-3mm}
    \caption{After tuning, none of the agents perform significantly differently from the baseline MLP in terms of classification accuracy.}
    \label{fig:final_acc}
\end{figure}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Agents
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Real data}
\label{app:real_data}

This section provides supplementary details regarding the experiments in Section~\ref{sec:real_data}.
As before, the full implementation is available in our open source code under the path \url{/real_data}.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Results
\subsection{Problem formulation}
\label{app:real_problem}

% Preprocessing
Table~\ref{tab:dataset_summary} outlines the datasets included in our experiments.
For each dataset, we standardize the inputs to have zero mean and unit variance.
Full details are available in our open source code under the path \url{/real_data/utils.py}.


% Temperature
In the testbed we are able to evaluate a wide range of SNR regimes by varying temperature.
This means that we can query a given input $X_t$ multiple times and potentially obtain different class labels $Y_t$.
For these fixed datasets there is only one test set, with a single deterministic label given for each input.
We map this setting to the low-temperature (high-SNR) limit of our testbed.
As such, we evaluate the negative log-likelihood in place of $\KL^\tau$.
This is equivalent to assuming that the underlying world model is deterministic at these test points, and is standard practice in deep learning.
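To spell out this equivalence as a sketch, suppose the true label distribution at a test input $x$ is a point mass $\delta_y$ on the observed label $y$, and let $\hat{P}(\cdot \mid x)$ denote the agent's predictive distribution (both symbols are notation introduced here for illustration). With the convention $0 \log 0 = 0$,
\[
\KL \left( \delta_y \, \middle \| \, \hat{P}(\cdot \mid x) \right)
= \sum_{y'} \delta_y(y') \log \frac{\delta_y(y')}{\hat{P}(y' \mid x)}
= - \log \hat{P}(y \mid x),
\]
which is exactly the negative log-likelihood of the observed label. The same identity holds for joint predictions over $\tau$ inputs, with $y$ replaced by $y_{1:\tau}$ and $x$ by $x_{0:\tau-1}$.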

% Benchmark agents are able to do well
We note that this `high SNR' assumption appears to be reasonable in practice, since for all of the datasets considered in Table~\ref{tab:dataset_summary} the benchmark \textbf{\texttt{mlp}} agent is able to obtain high classification accuracy on held-out data.
This would not be possible if the underlying system were fundamentally stochastic, because of the irreducible error due to chance.



% which evaluates agents over a range of SNR regimes, these datasets are generally all high SNR regime.
% We can see this since the top-performing agents in the literature are able to obtain high levels of classification accuracy on held out data;
% something that is impossible if the underlying system has high levels of noise.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Results
\subsection{Results}
\label{app:real_testbed}

In this section we provide some supplementary results that analyze the performance of our benchmark agents on real data. To allow for separate hyperparameter tuning, we use different sweeps for the testbed and for the real datasets. Our sweeps for real data can be found in our open source code under the path \url{/agents/factories/sweeps/real_data}.


One of the headline results in our paper is Figure~\ref{fig:tau_10_real}, which shows that the quality of joint predictions on the testbed is highly correlated with performance in real data.
Figure~\ref{fig:tau_10_real_high_data} shows that this result is still true when you restrict the evaluation to the `full training data' setting in each dataset.
Further, this aggregate correlation is not driven by just one outlier dataset, but actually occurs in each dataset individually.
In fact, after bootstrapping, only the results on Iris were not significant at the 95\% confidence level.
This gives some additional reassurance that the relationship between joint performance on testbed and real data is robust.

\begin{figure}[!ht]
    \centering
    \includegraphics[width=0.99\columnwidth]{figures/tau_10_real_high_data.pdf}
    \vspace{-3mm}
    \caption{When restricted to the `full training data' setting, the quality of joint predictions on the testbed remains highly correlated with performance in real data.}
    \label{fig:tau_10_real_high_data}
\end{figure}


Our results in this paper allow for hyperparameter tuning separately on the testbed and real datasets.
We believe that this is reasonable, and reflects the way machine learning algorithms are usually used in practice.
However, one natural question might be if tuning an agent's performance on the testbed leads to good hyperparameter settings on real data.
Figure~\ref{fig:agents_hypers_corr} shows the results of this analysis across a wide range of agent-hyperparameter pairs.
Agent-hyperparameter pairs that perform better on the testbed generally also perform better on real data.
This result is statistically significant for both marginal predictions ($\tau=1$) and joint predictions ($\tau=10$ with dyadic sampling).
However, we do see a stronger correlation for joint predictions than for marginals.
So while we do not necessarily recommend tuning your agent for real datasets using the Neural Testbed, these results say that it will provide a better answer on average than random chance.


\begin{figure}[!ht]
    \centering
    \includegraphics[width=0.85\columnwidth]{figures/agents_hypers_corr.pdf}
    \vspace{-3mm}
    \caption{Agent-hyperparameter pairs that perform better on the testbed generally also perform better on real data.}
    \label{fig:agents_hypers_corr}
\end{figure}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%



\end{document}


% This document was modified from the file originally made available by
% Pat Langley and Andrea Danyluk for ICML-2K. This version was created
% by Iain Murray in 2018, and modified by Alexandre Bouchard in
% 2019 and 2021. Previous contributors include Dan Roy, Lise Getoor and Tobias
% Scheffer, which was slightly modified from the 2010 version by
% Thorsten Joachims & Johannes Fuernkranz, slightly modified from the
% 2009 version by Kiri Wagstaff and Sam Roweis's 2008 version, which is
% slightly modified from Prasad Tadepalli's 2007 version which is a
% lightly changed version of the previous year's version by Andrew
% Moore, which was in turn edited from those of Kristian Kersting and
% Codrina Lauth. Alex Smola contributed to the algorithmic style files.


