 \documentclass[accepted]{uai2023} % for initial submission
%\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{abbrvnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{listings}
\usepackage{siunitx,array}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example



\usepackage{hyperref}       % hyperlinks
\usepackage{url}            % simple URL typesetting
\usepackage{booktabs}       % professional-quality tables
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{lipsum}
\usepackage{float}
\usepackage{graphicx}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{amsthm}
\usepackage{amsmath,bm}
\usepackage[noabbrev,capitalize,nameinlink]{cleveref}
\usepackage{multirow}
\usepackage{comment}
\usepackage{notoccite}
\newcommand{\citeall}[1]{\citet{#1} (\citeyear{#1})}
%\bibliographystyle{unsrtnat}
%\usepackage[shortlabels]{enumitem}
\graphicspath{ {./figs/} }

\newtheorem{theorem}{Theorem}[]
\newtheorem{proposition}{Proposition}[]
\newtheorem{claim}{Claim}[]
\newtheorem{lemma}[theorem]{Lemma}

\newcommand{\gb}{\bm{\gamma}}
\newcommand{\Tb}{\bm{\Theta}}
\newcommand{\tb}{\bm{\theta}}
\newcommand{\Jb}{\textbf{J}}
\newcommand{\Ib}{\textbf{I}}
\newcommand{\intd}{\text{d}}
\newcommand{\bb}{\textbf{b}}
\newcommand{\zb}{\textbf{z}}
\newcommand{\xb}{\mathbf{x}}
\newcommand{\Xb}{\textbf{X}}
\newcommand{\yb}{\textbf{y}}
\newcommand{\fb}{\textbf{f}}

\DeclareMathOperator*{\argmin}{\arg\!\min}
\DeclareMathOperator*{\argmax}{\arg\!\max}
\newcommand{\N}{\mathcal{N}}
\newcommand{\Ha}{\langle H \rangle}
\newcommand{\Hh}{\hat{H}}
\newcommand{\Uh}{\hat{U}}
\newcommand{\HM}{\langle H \rangle_{M}}
\newcommand{\LE}{\mathcal{L}_E}
\newcommand{\LT}{\mathcal{L}_T}
\newcommand{\K}{\mathcal{K}}
\newcommand{\Sc}{\mathcal{S}}
\newcommand{\Ub}{\mathbf{U}}
\newcommand{\rhoh}{\hat{\rho}}
\newcommand{\psib}{\boldsymbol{\psi}}
\newcommand{\rhob}{\boldsymbol{\rho}}
\newcommand{\thetab}{\boldsymbol{\theta}}
\newcommand{\bigO}{\mathcal{O}}

\newcommand{\STAB}[1]{\begin{tabular}{@{}c@{}}#1\end{tabular}}

\title{On the Role of Model Uncertainties in Bayesian Optimization}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1, *]{\href{mailto:<jonf@dtu.dk>?Subject=Your UAI 2023 paper}{Jonathan Foldager}{}}
\author[1,*]{Mikkel Jordahn}
\author[1]{Lars Kai Hansen}
\author[1]{Michael Riis Andersen}
% Add affiliations after the authors
\affil[1]{%
Department of Applied Mathematics and Computer Science, Technical University of Denmark
}\affil[*]{%
Shared first authorship.
}
  
  \begin{document}
\maketitle

\begin{abstract}
Bayesian Optimization (BO) is a popular method for black-box optimization, which relies on uncertainty as part of its decision-making process when deciding which experiment to perform next. However, not much work has addressed the effect of uncertainty on the performance of the BO algorithm and to what extent calibrated uncertainties improve the ability to find the global optimum. In this work, we provide an extensive study of the relationship between the BO performance (regret) and uncertainty calibration for popular surrogate models and acquisition functions, and compare them across both synthetic and real-world experiments. Our results show that Gaussian Processes, and more surprisingly, Deep Ensembles are strong surrogate models. Our results further show a positive association between calibration error and regret, but interestingly, this association disappears  when we control for the type of surrogate model in the analysis. We also study the effect of recalibration and demonstrate that it generally does not lead to improved regret. Finally, we provide theoretical justification for why uncertainty calibration might be difficult to combine with BO due to the small sample sizes commonly used.
\end{abstract}




\section{Introduction}
Probabilistic machine learning provides a framework in which it is possible to reason about uncertainty for both models and predictions \citep{ghahramani2015probabilistic}. 
It is often argued that especially in high-stakes applications (healthcare, robotics, etc.), uncertainty estimates for decisions/predictions should be a central component and that they should be well-calibrated \citep{kuleshov2022calibrated}. 
The intuition behind calibration is that the uncertainty estimates should accurately reflect reality; for example, if a classification model predicts an 80\% probability of belonging to class $A$ on 10 datapoints, then (on average) we would expect 8 of those 10 samples to actually belong to class $A$. 
Likewise -- but less intuitively -- in regression, if a calibrated model generates a prediction $\mu$ and standard deviation $\sigma$, we would expect to see $p$ percent of the data lying inside a $p$ percent confidence interval around $\mu$ \citep{busk2021calibrated}. 


Uncertainty also plays a central role in Bayesian Optimization (BO) \citep{snoek2012practical}, which will be the focus of this paper. As a sequential design strategy for global optimization, BO has several applications with perhaps the most popular ones being general experimental design \citep{shahriari2015taking} and model selection for machine learning models \citep{bergstra2011algorithms}. 

BO is most often used when the objective function is expensive (e.g. in terms of money or time) or unethical to evaluate, gradients between in- and outputs are not available, noisy, and/or data acquisition is limited to a few training samples \citep{agnihotri2020exploring}. 
A BO protocol works by iteratively fitting a probabilistic surrogate model to observed values of an objective function, and using a so-called acquisition function (AF) based on the surrogate model, to select where to query the objective function next. 
In AFs, there is an inherent trade-off between exploring input areas in which the surrogate model is uncertain of the underlying objective function, and exploiting areas where the surrogate model already knows that the objective value is close to optimal.
As such, it seems obvious that in order for this exploration-exploitation trade-off to be good, the probabilistic model must be well-calibrated.
It is, however, still not well-described how much calibration actually affects BO procedures.
One could imagine that if calibration leads to a better model representation of the underlying objective function, as would be the general intuition, it would be natural to expect that improving calibration via so-called \textit{recalibration} \citep{kuleshov2018accurate} will aid in finding the global optimum of that same function.



\subsection{Our Contribution}
In this paper, we set out to investigate how the model uncertainties affect BO performance from both numerical and theoretical perspectives. Our work is highly motivated by the general intuition and understanding in the community that BO surrogate models with better / well-calibrated uncertainty estimates will perform better (i.e. reach better final and/or total regret). In particular, our paper is concerned with studying statements such as ``BO crucially relying on calibrated uncertainty estimates'' \citep{springenberg2016bayesian} and claims of methods performing worse ``due to their frequentist uncertainty estimates'' \citep{deshwal2021bayesian}. But how well-calibrated do we need to be in order to achieve good BO performance? To investigate this question, we provide four major contributions:

\begin{itemize}
    \item An extensive study of commonly used surrogate models and acquisition functions, where we study the resulting calibration errors and regrets to assess the relationship between calibration and regret. This includes an intervention study, where we manipulate model calibration and study the effect on regret. 
    \item We show that Deep Ensembles are superior for hyperparameter tuning using BO.
    \item An investigation of whether recalibration during the BO protocol leads to better BO performance.  

    \item Numerical and theoretical results to substantiate a discussion on the role of calibration in BO, especially on the relationship between the number of recalibration samples and the variance of the calibration curve.
\end{itemize} 




\subsection{Related Work}
A great deal of work has been carried out on uncertainty calibration for regression models \citep{kuleshov2018accurate,song2019distribution,ovadia2019can,busk2021calibrated,nado2021uncertainty}, and the uncertainty toolbox \citep{chung2021uncertainty} makes it easy to assess the calibration level of various models. In recent work by \citet{deshpande2021calibration}, a procedure for calibrating Gaussian processes (GPs) during BO was proposed. Given the small sample sizes available in BO, the idea is to use leave-one-out cross-validation and utilize the calibration algorithm proposed in earlier work by \citet{kuleshov2018accurate}. We note that potential issues might arise from this procedure, as \citet{kuleshov2018accurate} state multiple times that their approach produces calibrated forecasts ``\textit{given enough i.i.d. data}''. However, the data available during BO is rarely large or independent and identically distributed (i.i.d.), and the goal of our work is to dive deeper into this. Other research on the role of uncertainty calibration includes the work by \citet{bliznyuk2008bayesian}, where the authors propose a way of using Markov Chain Monte Carlo (MCMC) to get calibrated predictions for GPs. \citet{belakaria2020uncertainty} investigate uncertainty-aware multi-objective (multidimensional output) BO and argue that, due to their uncertainty-incorporating strategy, their model outperforms state-of-the-art procedures.

\begin{table*}[ht!]
\begin{center}
\caption{BO results for experiments with synthetic data. For each of the surrogate and acquisition pairs here, we ran a total of 128 optimization problems, where each problem is repeated with 20 different seeds. For each pair, we report the mean of all $128\cdot20 = 2560$ runs and the standard error of the mean for all metrics. The instantaneous and total regret metrics are computed using eq. \eqref{eq:instant_regret} and \eqref{eq:total_regret}, respectively. ECE is the expected calibration error and is computed using eq. \eqref{eq:cal_error} and sharpness denotes the negative entropy of the predictive distributions. Rows with Acquisition=Average (AVG) correspond to an average over all three acquisition strategies (EI, UCB, TS), but excluding random sampling (RS). Best performing configurations in each of the three sections (i.e. RS, EI+UCB+TS, AVG) are reported in bold font.}
\begin{tabular}{llrrrr}
\toprule
Surrogate & Acquisition &    Inst. Regret &     Total Regret & ECE &        Sharpness \\
\midrule
\midrule
       GP &          RS & \textbf{0.496 $\pm$ 0.018} & \textbf{67.117 $\pm$ 2.155} &     \textbf{0.005 $\pm$ 0.000} & -0.183 $\pm$ 0.012 \\
       DE &          RS & 0.508 $\pm$ 0.019 & 67.345 $\pm$ 2.194 &     0.011 $\pm$ 0.000 &   0.030 $\pm$ 0.007 \\
       RF &          RS & 0.511 $\pm$ 0.018 &  67.920 $\pm$ 2.205 &     0.006 $\pm$ 0.000 & -0.478 $\pm$ 0.016 \\
      BNN Small &    RS & 0.519 $\pm$ 0.019 &  67.990 $\pm$ 2.199 &   0.088 $\pm$ 0.001 &  1.253 $\pm$ 0.008 \\
      BNN &    RS & 0.509 $\pm$ 0.018 &  67.489 $\pm$ 2.165 &   0.105 $\pm$ 0.001 &  3.241 $\pm$ 0.000 \\
\midrule
\midrule
       GP &          EI & 0.036 $\pm$ 0.001 & 13.214 $\pm$ 0.325 &     0.016 $\pm$ 0.000 & -0.224 $\pm$ 0.012 \\
       DE &          EI & 0.043 $\pm$ 0.002 & 21.714 $\pm$ 0.524 &   0.029 $\pm$ 0.001 & -0.353 $\pm$ 0.009 \\
       RF &          EI & 0.099 $\pm$ 0.004 & 33.511 $\pm$ 0.994 &     0.025 $\pm$ 0.000 & -0.386 $\pm$ 0.016 \\
      BNN Small &          EI & 0.848 $\pm$ 0.026 & 91.221 $\pm$ 2.719 &   0.113 $\pm$ 0.001 &  0.602 $\pm$ 0.008 \\
      BNN &    EI & 0.755 $\pm$ 0.024 &  87.944 $\pm$ 2.620 &   0.110 $\pm$ 0.001 &  3.221 $\pm$ 0.000 \\
\midrule
       GP &         UCB & \textbf{0.027 $\pm$ 0.001} & \textbf{12.829 $\pm$ 0.328} &     0.017 $\pm$ 0.000 & -0.322 $\pm$ 0.012 \\
       DE &         UCB & 0.046 $\pm$ 0.002 & 21.148 $\pm$ 0.508 &   0.028 $\pm$ 0.001 & -0.375 $\pm$ 0.009 \\
       RF &         UCB & 0.081 $\pm$ 0.003 & 31.173 $\pm$ 0.945 &     0.025 $\pm$ 0.000 & -0.404 $\pm$ 0.016 \\
      BNN Small &         UCB &  0.480 $\pm$ 0.016 &  64.604 $\pm$ 1.830 &   0.097 $\pm$ 0.001 &  0.861 $\pm$ 0.007 \\
      BNN &    UCB & 0.734 $\pm$ 0.023 &  86.777 $\pm$ 2.595 &   0.110 $\pm$ 0.001 &  3.221 $\pm$ 0.000 \\
\midrule
       GP &          TS & 0.041 $\pm$ 0.003 & 28.729 $\pm$ 1.044 &      \textbf{0.010 $\pm$ 0.000} & -0.436 $\pm$ 0.011 \\
       DE &          TS & 0.042 $\pm$ 0.002 & 22.116 $\pm$ 0.508 &   0.027 $\pm$ 0.001 & -0.333 $\pm$ 0.009 \\
       RF &          TS & 0.279 $\pm$ 0.013 & 51.166 $\pm$ 1.783 &     0.013 $\pm$ 0.000 & -0.451 $\pm$ 0.015 \\
      BNN Small &          TS & 0.628 $\pm$ 0.021 &  76.086 $\pm$ 2.330 &   0.091 $\pm$ 0.001 &  0.997 $\pm$ 0.007 \\
      BNN &    TS & 0.519 $\pm$ 0.019 &  68.111 $\pm$ 2.225 &   0.105 $\pm$ 0.001 &  3.242 $\pm$ 0.000 \\
\midrule
\midrule
       GP &         AVG & \textbf{0.035 $\pm$ 0.001} &  \textbf{18.257 $\pm$ 0.390} &     \textbf{0.015 $\pm$ 0.000} & -0.327 $\pm$ 0.007 \\
       DE &         AVG & 0.044 $\pm$ 0.001 & 21.659 $\pm$ 0.296 &     0.028 $\pm$ 0.000 & -0.354 $\pm$ 0.005 \\
       RF &         AVG & 0.153 $\pm$ 0.005 & 38.616 $\pm$ 0.757 &     0.021 $\pm$ 0.000 & -0.414 $\pm$ 0.009 \\
      BNN Small &         AVG & 0.652 $\pm$ 0.013 & 77.303 $\pm$ 1.346 &     0.100 $\pm$ 0.001 &   0.820 $\pm$ 0.005 \\
      BNN &         AVG & 0.669 $\pm$ 0.013 & 80.944 $\pm$ 1.439 &     0.108 $\pm$ 0.000 &   3.228 $\pm$ 0.000 \\
\bottomrule
\end{tabular}
    \label{tab:synth_results_table}
\end{center}
\end{table*}






\section{Background}
Bayesian Optimization (BO) is concerned with the optimization task of finding the global minimum $\xb^*=[x_1^*,x_2^*,...,x_D^*]^\top$ of some objective function $f(\xb)$, where $\xb$ is a $D$-dimensional vector, i.e.
\begin{equation}
    \xb^* = \argmin_{\xb} f(\xb).
\end{equation}
We assume that the optimization objective $f(\xb) \in \mathbb{R}$ is contaminated with noise, i.e. we observe $y(\xb) = f(\xb) + \epsilon$, where $\epsilon$ is additive noise often assumed to follow an isotropic normal distribution. 
In many scenarios such as hyperparameter tuning of neural networks, the input variables $\xb$ are rarely all real-valued, and often no closed-form expression for $f$ exists. Hence, BO is well-suited when $f$ is a so-called ``black-box'' function \citep{turner2021bayesian}. 
At least two crucial decisions are to be made when using BO in practice: 1) the choice of surrogate model, which is to learn the underlying objective function $f$, and 2) the acquisition function (AF), which controls the strategy for deciding which input $\xb$ to sequentially pick by maximizing the AF. Popular choices for surrogate models include Gaussian Processes (GPs) \citep{rasmussen2003gaussian, snoek2012practical} and Random Forests (RFs) \citep{bergstra2011algorithms}, but any model with a probabilistic interpretation, e.g. Deep Ensembles (DEs) \citep{lakshminarayanan2017simple} or mean-field Bayesian Neural Networks (BNNs) \citep{springenberg2016bayesian}, can be used.

\paragraph{Acquisition Functions} For the choice of AF, \textit{Expected Improvement} (EI) as proposed by \citet{Jones1998-cu} is often used and is defined as follows:
\begin{equation}
            \text{EI}(\xb) =
            (\mu(\xb) - f(\xb^+))\Phi(Z)+\sigma(\xb)\phi(Z), \label{eq:ei}
\end{equation}
if $\sigma(\xb) > 0$, and $\text{EI}(\xb) = 0$ otherwise, with $Z(\xb) = \frac{\mu(\xb) - f(\xb^+)}{\sigma(\xb)}$, where $\mu(\xb)$ and $\sigma(\xb)$ denote the mean and standard deviation, respectively, of the surrogate function at $\xb$, $f(\xb^+)$ denotes the best function value observed so far, and $\Phi$ and $\phi$ denote the cumulative distribution function (CDF) and probability density function (PDF) of a standard normal distribution, respectively. Another popular AF is the \textit{Upper Confidence Bound} (UCB), proposed by \citet{Srinivas_2012}, which is defined as:
\begin{equation}
    \text{UCB}(\xb) = -\mu(\xb)+\beta^{1/2}\sigma(\xb),
\end{equation}

for minimization problems, where $\mu(\xb)$ and $\sigma(\xb)$ once again denote the mean and standard deviation of the surrogate function at $\xb$ and $\beta$ is a hyperparameter controlling the trade-off between exploitation and exploration. Finally, the acquisition strategy coined \textit{Thompson Sampling} \citep{Thompson1933ONTL} works by generating a random sample from the posterior of $f$ and then locating the optimal value for the specific sample, i.e. for some sample $f(\xb) \sim p(f|\text{Data})$
\begin{equation}
    \text{TS}(\xb) = -f(\xb).
\end{equation}

For GPs and BNNs this is done by sampling a function from the posterior, whilst for DEs and RFs we sample a neural network or tree, respectively \citep{DBLP:journals/corr/ElmachtoubMOP17}.
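
As a small worked illustration of how these AFs trade off exploitation and exploration (the numbers are hypothetical and chosen only for illustration), consider two candidate inputs $\xb_a$ and $\xb_b$ with $\mu(\xb_a) = 0.2$, $\sigma(\xb_a) = 0.1$ and $\mu(\xb_b) = 0.5$, $\sigma(\xb_b) = 0.6$. With $\beta = 1$,
\begin{align*}
    \text{UCB}(\xb_a) &= -0.2 + 1 \cdot 0.1 = -0.1, \\
    \text{UCB}(\xb_b) &= -0.5 + 1 \cdot 0.6 = \phantom{-}0.1,
\end{align*}
so maximizing the UCB selects the more uncertain point $\xb_b$ even though its predicted mean is worse, whereas $\beta = 0$ reduces to pure exploitation and selects $\xb_a$. Similarly, at any point where $\mu(\xb) = f(\xb^+)$ we have $Z = 0$, and eq. \eqref{eq:ei} reduces to $\text{EI}(\xb) = \sigma(\xb)\phi(0) = \sigma(\xb)/\sqrt{2\pi} \approx 0.399\,\sigma(\xb)$, i.e. the acquisition value is driven entirely by the predictive standard deviation. In both cases, an over- or underestimated $\sigma(\xb)$ directly changes which point is queried next.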



\begin{table*}[ht!]
\begin{center}
\caption{BO results for hyperparameter tuning experiments.  For each of the surrogate and acquisition pairs here, we ran a total of 6 optimization problems, where each problem is repeated with 100 different seeds. For each pair, we report the mean of all $6\cdot100 = 600$ runs and the standard error of the mean for all metrics. The instantaneous and total regret metrics are computed using eq. \eqref{eq:instant_regret} and \eqref{eq:total_regret}, respectively. ECE is the expected calibration error and is computed using eq. \eqref{eq:cal_error} and sharpness denotes the negative entropy of the predictive distributions. Rows with Acquisition=Average (AVG) correspond to an average over all three acquisition strategies (EI, UCB, TS), but excluding random sampling (RS). Best performing configurations in each of the three sections (i.e. RS, EI+UCB+TS, AVG) are reported in bold font.}
\begin{tabular}{llrrrr}
\toprule
Surrogate & Acquisition &      Inst. Regret &      Total Regret &               ECE &          Sharpness \\
\midrule
\midrule
   GP &   RS & 0.0151 $\pm$ 0.0006 & 2.7021 $\pm$ 0.0995 & \textbf{0.0055 $\pm$ 0.0001} & -0.7762 $\pm$ 0.0138 \\
   DE &   RS & 0.0161 $\pm$ 0.0007 & 2.7822 $\pm$ 0.1033 & 0.0093 $\pm$ 0.0001 & -0.2574 $\pm$ 0.0134 \\
   RF &   RS & 0.0152 $\pm$ 0.0007 & 2.6977 $\pm$ 0.1018 & 0.0072 $\pm$ 0.0002 &  1.0302 $\pm$ 0.1017 \\
  BNN Small &   RS &  \textbf{0.0150 $\pm$ 0.0007} & \textbf{2.5948 $\pm$ 0.0942} & 0.1015 $\pm$ 0.0005 &  1.3499 $\pm$ 0.0102 \\
  BNN &   RS &  0.0154 $\pm$ 0.0007 & 2.7820 $\pm$ 0.1009 & 0.1075 $\pm$ 0.0005 &  3.2391 $\pm$ 0.0003 \\
\midrule
\midrule
   GP &   EI & 0.0031 $\pm$ 0.0002 & 1.5375 $\pm$ 0.0565 & 0.0153 $\pm$ 0.0004 & -0.5433 $\pm$ 0.0155 \\
   DE &   EI & \textbf{0.0011 $\pm$ 0.0001} & \textbf{0.9031 $\pm$ 0.0436} &  0.0363 $\pm$ 0.0010 & -0.2927 $\pm$ 0.0096 \\
   RF &   EI & 0.0043 $\pm$ 0.0003 & 1.0925 $\pm$ 0.0459 & 0.0146 $\pm$ 0.0004 &  0.8718 $\pm$ 0.0761 \\
  BNN Small &   EI & 0.0332 $\pm$ 0.0018 &  4.8430 $\pm$ 0.2239 & 0.1052 $\pm$ 0.0007 &  0.7928 $\pm$ 0.0136 \\
  BNN &   EI &  0.0170 $\pm$ 0.0009 & 3.1505 $\pm$ 0.1328 & 0.1092 $\pm$ 0.0005 &  3.2247 $\pm$ 0.0004 \\
\midrule
   GP &  UCB & 0.0026 $\pm$ 0.0002 &  1.5156 $\pm$ 0.0560 & 0.0149 $\pm$ 0.0004 & -0.5297 $\pm$ 0.0154 \\
   DE &  UCB & 0.0012 $\pm$ 0.0001 & 0.9159 $\pm$ 0.0437 & 0.0369 $\pm$ 0.0009 & -0.2862 $\pm$ 0.0098 \\
   RF &  UCB & 0.0043 $\pm$ 0.0002 & 1.0979 $\pm$ 0.0455 & 0.0157 $\pm$ 0.0004 &  0.9205 $\pm$ 0.0779 \\
  BNN Small &  UCB & 0.0104 $\pm$ 0.0007 & 2.6292 $\pm$ 0.1176 & 0.1013 $\pm$ 0.0006 &  1.0458 $\pm$ 0.0088 \\
  BNN &   UCB &  0.0152 $\pm$ 0.0008 & 3.1068 $\pm$ 0.1300 & 0.1093 $\pm$ 0.0005 &  3.2244 $\pm$ 0.0004 \\
\midrule
   GP &   TS & 0.0046 $\pm$ 0.0003 & 1.7544 $\pm$ 0.0643 & 0.0125 $\pm$ 0.0003 & -0.5814 $\pm$ 0.0173 \\
   DE &   TS & 0.0016 $\pm$ 0.0002 & 1.0321 $\pm$ 0.0489 & 0.0364 $\pm$ 0.0009 &   -0.2522 $\pm$ 0.0100 \\
   RF &   TS & 0.0017 $\pm$ 0.0002 & 1.3192 $\pm$ 0.0497 & \textbf{0.0101 $\pm$ 0.0002} &  0.8893 $\pm$ 0.0859 \\
  BNN Small &   TS & 0.0176 $\pm$ 0.0009 &   2.9900 $\pm$ 0.1231 & 0.1025 $\pm$ 0.0005 &  1.0644 $\pm$ 0.0091 \\
  BNN &   TS &  0.0150 $\pm$ 0.0007 & 2.6796 $\pm$ 0.0988 & 0.1075 $\pm$ 0.0005 &  3.2405 $\pm$ 0.0003 \\
\midrule
\midrule
   GP &  AVG & 0.0034 $\pm$ 0.0001 & 1.6025 $\pm$ 0.0342 & 0.0142 $\pm$ 0.0002 & -0.5515 $\pm$ 0.0093 \\
   DE &  AVG & \textbf{0.0013 $\pm$ 0.0001} & \textbf{0.9504 $\pm$ 0.0263} & 0.0365 $\pm$ 0.0005 &  -0.2770 $\pm$ 0.0057 \\
   RF &  AVG & 0.0034 $\pm$ 0.0001 & 1.1699 $\pm$ 0.0273 & \textbf{0.0135 $\pm$ 0.0002} &  0.8939 $\pm$ 0.0462 \\
  BNN Small &  AVG & 0.0204 $\pm$ 0.0007 & 3.4874 $\pm$ 0.0965 &  0.1030 $\pm$ 0.0003 &  0.9676 $\pm$ 0.0069 \\
  BNN &   AVG &  0.0157 $\pm$ 0.0005 & 2.9790 $\pm$ 0.0703 & 0.1087 $\pm$ 0.0003 &  3.2299 $\pm$ 0.0003 \\
\midrule
\midrule
     GP (recal.) &  AVG &  0.0060 $\pm$ 0.0002 &   1.8416 $\pm$ 0.0400 & 0.0149 $\pm$ 0.0002 & -0.6552 $\pm$ 0.0058 \\
   DE (recal.) &  AVG & \textbf{0.0019 $\pm$ 0.0001} &  \textbf{1.1468 $\pm$ 0.0320} & 0.0418 $\pm$ 0.0005 & -0.3123 $\pm$ 0.0042 \\
   RF (recal.) &  AVG & 0.0029 $\pm$ 0.0001 & 1.1907 $\pm$ 0.0292 & \textbf{0.0112 $\pm$ 0.0001} &   -0.5700 $\pm$ 0.0047 \\
  BNN Small (recal.) &  AVG & 0.0383 $\pm$ 0.0013 & 4.9472 $\pm$ 0.1458 & 0.0937 $\pm$ 0.0003 &  0.7728 $\pm$ 0.0136 \\
  BNN (recal.) &   AVG &  0.0157 $\pm$ 0.0005 & 3.0210 $\pm$ 0.0721 & 0.1071 $\pm$ 0.0003 &  3.1546 $\pm$ 0.0165 \\
\bottomrule
\end{tabular}
    \label{tab:real_results}
\end{center}
\end{table*}

\paragraph{Calibration}
Following the work by \citet{kuleshov2018accurate}, a regression model is well-calibrated if approximately $q$ percent of the time test samples fall inside a $q$ percent confidence interval of the predictive distribution. For regression tasks, the model calibration can be assessed using the expected calibration error
\begin{align} 
    \text{ECE} = \sum_p w_p (C_y(p) - p)^2,
\end{align}
where $C_y(p)$ is defined as
\begin{equation}\label{eq:Cy}
    C_y(p) = \frac{1}{N_T} \sum_{t=1}^{N_T} \mathbb{I} [ y_t  \leq F_t^{-1}(p) ],
\end{equation}
where $F_t^{-1}$ is the quantile function, i.e. $F_t^{-1}(p)  \equiv \inf\limits_y \{ y \: | \: p \leq F_t(y) \},$ for the $t$'th datapoint evaluated at percentile $p$, $\mathbb{I}$ is an indicator function and $w_p$ can be chosen to adjust the importance of percentiles with fewer datapoints. Throughout this paper, we assume $w_p = 1 \,\, \forall \,p$. The closer the ECE is to zero, the better calibrated the model is. 
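
To illustrate these definitions with a toy example (hypothetical values, not taken from our experiments), note that for a continuous, strictly increasing predictive CDF the event $y_t \leq F_t^{-1}(p)$ is equivalent to $F_t(y_t) \leq p$. Suppose an overconfident model yields $F_t(y_t) \in \{0.02, 0.05, 0.95, 0.99\}$ on $N_T = 4$ test points, i.e. all observations fall in the tails of the predictive distributions. Evaluating eq. \eqref{eq:Cy} at $p \in \{0.25, 0.5, 0.75\}$ gives
\begin{equation*}
    C_y(0.25) = C_y(0.5) = C_y(0.75) = \tfrac{2}{4} = 0.5,
\end{equation*}
and hence, with $w_p = 1$,
\begin{equation*}
    \text{ECE} = (0.5 - 0.25)^2 + (0.5 - 0.5)^2 + (0.5 - 0.75)^2 = 0.125.
\end{equation*}
For a calibrated model and a sufficiently large test set, we would instead expect $C_y(p) \approx p$ for all $p$ and thus an ECE close to zero.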

\paragraph{Recalibration}
\citet{kuleshov2018accurate} also propose a general procedure for recalibrating any surrogate model. A so-called recalibrator model $C$ is constructed using an independent and identically distributed (i.i.d.) validation set and subsequently applied to adjust the CDF of the model's predictive distribution $F_t$ for some observation $y_t$, i.e. the recalibrated predictive distribution is $C \circ F_t$. This is done by learning an isotonic mapping $C: \left[0, 1\right] \rightarrow \left[0, 1\right]$ from the predicted probabilities of events of the form $\left(-\infty, y_t\right]$ to the corresponding empirical probabilities. \citet{deshpande2021calibration} propose a recalibration method specifically for BO, in which the recalibrator model is learnt via leave-one-out cross-validation (CV) on the samples gathered during BO. After training the recalibrator model $C$, the relevant summary statistics (e.g. moments and intervals) of the recalibrated distributions can be computed numerically from $C \circ F_t$. See Alg. 1 in \citet{kuleshov2018accurate} for more details.
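
As a minimal sketch of the recalibration step (a toy illustration with hypothetical values, based on our reading of Alg. 1 in \citet{kuleshov2018accurate}), suppose the model's predictive CDFs evaluated at $N = 4$ validation observations are $F_t(y_t) \in \{0.02, 0.05, 0.95, 0.99\}$. The recalibrator is fitted to the pairs $\big(F_t(y_t), \hat{P}(F_t(y_t))\big)$, where $\hat{P}(p) = \frac{1}{N}\left|\{ y_t : F_t(y_t) \leq p \}\right|$ denotes the empirical frequency, giving
\begin{equation*}
    \{(0.02, 0.25),\; (0.05, 0.50),\; (0.95, 0.75),\; (0.99, 1.00)\}.
\end{equation*}
These pairs are already monotone, so an isotonic fit passes through them, e.g. $C(0.05) = 0.50$: the region that the model labels as its lower $5\%$ tail in fact contains half of the observations, and $C \circ F_t$ widens the predictive distribution accordingly. Note that with only $N = 4$ samples, $C$ can only take values in multiples of $1/4$ at the fitted points, which hints at why small sample sizes can make the estimated recalibrator noisy.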








\section{Experiments}
In this section, we describe a collection of numerical experiments designed to study and investigate the relationship between calibration and regret. We focus our study on four popular models, namely GPs, RFs, DEs, and BNNs. For GPs, DEs, and BNNs, we assume an isotropic Gaussian likelihood and for RFs, we impose a Gaussian predictive distribution, where the mean and variance are estimated as the sample mean and variance of the tree predictions. Our experiments are based on both synthetic and real-world data: for experiments with synthetic data, we use a number of problems from the common optimization benchmark suite Sigopt \citep{jamil2013literature, dewancker2016stratified}, and for the real-world data, we apply BO to hyperparameter tuning of various machine learning models, including feed-forward Neural Networks, Convolutional Neural Networks and SVMs, used on one or more datasets such as MNIST \citep{mnist}, Fashion-MNIST \citep{fashionmnist}, AG News classification \citep{AG_news} and Wine classification \citep{wine_data}. For all experimental details, see the Supplementary Material.


\begin{figure*}[ht!]
    \begin{subfigure}[b]{0.49\textwidth}
    \includegraphics[width=\textwidth]{figs/synth_tot_regret_vs_calib_seed.pdf}
    \caption{Test calibration vs regret of synthetic data experiments. }
    \end{subfigure}
    \begin{subfigure}[b]{0.49\textwidth}
    \includegraphics[width=\textwidth]{figs/tot_regret_vs_calib_seed.pdf}
    \caption{Test calibration vs regret of real data experiments}
    \end{subfigure}
    \caption{Total Regret vs. ECE for synthetic data experiments and hyperparameter tuning experiments.  The colors in the scatter plot indicate the type of surrogate model, and the marker indicates the AF used. \textbf{Note:} Each point in the scatter plots corresponds to an average of 20 seeds in the synthetic data experiments and 100 seeds in the hyperparameter tuning experiments for each specific configuration.}
    \label{fig:regret-calibration-correlation}

\end{figure*}

\paragraph{Experimental Setup}
In the synthetic setting, we perform BO experiments on a total of 128 optimization problems spanning input dimensions $D \in \{1, 2, \dots, 10\}$ from the Sigopt benchmark. For each optimization problem, we repeat the experiment 20 times, each time using a different random seed and hence a different random initialization of the BO routine. We do this for all combinations of surrogates and AFs, of which we use the previously mentioned EI, UCB and TS. We consistently use ten initial i.i.d. random samples followed by 90 BO iterations for all experiments. We add Gaussian distributed noise giving a signal-to-noise ratio (SNR) of 100 to all Sigopt objective functions. For reference, we also include a random sampling (RS) acquisition function.
%
In the hyperparameter tuning setting, we perform BO experiments on a total of 6 different hyperparameter tuning problems. The surrogate models and AFs are the same as in the synthetic setting, and we similarly sample 10 i.i.d. points to initiate the BO session, and then run 90 BO iterations. Here we run each experiment 100 times.

Our key performance metrics are regret, calibration error and sharpness as defined in the following.
We report the calibration error, ECE, as being the mean squared calibration error evaluated on a large i.i.d. test set ($N_{\text{test}} = 5000$) as
\begin{equation}\label{eq:cal_error}
    \text{ECE} = \frac{1}{P} \sum_{j=1}^P (C_y(p_j) - p_j)^2,
\end{equation}
where $C_y(p_j)$ is defined in eq. \eqref{eq:Cy} and for $0\leq p_1 \leq p_2 \leq \dots \leq p_P \leq 1$ as suggested by \citet{kuleshov2018accurate}. We use $P=20$ with equidistant $p_j$ values. The ECE values are reported as averages across all BO iterations.
We quantify the BO performance using the regret metric, where we define the instantaneous regret for the last iteration $T$ as follows
\begin{align} \label{eq:instant_regret}
    \mathcal{R}_I = y(x^*_T) - y_{\text{min}},
\end{align}
where $y(x)$ is the objective function value (with added noise in the synthetic case, i.e. $y(x)=f(x)+\epsilon$) obtained at point $x$, $y_{\text{min}} \equiv \min\limits_x y(x)$ is the function value at the global minimum, and $x^*_T \equiv \arg\min_{x_t} \{ y(x_t) \}_{t=1}^T$ is the input value for the best observation after $T$ iterations, such that the regret is non-negative. 
Similarly, the total regret is the sum of the instantaneous regret across all iterations
\begin{align} \label{eq:total_regret}
    \mathcal{R}_T = \sum_{i=1}^T \left[y(x^*_i) - y_{\text{min}}\right].
\end{align}
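As a worked example with hypothetical numbers, suppose $y_{\text{min}} = 0$ and the first $T = 4$ queries return $y(x_1) = 1.0$, $y(x_2) = 0.4$, $y(x_3) = 0.7$ and $y(x_4) = 0.1$. The best observations so far are then $y(x^*_1) = 1.0$, $y(x^*_2) = 0.4$, $y(x^*_3) = 0.4$ and $y(x^*_4) = 0.1$, so the gap to the optimum after the last iteration is
\begin{equation*}
    |y(x^*_4) - y_{\text{min}}| = 0.1,
\end{equation*}
while the accumulated gap over all iterations is $1.0 + 0.4 + 0.4 + 0.1 = 1.9$. The third query does not improve on the incumbent and therefore contributes the same gap as the second; the total regret thus rewards finding good points early, whereas the instantaneous regret only measures the final outcome.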

All regret values are reported after standardizing objective function values. Finally, we report the sharpness as the average negative entropy of the predictive distributions as evaluated on the test set across all BO iterations.
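As a point of reference (stated here for a Gaussian predictive distribution $\N(\mu, \sigma^2)$ and under the assumption of natural logarithms, with $\mathbb{H}$ denoting the differential entropy), the negative entropy has the closed form
\begin{equation*}
    -\mathbb{H}\left[\N(\mu, \sigma^2)\right] = -\tfrac{1}{2}\log\left(2\pi e \sigma^2\right),
\end{equation*}
so that, for example, $\sigma = 1$ gives approximately $-1.42$ and $\sigma = 0.1$ gives approximately $0.88$; larger values thus correspond to narrower (sharper) predictive distributions.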
%
For the choices of surrogate models, we use a GP with an RBF kernel, and optimize the hyperparameters of the kernel at every BO iteration using the exact marginal likelihood \citep{rasmussen2003gaussian}. We use two different mean-field BNN architectures: a smaller one (BNN Small) with a single hidden layer with 10 hidden neurons, and a larger one (BNN) with two hidden layers with 30 and 10 hidden neurons, respectively. Both are trained using the ELBO loss \citep{Blei19}. The DEs consist of 10 neural networks with two hidden layers, all trained using the MSE loss and the Adam optimizer \citep{kingma2014method}. Finally, the RFs have their hyperparameters tuned via CV on a grid of hyperparameters at each BO iteration. With regard to the AFs, we use EI as defined in eq. \eqref{eq:ei}, UCB with $\beta=1$, and only sample one posterior function at each BO step when using TS. See the Supplementary Material for detailed experimental descriptions. Code is available at \url{https://github.com/jfold/unibo}.

\paragraph{Experiment results}
The results for the synthetic and real data experiments are summarized in Tables \ref{tab:synth_results_table} and \ref{tab:real_results}, respectively. We observe that in the synthetic setting, GPs outperform all other models both in terms of instantaneous regret and, more importantly, total regret, although they are closely followed by DEs. RFs perform relatively well (at all times better than random sampling), whilst the BNNs exhibit poor performance and are often outperformed by random sampling. Finally, we see that the GP is best calibrated overall, and that all surrogate models have their lowest ECE when random sampling is used. This is overall not surprising, as the ECE is evaluated on a large i.i.d. test set, which is better represented by i.i.d. training samples than by strongly dependent samples acquired iteratively through BO. For the real-data experiments in Table \ref{tab:real_results}, we see that DEs outperform all other models in terms of both regret types, and are closely followed by both GPs and RFs, which perform comparably. Once again, GPs are the best calibrated when random sampling is employed.

\paragraph{Relationship between calibration and regret}
In order to investigate the relationship between BO performance (regret) and calibration (ECE), we first compute the Pearson correlation coefficient between the total regret values and the ECE values, which yields moderate and statistically significant coefficients of $0.28$ and $0.42$ for the synthetic and hyperparameter tuning experiments, respectively (see Table \ref{tab:correlation}). The moderate positive association is also visually confirmed by the scatter plots in Figure \ref{fig:regret-calibration-correlation}. It is also evident from these plots that the type of surrogate model is important for both ECE and total regret. Therefore, we also compute the partial correlation coefficient controlling for the model type, yielding $-0.06$ and $-0.24$ for synthetic and real data, respectively. Interestingly, both correlations change sign and become weaker in magnitude, and one becomes statistically insignificant (at level $\alpha = 0.05$), leading to an instance of Simpson's paradox \citep{Wagner1982SimpsonsPI}. To further investigate this, we conducted a multiple linear regression analysis for total regret vs. ECE, controlling for both the type of model and the specific problem instance. The results for the hyperparameter tuning experiments showed that both the common slope and the model-specific slopes for ECE were generally weak and statistically insignificant (see the Supplementary Material for details). In summary, these results show that models with high ECE are generally associated with high regret; however, this association diminishes when we control for the type of surrogate model.
%
To further scrutinize these observations, we conduct two additional experiments: an intervention study and a recalibration study.
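
To make the notion of controlling for the model type concrete, one common construction computes the partial correlation as the Pearson correlation between group-centered residuals. This is shown as an illustrative sketch only; the exact estimator behind Table \ref{tab:correlation} may differ. Let $r_i$ and $e_i$ denote the total regret and ECE of the $i$'th BO run, let $m(i)$ denote its surrogate model type, and let $\bar{r}_{m}$ and $\bar{e}_{m}$ be the averages over all runs using model type $m$ (this notation is introduced only for this illustration). Then
\begin{align*}
\tilde{r}_i &= r_i - \bar{r}_{m(i)}, \qquad \tilde{e}_i = e_i - \bar{e}_{m(i)},\\
\rho_{re \mid m} &= \frac{\sum_{i} \tilde{r}_i \tilde{e}_i}{\sqrt{\sum_{i} \tilde{r}_i^2}\,\sqrt{\sum_{i} \tilde{e}_i^2}}.
\end{align*}
Differences in average regret and ECE between model types are thereby removed, so only the within-model co-variation contributes to $\rho_{re \mid m}$. This is how a positive pooled correlation can coexist with a weak or negative partial correlation.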

\begin{table}[H]
\caption{Correlation values between regret and ECE.} 
\label{tab:correlation}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{l|ll}
%\hline
\textbf{} & \textbf{Synth. Data} & \textbf{Real Data} \\ \hline
\textbf{Correlation} & $\phantom{-}0.28$ ($p <10^{-8}$) & $\phantom{-}0.42$ ($p<10^{-4}$) \\ %\hline
\textbf{Partial Correlation | Model} & $-0.06$ ($p=0.19$) & $-0.24$ ($p=0.026$) \\ %\hline
%\textbf{Partial Corr.|Model, Problem} & $-0.16$ ($p=0.006$) & $-0.35$ ($p=0.005$) \\ \hline
\end{tabular}%
}
\end{table}



\paragraph{Intervention study: Perturbing Predictive Uncertainties}
In the intervention study, we explicitly manipulate calibration by perturbing the predictive uncertainty of each model during the BO protocol. Specifically, we multiply the standard deviation of the posterior distribution for all models by a constant $c \in \left[10^{-4}, 10^2\right]$ and observe the resulting effect on ECE and total regret. We conduct this experiment for the six different hyperparameter tuning problems using the EI acquisition function and repeat the experiment with 40 different seeds. In \cref{fig:regret-calibration-std-change}, we show the calibration error (a) and total regret (b) as a function of the multiplicative constant $c$. Several interesting observations can be made from \cref{fig:regret-calibration-std-change}. First, all models exhibit their smallest calibration error at $c > 1$, which indicates some degree of overconfidence; thus, slightly increasing the predictive variance generally improves calibration. Interestingly, DEs and GPs are somewhat robust to these perturbations in their predictive uncertainties with regard to regret, while RFs even seem to benefit from having the uncertainties reduced. Finally, in panel (c) we plot regret against calibration error for each value of $c$, where the size of each marker scales with $c$ and $c=1$ is marked in black. We have connected the points for each surrogate model, going from the smallest to the largest $c$. From this plot, we observe that perturbing by $c>1$ rapidly increases both regret and ECE, whereas perturbations with $c < 1$ are less harmful and may actually lead to improved performance. In other words, the results from this experiment suggest that miscalibration caused by models being generally underconfident, i.e. $c > 1$, is more detrimental to BO performance than miscalibration caused by models being overconfident, i.e. $c < 1$.
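
The effect of the multiplicative constant on the calibration curve can be made explicit in a stylized special case. Assume, purely for exposition and not as a property of the models studied here, that the unperturbed predictive distribution is Gaussian and perfectly calibrated, i.e. $y \mid x \sim \mathcal{N}\left(\mu(x), \sigma^2(x)\right)$. Scaling the predictive standard deviation by $c$ gives the predictive quantiles $F_c^{-1}(p) = \mu(x) + c\,\sigma(x)\,\Phi^{-1}(p)$, where $\Phi$ denotes the standard normal CDF, so that
\begin{align*}
\mathbb{P}\left[y \leq F_c^{-1}(p)\right] = \Phi\left(c\,\Phi^{-1}(p)\right).
\end{align*}
For $c = 1$ this equals $p$. For $c > 1$ it exceeds $p$ when $p > 1/2$ and falls below $p$ when $p < 1/2$ (underconfidence), and the reverse holds for $c < 1$ (overconfidence). For example, at $p = 0.9$ we have $\Phi^{-1}(0.9) \approx 1.28$, so $c = 2$ gives $\Phi(2.56) \approx 0.995$, whereas $c = 0.5$ gives $\Phi(0.64) \approx 0.74$.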

\paragraph{Recalibration study: Recalibration during BO}
In the recalibration study, we investigate whether recalibrating the models during the BO protocol improves BO performance, following the recalibration procedure proposed by \citet{deshpande2021calibration}. We do this by re-running our BO experiments on real data, where we use leave-one-out CV on the training data obtained during BO to learn a recalibration model and adjust the resulting predictive distributions accordingly. The results can be seen in the bottom section of Table \ref{tab:real_results}. Except for RFs, both regret and ECE are generally worse \emph{after recalibration}. This may seem counter-intuitive, but recall that we compute the recalibration model using leave-one-out CV on the training set, whereas we measure the expected calibration error on an independent test set. The recalibration procedure may have improved the calibration metric on the training dataset, but in our experiments, it does not generalize to an independent test set. We note that RFs do benefit from recalibration, but this might be explained by the fact that sharpness is substantially reduced after recalibration. We will shed more light on these observations in the next section.
%
\begin{figure*}[ht!]
    \begin{subfigure}[b]{0.33\textwidth}
    \includegraphics[width=\textwidth]{figs/ece_vs_c_real_data.pdf}
    \caption{ECE after BO protocol}
    \end{subfigure} 
    \begin{subfigure}[b]{0.33\textwidth}
    \includegraphics[width=\textwidth]{figs/R_vs_c_real_data.pdf}
    \caption{Regret after BO protocol}
    \end{subfigure}
    \begin{subfigure}[b]{0.33\textwidth}
    \includegraphics[width=\textwidth]{figs/ECE_vs_regret_c_real_data.pdf}
    \caption{Regret vs. ECE for various perturbations.}
    \end{subfigure}\\
    \caption{The effect on test calibration and regret of scaling the posterior predictive uncertainty to $c\cdot \sigma(\xb)$ during the BO protocol. (a) shows the overall ECE of each model when the perturbation $c\cdot \sigma(\xb)$ is applied in each iteration, (b) shows the corresponding total regret, and (c) depicts how regret and calibration vary together for the same experiments. The size of the markers indicates how large $c$ is, and the plot lines go from the smallest to the largest $c$. Black points correspond to $c=1$. \label{fig:regret-calibration-std-change}}
\end{figure*}

\vspace{-0.1cm}

\section{Discussion and Summary \label{sec:discussion}}



\vspace{-0.1cm}

In the previous section, we described and performed a number of numerical experiments to analyze the relationship between calibration and regret for BO. In this section, we will summarize and discuss some of the key take-aways as well as expand the analysis with a theoretical perspective.
\\
\\
\textbf{Take-away 1: Gaussian Processes and Deep Ensembles work well for BO.} Our results for synthetic data are consistent with the apparent consensus that GPs are strong surrogates for BO: they outperform the competing methods in terms of regret (both total and instantaneous; see \cref{tab:synth_results_table}), with DEs following closely. Surprisingly, in the hyperparameter tuning experiments, DEs perform exceedingly well, with RFs and GPs performing comparably to each other. One should, however, note the practical concern that DEs are computationally more expensive to train during the BO procedure, although this may be justified if such compute time is cheap relative to querying the objective function. In both experiments, the mean-field BNNs perform significantly worse than all other methods, including random search. Similar behavior has also been observed in other experimental design settings, e.g. active learning \citep{Foong2020}. In terms of ECE, the GPs performed slightly better than the RFs and DEs in the synthetic setting, whilst RFs and GPs perform comparably in the hyperparameter tuning setting. Again, we notice that the mean-field BNNs are inferior to the other methods in both experiments.



\textbf{Take-away 2: Correlation between BO performance and calibration diminishes when controlling for the type of surrogate model.}
For the synthetic and hyperparameter experiments, our analysis showed moderate positive correlations of $0.28$ and $0.42$, respectively, between total regret and ECE, when computed across all problems, seeds, acquisition functions, and surrogate models. However, when we control for the type of surrogate model, the correlations become much weaker and one becomes statistically insignificant (see Table \ref{tab:correlation}). That is, within each model family, BO trials with lower calibration error are generally not linearly associated with lower regret, and in turn with better BO performance.




\textbf{Take-away 3: Underconfidence might be more harmful to BO than overconfidence.}
In our intervention study, we manipulated all surrogate models to be either under- or overconfident during the BO protocol by multiplying their predictive uncertainties by a constant $c > 0$, where $0 < c<1$ implies more confident predictions and $c>1$ implies less confident predictions. The results showed that all models exhibited some degree of overconfidence, which may not be surprising. However, the results also showed that, for $c>1$, BO performance decreased (i.e. regret increased) rapidly for all models except the larger BNN, whereas BO performance was much more robust to perturbations with $c < 1$, which actually improved BO performance in some cases. Only for the GP did we observe a slight temporary improvement in regret for $c>1$. It is also worth emphasizing that the value of $c$ leading to optimal calibration did not coincide with the values leading to optimal regret. Finally, it is evident from eq. \eqref{eq:ei} that changing $c$ also affects the effective exploitation-exploration trade-off, which, in turn, may also impact the regret (the optimal trade-off is also likely to be intrinsic to the optimization problem). This can be observed in \cref{fig:regret-calibration-std-change}, where both very small and very large values of $c$ caused the methods to behave more like random search.
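
The dependence of the trade-off on $c$ can be sketched for EI with a Gaussian predictive distribution. The expression below assumes a maximization convention with incumbent value $f^{+}$ (notation introduced for this illustration), which may differ in sign from eq. \eqref{eq:ei}. Writing $\Phi$ and $\varphi$ for the standard normal CDF and PDF and $z(x) = \left(\mu(x) - f^{+}\right)/\sigma(x)$, replacing $\sigma(x)$ by $c\,\sigma(x)$ gives
\begin{align*}
\text{EI}_c(x) &= \left(\mu(x) - f^{+}\right)\Phi\left(\frac{z(x)}{c}\right) + c\,\sigma(x)\,\varphi\left(\frac{z(x)}{c}\right),\\
\lim_{c \to 0^{+}} \text{EI}_c(x) &= \max\left(\mu(x) - f^{+}, 0\right),\\
\lim_{c \to \infty} \frac{\text{EI}_c(x)}{c} &= \frac{\sigma(x)}{\sqrt{2\pi}}.
\end{align*}
Under these assumptions, small $c$ pushes the acquisition towards pure exploitation of the predictive mean, while large $c$ pushes it towards pure exploration of regions with large predictive uncertainty. In both limits, the balance between the two terms of EI is lost.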



\textbf{Take-away 4: Recalibration generally does not improve BO performance.}
We further investigated the potential benefit of recalibrating the surrogate models during the BO process using a leave-one-out procedure. However, in our recalibration experiments on the hyperparameter tuning datasets, the recalibration procedure only led to improved ECE (measured on a proper independent test set) for two surrogate models, namely the small BNNs and the RFs. In the other cases, it actually worsened the ECE. Moreover, we also noticed that all models attained worse total regret after employing the recalibration procedure.



\textbf{Hypothesis: Calibration curves are not reliable for small sample sizes.}
Recent work by \citet{deshpande2021calibration} observed that recalibration might aid BO by yielding smaller total regret in some trials and smaller instant regret at the last BO iteration in fewer trials. However, our experiments suggest that recalibration might actually degrade BO performance. \citet{kuleshov2018accurate} state that a sufficiently large i.i.d. validation set is a required condition for successful recalibration. This is in stark contrast to the samples collected during BO, which are not i.i.d. due to the inherently sequential nature of BO algorithms and are often few in number.

To investigate this hypothesis, our starting point will be a simple regression setting, where $p_y(y|x)$ denotes the true data generating distribution of $y$ given an input $x$. We further assume a trained model with predictive distribution $p_t(y|x)$ that aims to approximate $p_y$ based on training samples. Consider now the task of assessing the calibration of the model using a set of i.i.d. validation samples $\{y_1,y_2,\ldots,y_N\}$.
Given the small sample sizes typically used in BO, a natural question to ask is: how accurately can we assess the calibration curve as a function of the size of the validation set $N$? To investigate this question, we consider the variance of the estimator in eq. \eqref{eq:Cy} and analyze its decay rate as a function of the sample size $N$. The result is summarized in the following statement:
\begin{proposition}\label{prop1}
Let $F_i$ be the CDF of the predictive distribution for the $i$'th observation and let $\{ y_i \}_{i=1}^N$ be i.i.d. samples $y_i \sim p_y$. Then the variance of $\mathcal{C}_y(p) = \frac{1}{N} \sum_{i=1}^N \mathbb{I}\left[y_i \leq F_i^{-1}(p)\right]$ decays as $\mathbb{V}\left[\mathcal{C}_y(p)\right] = \mathcal{O}(N^{-1})$.
\label{thm:var-vs-n}
\end{proposition}
\begin{proof}
Let $\mathcal{C}_y(p) = \frac{1}{N} \sum_{i=1}^N z_i$ for $z_i \equiv \mathbb{I}\left[y_i \leq F_i^{-1}(p)\right]$. The variance of $\mathcal{C}_y(p)$ is then given by 
\begin{align*}
\mathbb{V}\left[\mathcal{C}_y(p) \right] &= \mathbb{V}\left[\frac{1}{N} \sum_{i=1}^N z_i\right]
\end{align*}
%
and, by independence of the $z_i$ and since each $z_i$ is a Bernoulli variable with $\mathbb{V}\left[z_i\right] \leq \frac{1}{2^2}$,
\begin{align*}
\mathbb{V}\left[\mathcal{C}_y(p) \right] &= \frac{1}{N^2} \sum_{i=1}^N \mathbb{V}\left[z_i\right]
\leq \frac{1}{N^2} \sum_{i=1}^N \frac{1}{2^2}= \frac{1}{N} \frac{1}{2^2}.
\end{align*}
Hence, it follows that the variance of $\mathcal{C}_y(p)$ satisfies
\begin{align}
\mathbb{V}\left[\mathcal{C}_y(p)\right] = \mathcal{O}\left(N^{-1}\right).
\end{align}
See the Supplementary Material for a detailed proof.
\end{proof}
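
To give a sense of scale, the bound $\mathbb{V}\left[\mathcal{C}_y(p)\right] \leq \frac{1}{4N}$ from the proof implies a worst-case standard deviation of $\frac{1}{2\sqrt{N}}$ for the estimated calibration curve. The values of $N$ below are chosen purely for illustration:
\begin{align*}
N = 10:&\quad \frac{1}{2\sqrt{10}} \approx 0.158,\\
N = 100:&\quad \frac{1}{2\sqrt{100}} = 0.050,\\
N = 1000:&\quad \frac{1}{2\sqrt{1000}} \approx 0.016.
\end{align*}
Each tenfold reduction of the worst-case standard deviation thus requires a hundredfold increase in $N$.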

We also confirmed this result empirically and observed results consistent with the predictions from \cref{prop1} (see the Supplementary Material), i.e. the maximum standard deviation of the estimator for $\mathcal{C}_y(p)$ decays as $\frac{1}{\sqrt{N}}$.
Next, we assume our model is perfect, i.e. $p_t(y|x) = p_y(y|x)$, and ask what is the contribution to ECE caused by a small sample size alone. The results are summarized in the next two statements:
\begin{proposition}
Let $F_i$ be the CDF of the predictive distribution of a perfect model, i.e. $p_t(y|x) = p_y(y|x)$. If $F_i$ is strictly monotonic, it holds that $\mathbb{V}\left[\mathcal{C}_y(p)\right] = \frac{p(1-p)}{N}$ for all $p$.
\label{prop2}
\end{proposition}
\begin{proof}
In this setting, we have
\begin{align*}
z_i = \mathbb{I}\left[y_i \leq F_i^{-1}(p)\right] =\mathbb{I}\left[F_i(y_i) \leq p\right] = \mathbb{I}\left[u_i \leq p\right],
\end{align*}
where $u_i \sim \mathcal{U}\left[0, 1\right]$ are uniformly distributed on the unit interval due to the probability integral transform. Since $\{ u_i \}_{i=1}^N$ are also independent, it follows that $S_N = \sum_{i=1}^N z_i \sim \text{Binomial}(N, p).$
%\begin{align*}
%S_n = \sum_{i=1}^N z_i \sim \text{Binomial}(N, p).
%\end{align*}
Therefore, we have
\begin{align*}
\mathbb{V}\left[\mathcal{C}_y(p)\right] = \mathbb{V}\left[N^{-1}S_N\right] = N^{-2} \mathbb{V}\left[S_N\right] = N^{-1}p(1-p).
\end{align*}
This completes the proof.
\end{proof}
%As seen in the top row of \cref{fig:n-vs-variance}, substantial variability should be expected for small sizes, and consequently, one may erroneously end up concluding that even a perfect model is miscalibrated for sufficiently small $N$. 
%To quantify this relationship futher, we provide a theoretical result for the calibration error in in this setting. 
\begin{proposition}
Let $\text{ECE} = \sum_{j=1}^P w_j (p_j - \mathcal{C}_y(p_j))^2$ be the weighted mean square calibration error. Assume that $w_j \in \left[0, 1\right]$ and $0 < p_1 < p_2 < \ldots < p_P < 1$ are fixed, and that the CDF of the predictive distribution is equal to the CDF of the true data distribution (almost everywhere). Then it holds that $\mathbb{E}\left[\text{ECE}\right] = \frac{1}{N}\sum_{j=1}^P w_jp_j(1-p_j) \propto N^{-1}$.
\label{thm:cal-err}
\end{proposition}
\begin{proof}
See Supplementary Material.
\end{proof}
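
The following sketch indicates why this result is plausible. It relies only on \cref{prop2} and linearity of expectation, with $N$ denoting the size of the validation set as in \cref{prop2}, and it may differ from the argument given in the Supplementary Material. For a perfect model, each indicator $\mathbb{I}\left[y_i \leq F_i^{-1}(p_j)\right]$ is Bernoulli distributed with mean $p_j$, so $\mathbb{E}\left[\mathcal{C}_y(p_j)\right] = p_j$ and therefore
\begin{align*}
\mathbb{E}\left[\left(p_j - \mathcal{C}_y(p_j)\right)^2\right] &= \mathbb{V}\left[\mathcal{C}_y(p_j)\right] = \frac{p_j(1-p_j)}{N},\\
\mathbb{E}\left[\text{ECE}\right] &= \sum_{j=1}^P w_j \frac{p_j(1-p_j)}{N}.
\end{align*}
As a numerical illustration, assume a grid of $P = 9$ levels $p_j = j/10$ with uniform weights $w_j = 1/9$ (a hypothetical choice made only for this example). Then $\sum_{j=1}^{9} p_j(1-p_j) = 1.65$, so $\mathbb{E}\left[\text{ECE}\right] \approx 0.183/N$, which is approximately $0.018$ for $N = 10$ even though the model is perfectly calibrated.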

\textbf{Take-away 5: Calibration curves may not be reliable for small sample sizes.}
\cref{prop1} and \cref{prop2} state that the variance of the estimator of the empirical calibration decays as $\mathcal{O}\left(N^{-1}\right)$. This suggests that empirical calibration curves may not be reliable for small sample sizes. In the worst case, to improve the accuracy of the estimates by one decimal place (i.e. to reduce the standard deviation by a factor of $10$), one needs to increase the size of the validation set by a factor of $100$, which will often be infeasible in practical BO settings. Furthermore, \cref{thm:cal-err} states that even for a perfect model, the expected ECE is proportional to $N^{-1}$. Therefore, for small sample sizes, one should be careful when concluding that a model is miscalibrated, since the observed ECE might as well be caused by the sample size. Even worse, when performing recalibration in this scenario, one risks adjusting the model in the ``wrong direction'', causing the model to be more miscalibrated than the original model. In the Supplementary Material, we show several examples of this phenomenon.
%
Although our empirical and theoretical analyses focus on the i.i.d. setting, we expect the effect to be even more severe in the non-i.i.d. case, since the effective sample size is typically smaller for correlated samples \citep{thiebaux1984interpretation}. Therefore, we argue that these effects may have a profound impact on recalibration in BO protocols.
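
As a rough illustration of how correlation reduces the effective sample size, consider a stylized model that is not meant to describe actual BO trajectories: a stationary sequence with common variance $\sigma_z^2$ and geometrically decaying autocorrelation $\rho^{k}$ at lag $k$, with $0 \leq \rho < 1$ (both $\sigma_z^2$ and $\rho$ are introduced only for this example). For large $N$, the variance of the sample mean is then approximately
\begin{align*}
\frac{\sigma_z^2}{N}\cdot\frac{1+\rho}{1-\rho} = \frac{\sigma_z^2}{N_{\text{eff}}}, \qquad N_{\text{eff}} = N\,\frac{1-\rho}{1+\rho}.
\end{align*}
Under this assumption, $\rho = 0.5$ gives $N_{\text{eff}} = N/3$ and $\rho = 0.9$ gives $N_{\text{eff}} = N/19$, so the variability of an empirical calibration curve can be considerably larger than the i.i.d. results suggest.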

\textbf{Concluding Remarks} In our experiments, we confirm the common knowledge that GPs generally work well in the BO setting, but interestingly, we also find that Deep Ensembles outperform GPs in some cases. Deep Ensembles have a computational downside compared to GPs; however, this overhead may be justified if the cost of evaluating the BO objective function is sufficiently high. Moreover, we observe that models with high ECEs are generally associated with worse performance in BO, but that this association diminishes when we control for the type of surrogate model. However, we still argue that calibration is important for BO because 1) models with lower ECEs are associated with better regrets and 2) when we explicitly intervened on calibration (by manipulating the predictive uncertainty), we observed that the BO performance of all models decreased significantly. Furthermore, our experiments suggest that recalibration during the BO protocol can hurt BO performance. Based on both theoretical and empirical evidence, we attribute this to the difficulty of reliably assessing calibration using the small (and non-i.i.d.) datasets typically used in BO. Therefore, we advocate caution when using these recalibration methods for small sample sizes in practice.

\textbf{Future work}
Our study indicates that the common way to diagnose calibration (on a large test set) might not be sensible for BO, and that future studies of calibration metrics more relevant to BO are needed.

It will also be of great interest to explore the relationship between calibration and regret from a causal perspective.
Lastly, it would be  interesting to dig deeper into the effects of under- vs. over-confidence on BO performance.



%\subsubsection*{Acknowledgements}
%JF was supported by the William Demant Foundation [grant number 18-4438].



\bibliography{foldager_370} 


%\onecolumn
\setcounter{figure}{0}
\renewcommand{\thefigure}{S\arabic{figure}}
\setcounter{table}{0}
\renewcommand{\thetable}{S\arabic{table}}
\setcounter{equation}{0}
\renewcommand{\theequation}{S\arabic{equation}}

\end{document}
