\section{Introduction}\label{sec:introduction}
This paper considers the computational challenge of estimating certain intractable expectations which arise in machine learning, statistics, and beyond. Given a function $f:\calX \times \Theta \rightarrow \R$, we are interested in estimating \emph{conditional expectations} (sometimes also called parametric expectations) $I: \Theta \rightarrow \R$ uniformly over the parameter space $\Theta$, where:
\vspace{-5pt}
\begin{align*}
    I(\theta) = \E_{X \sim \mathbb{P}_\theta}[f(X,\theta)]=\int_\calX  f(x, \theta) \mathbb{P}_\theta(\mathrm{d} x), 
\end{align*}
and $\{\mathbb{P}_\theta\}_{\theta \in \Theta}$ is a family of distributions on the integration domain $\calX$. We will assume that $I$ is sufficiently smooth in $\theta$ that $I(\theta)$ and $I(\theta')$ are close whenever $\theta$ and $\theta'$ are close, but that $I$ is not available in closed form and must be approximated through samples and function evaluations. 

The computational challenge of approximating conditional expectations arises in many fields. It must be tackled when calculating tail probabilities in rare-event simulation \citep{Tang2013}, and when computing moment generating, characteristic, or cumulative distribution functions \citep{Giles2015,Krumscheid2018}. It also arises when computing the conditional value at risk or various valuations of options \citep{longstaff2001valuing,alfonsi2022many}, in Bayesian sensitivity analysis \citep{Lopes2011,Kallioinen2021}, and more broadly in scientific sensitivity analysis, for example when computing Sobol indices \citep{Sobol2001}. Conditional expectations $I(\theta)$ are also often computed as an intermediate quantity. 
For example, given $\phi:\R \rightarrow \R$ and some probability distribution $\mathbb{Q}$ on $\Theta$, we are often interested in the \emph{nested expectation} given by $\mathbb{E}_{\theta \sim \mathbb{Q}}[\phi(I(\theta))]$ \citep{Hong2009,Rainforth2018}. This problem arises when computing the expected information gain in Bayesian experimental design \citep{Chaloner1995}, and when computing the expected value of partial perfect information in health economics~\citep{heath2017review}.


Methods for computing $I(\theta)$ generally select $T$ parameter values $\theta_1,\cdots,\theta_T \in \Theta$, then simulate $N$ realisations from each corresponding probability distribution $\mathbb{P}_{\theta_1}, \cdots, \mathbb{P}_{\theta_T}$ at which they evaluate the integrand $f$, leading to a total of $N T$ evaluations. 
The usual approach is to use classical Monte Carlo methods to estimate $I(\theta_1), \cdots, I(\theta_T)$, but in many applications we are also interested in estimating either $I(\theta)$ for a fixed $\theta \notin \{\theta_1,\cdots,\theta_T\}$, or  $I(\theta)$ uniformly over $\theta \in \Theta$. 
As a result, a second step combining the estimates of $I(\theta_1), \cdots, I(\theta_T)$ is often required to complete the task. 
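
To make the first step concrete, write $x^t_1, \cdots, x^t_N \sim \mathbb{P}_{\theta_t}$ for the realisations simulated at $\theta_t$; this notation is introduced here purely for illustration. The classical Monte Carlo estimator at each selected parameter value then takes the form
\begin{align*}
    \hat{I}_{\text{MC}}(\theta_t) = \frac{1}{N} \sum_{i=1}^N f(x^t_i, \theta_t), \qquad t = 1, \cdots, T,
\end{align*}
which, assuming independent samples and square-integrable $f(\cdot,\theta_t)$, is unbiased with root-mean-square error of order $N^{-1/2}$.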

The most straightforward approach to estimating conditional expectations is importance sampling \citep{Glynn1989,Madras1999,Tang2013,Demange-Chryst2022}, where $I(\theta)$ is estimated by weighting function evaluations to account for the fact that the samples were not obtained from $\mathbb{P}_\theta$ but from the importance distributions $\mathbb{P}_{\theta_1}, \cdots, \mathbb{P}_{\theta_T}$. 
Unfortunately, this approach is only applicable when $f$ does not depend on $\theta$ (otherwise new expensive function evaluations are needed), and it is usually difficult to identify  importance distributions that can lead to an accurate estimator for small $N$ and $T$. 
Alternatively, least-squares Monte Carlo \citep{longstaff2001valuing,alfonsi2022many} first estimates $I(\theta_1),\cdots, I(\theta_T)$ through Monte Carlo, then estimates $I(\theta)$ through linear or polynomial regression based on these estimates. Its accuracy therefore depends on that of both the Monte Carlo estimators and the regression method. 
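
For concreteness, one common form of each baseline can be sketched as follows. The notation is introduced here for illustration only: $x^t_1, \cdots, x^t_N$ denote the realisations from $\mathbb{P}_{\theta_t}$, each $\mathbb{P}_\theta$ is assumed to admit a density $p_\theta$ with respect to a common reference measure with $p_{\theta_t} > 0$ wherever $p_\theta f \neq 0$, and $\psi: \Theta \rightarrow \R^q$ is an assumed vector of polynomial basis functions. When $f$ does not depend on $\theta$, an importance sampling estimator is
\begin{align*}
    \hat{I}_{\text{IS}}(\theta) = \frac{1}{NT} \sum_{t=1}^T \sum_{i=1}^N \frac{p_\theta(x^t_i)}{p_{\theta_t}(x^t_i)} f(x^t_i),
\end{align*}
whereas least-squares Monte Carlo regresses the Monte Carlo estimates onto the basis functions:
\begin{align*}
    \hat{I}_{\text{LSMC}}(\theta) = \hat{\beta}^\top \psi(\theta), \qquad \hat{\beta} \in \arg\min_{\beta \in \R^q} \sum_{t=1}^T \left( \frac{1}{N} \sum_{i=1}^N f(x^t_i, \theta_t) - \beta^\top \psi(\theta_t) \right)^2.
\end{align*}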

All of these existing methods suffer from two main limitations. Firstly, they are sample-intensive: they require a relatively large number of function evaluations (that is, large $N$ and $T$) to reach a given level of accuracy, which makes them infeasible if sampling or evaluating the integrand is expensive. Secondly, obtaining a finite-sample quantification of uncertainty for $I(\theta)$ is often infeasible. This is a significant limitation for challenging integration problems, for which we would ideally like to know how accurate our estimator is.

To tackle these limitations, we propose a novel algorithm called \emph{conditional Bayesian quadrature} (CBQ). The name comes from the fact that our approach extends the Bayesian quadrature algorithm~\citep{Diaconis1988,OHagan1991BayesHermiteQ,Rasmussen2003,fx_quadrature} to the computation of conditional expectations. As such, CBQ falls in the line of work on probabilistic numerical methods \citep{hennig2015probabilistic,Cockayne2017BPNM,Oates2019Modern,Hennig2022}.
Our algorithm is based on a hierarchical Bayesian model consisting of two stages of Gaussian process regression, and leads to a univariate Gaussian posterior distribution on $I(\theta)$ whose mean and variance are parametrised by $\theta$. See \Cref{fig:illustration} for an illustration. 
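
As a schematic sketch of the two stages (the formal construction is given in \Cref{sec:cbq}), suppose zero-mean Gaussian process priors with kernels $k_\calX$ on $\calX$ and $k_\Theta$ on $\Theta$. These kernels, the samples $x^t_{1:N} = (x^t_1, \cdots, x^t_N)$ from $\mathbb{P}_{\theta_t}$, and the quantities $\mu_t$ and $\sigma^2_{\text{BQ}}$ are notation assumed here for illustration. In the first stage, standard Bayesian quadrature conditions a Gaussian process on the $N$ noiseless evaluations at $\theta_t$ and integrates it against $\mathbb{P}_{\theta_t}$, with $\mu_t(x) = \int_\calX k_\calX(x, x') \mathbb{P}_{\theta_t}(\mathrm{d} x')$:
\begin{align*}
    \hat{I}_{\text{BQ}}(\theta_t) &= \mu_t(x^t_{1:N})^\top k_\calX(x^t_{1:N}, x^t_{1:N})^{-1} f(x^t_{1:N}, \theta_t), \\
    \sigma^2_{\text{BQ}}(\theta_t) &= \int_\calX \int_\calX k_\calX(x, x') \mathbb{P}_{\theta_t}(\mathrm{d} x) \mathbb{P}_{\theta_t}(\mathrm{d} x') - \mu_t(x^t_{1:N})^\top k_\calX(x^t_{1:N}, x^t_{1:N})^{-1} \mu_t(x^t_{1:N}).
\end{align*}
One way to realise the second stage, assumed here rather than taken as the definitive construction, is Gaussian process regression over $\theta$ that treats the first-stage means as observations with noise variances $\sigma^2_{\text{BQ}}(\theta_t)$:
\begin{align*}
    \hat{I}_{\text{CBQ}}(\theta) = k_\Theta(\theta, \theta_{1:T})^\top \left( k_\Theta(\theta_{1:T}, \theta_{1:T}) + \mathrm{diag}\left(\sigma^2_{\text{BQ}}(\theta_1), \cdots, \sigma^2_{\text{BQ}}(\theta_T)\right) \right)^{-1} \hat{I}_{\text{BQ}}(\theta_{1:T}),
\end{align*}
with the corresponding Gaussian process posterior variance at $\theta$ quantifying the remaining uncertainty.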




\begin{figure}[t]
    \centering
    \includegraphics[width=260pt]{figures/illustration.pdf}
    \caption{Illustration of \textit{conditional Bayesian quadrature} (CBQ) in \Cref{sec:cbq}. The first stage gives a GP posterior of $f(x,\theta)$ for each $\theta \in \{\theta_1, \cdots, \theta_T\}$, which are then integrated to give \textcolor{blue}{$\hat{I}_{\text{BQ}}(\theta_1), \cdots, \hat{I}_{\text{BQ}}(\theta_T)$}. The second stage then combines all BQ estimates from the first stage to give a GP posterior of $I(\theta)$: \textcolor{red}{$\hat{I}_{\text{CBQ}}(\theta)$}.
    All shaded areas represent Bayesian quantification of uncertainty.}
    \label{fig:illustration}
\end{figure}


This approach allows us to mitigate the two main limitations of existing methods. Firstly, we show both theoretically and empirically that our method converges rapidly to the true value and is hence more sample-efficient than baselines. This result holds under mild smoothness conditions on $f$ and $I$ whenever the dimensions of $\calX$ and $\Theta$ are not too large. As a result, a desired accuracy can be reached with smaller $N$ and $T$, and the method will therefore be preferable for expensive problems. Secondly, the fact that we have an entire posterior distribution on $I(\theta)$ allows us to provide finite-sample Bayesian quantification of uncertainty. 

The remainder of the paper is structured as follows. In \Cref{sec:background}, we 
review Bayesian quadrature and existing methods for computing conditional expectations. 
In \Cref{sec:cbq}, we formalise our novel \textit{conditional Bayesian quadrature} algorithm.  
In \Cref{sec:theory}, we establish the theoretical convergence of our method.
In \Cref{sec:experiments}, we provide empirical results and compare with baseline methods on challenging tasks in Bayesian sensitivity analysis, computational finance and decision making under uncertainty.



