

\section{Introduction}\label{sec:introduction}


This paper considers the computational challenge of estimating certain intractable expectations which arise in machine learning, statistics, and beyond. Given a function $f:\calX \times \Theta \rightarrow \R$, we are interested in estimating \emph{conditional expectations} (sometimes also called parametric expectations) $I: \Theta \rightarrow \R$ uniformly over the parameter space $\Theta$, where:
\begin{talign*}
    I(\theta) = \E_{X \sim \mathbb{P}_\theta}[f(X,\theta)]=\int_\calX  f(x, \theta) \mathbb{P}_\theta(\mathrm{d} x), 
\end{talign*}
and $\{\mathbb{P}_\theta\}_{\theta \in \Theta}$ is a family of distributions on the integration domain $\calX$. We will assume that $I(\theta)$ is sufficiently smooth in $\theta$ so that $I(\theta)$ and $I(\theta')$ are close whenever $\theta$ and $\theta'$ are close, but that $I$ is not available in closed form and must be approximated through samples and function evaluations. 

The computational challenge of approximating conditional expectations arises in many fields. It must be tackled when calculating tail probabilities in rare-event simulation \citep{Tang2013}, or when computing moment generating, characteristic, or cumulative distribution functions \citep{Giles2015,Krumscheid2018}. It also arises when computing the conditional value at risk or various valuations of options \citep{longstaff2001valuing,alfonsi2022many}, in Bayesian sensitivity analysis \citep{Lopes2011,Kallioinen2021}, and more broadly in scientific sensitivity analysis; see for example Sobol indices \citep{Sobol2001}. Conditional expectations $I(\theta)$ are also often computed as an intermediate quantity. 
For example, given $\phi:\R \rightarrow \R$ and some probability distribution $\mathbb{Q}$ on $\Theta$, we are often interested in the \emph{nested expectation} given by $\mathbb{E}_{\theta \sim \mathbb{Q}}[\phi(I(\theta))]$ \citep{Hong2009,Rainforth2018}. This problem arises when computing the expected information gain in Bayesian experimental design \citep{Chaloner1995}, and when computing the expected value of partial perfect information in health economics~\citep{heath2017review}.

Methods for computing $I(\theta)$ generally select $T$ parameter values $\theta_1,\ldots,\theta_T \in \Theta$, then simulate $N$ realisations from each corresponding probability distribution $\mathbb{P}_{\theta_1}, \ldots, \mathbb{P}_{\theta_T}$ at which they evaluate the integrand $f$, leading to a total of $N T$ evaluations. 
The usual approach is to use classical Monte Carlo methods to estimate $I(\theta_1), \ldots, I(\theta_T)$, but in many applications we are also interested in estimating either $I(\theta^*)$ for a fixed $\theta^* \notin \{\theta_1,\ldots,\theta_T\}$, or  $I(\theta)$ uniformly over $\theta \in \Theta$. 
As a result, a second step combining the estimates of $I(\theta_1), \ldots, I(\theta_T)$ is often required to complete the task. 
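
To fix ideas, one standard form of the Monte Carlo estimator at each selected parameter value can be written as follows. The sample notation $x^t_1,\ldots,x^t_N$ is introduced here purely for illustration, and we assume the realisations are drawn independently:
\begin{talign*}
    \hat{I}_{\mathrm{MC}}(\theta_t) = \frac{1}{N} \sum_{i=1}^N f(x^t_i, \theta_t), \qquad x^t_1, \ldots, x^t_N \sim \mathbb{P}_{\theta_t}, \qquad t = 1, \ldots, T.
\end{talign*}
Provided $f(X,\theta_t)$ has finite variance under $\mathbb{P}_{\theta_t}$, each such estimator is unbiased with root-mean-squared error of order $N^{-1/2}$, but on its own it carries no information about $I(\theta)$ at parameter values outside $\{\theta_1,\ldots,\theta_T\}$.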

The most straightforward approach to estimating conditional expectations is importance sampling \citep{Glynn1989,Madras1999,Tang2013,Demange-Chryst2022}, where $I(\theta)$ is estimated by weighting function evaluations to account for the fact that the samples were not obtained from $\mathbb{P}_\theta$ but from the importance distributions $\mathbb{P}_{\theta_1}, \ldots, \mathbb{P}_{\theta_T}$. 
Unfortunately, this approach is only applicable when $f$ does not depend on $\theta$, and it is usually difficult to identify  importance distributions leading to an accurate estimator for small $N$ and $T$. 
Alternatively, least-squares Monte Carlo  \citep{longstaff2001valuing,alfonsi2022many} first estimates $I(\theta_1),\ldots, I(\theta_T)$ through Monte Carlo, then estimates $I(\theta)$ through linear or polynomial regression based on these estimates. Its accuracy therefore depends on that of both the Monte Carlo estimators and the regression method. 
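
As a sketch of the two approaches above, and under notation assumed here for illustration only, suppose each $\mathbb{P}_\theta$ admits a density $p_\theta$ with respect to a common reference measure, let $x^t_1,\ldots,x^t_N$ denote the samples drawn from $\mathbb{P}_{\theta_t}$, and let $\psi:\Theta \rightarrow \R^p$ denote a vector of polynomial basis functions. When $f$ does not depend on $\theta$, one simple importance sampling estimator takes the form
\begin{talign*}
    \hat{I}_{\mathrm{IS}}(\theta) = \frac{1}{T N} \sum_{t=1}^T \sum_{i=1}^N \frac{p_\theta(x^t_i)}{p_{\theta_t}(x^t_i)} f(x^t_i),
\end{talign*}
which is unbiased provided that, for every $t$, $p_{\theta_t}(x) > 0$ whenever $p_\theta(x) f(x) \neq 0$; other weighting schemes are possible. Least-squares Monte Carlo with such a basis can instead be written as
\begin{talign*}
    \hat{I}_{\mathrm{LSMC}}(\theta) = \hat{\beta}^\top \psi(\theta), \qquad \hat{\beta} \in \operatorname*{arg\,min}_{\beta \in \R^p} \sum_{t=1}^T \Big( \frac{1}{N} \sum_{i=1}^N f(x^t_i, \theta_t) - \beta^\top \psi(\theta_t) \Big)^2.
\end{talign*}
The first expression makes explicit why the weights can become unstable when $p_\theta$ and $p_{\theta_t}$ differ substantially, and the second shows how errors in the inner Monte Carlo averages propagate into the regression.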

All of these existing methods share two main limitations. Firstly, they are sample-intensive: they require a relatively large number of function evaluations (that is, large $N$ and $T$) to reach a given level of accuracy, which makes them infeasible when sampling or evaluating the integrand is expensive. Secondly, obtaining a finite-sample quantification of uncertainty for $I(\theta)$ is often infeasible. This is a significant limitation for challenging integration problems, for which we would ideally like to know how accurate our estimator is likely to be.

To tackle these limitations, we propose a novel algorithm called \emph{conditional Bayesian quadrature} (CBQ). The name comes from the fact that our approach extends the Bayesian quadrature algorithm~\citep{Diaconis1988,OHagan1991BayesHermiteQ,Rasmussen2003,fx_quadrature} to the computation of conditional expectations. As such, CBQ falls in the line of work on probabilistic numerical methods \citep{hennig2015probabilistic,Cockayne2017BPNM,Oates2019Modern,Hennig2022}.
Our algorithm is based on a hierarchical Bayesian model consisting of two stages of Gaussian process regression, and leads to a univariate Gaussian posterior distribution on $I(\theta)$ whose mean and variance are parametrised by $\theta$. 

This approach allows us to mitigate the two main limitations of existing methods. Firstly, we show both theoretically and empirically that our method is more sample-efficient than alternatives under mild smoothness conditions on $f$ and $I(\theta)$ whenever the dimensions of $\calX$ and $\Theta$ are not too large. As a result, a desired accuracy can be reached with smaller $N$ and $T$, and the method will therefore be preferable for expensive problems. Secondly, the fact that we have an entire posterior distribution on $I(\theta)$ allows us to provide finite-sample Bayesian quantification of uncertainty. 

The remainder of the paper is structured as follows. In \Cref{sec:background}, we 
review Bayesian quadrature and existing methods for computing conditional expectations. 
In \Cref{sec:cbq}, we formalise our algorithm called \textit{conditional Bayesian quadrature}.  
In \Cref{sec:theory}, we establish a convergence rate for our method.
In \Cref{sec:experiments}, we provide empirical results and compare with baseline methods.



