

\section{Background}\label{sec:background}

 We aim to compute the conditional expectation $I(\theta) = \mathbb{E}_{X \sim \mathbb{P}_\theta}[f(X,\theta)]$, where we assume that  $\calX \subseteq \R^d$, $\Theta \subseteq \R^p$, and $f(\cdot,\theta)$ is in  $\mathcal{L}^2(\mathbb{P}_\theta) :=\{ h:\calX \rightarrow \R : \|h\|_{\mathcal{L}^2(\mathbb{P}_\theta)} = (\mathbb{E}_{X \sim \mathbb{P}_\theta}[h^2(X)])^{1/2} <\infty\}$, the space of square-integrable functions with respect to $\mathbb{P}_\theta$ for all $\theta \in \Theta$. The latter is a minimal assumption which ensures that Monte Carlo estimators satisfy the central limit theorem. 
Our observations are:
\begin{talign}
\begin{aligned}
\theta_{1:T} &:= [\theta_1, \cdots, \theta_T]^\top \in \Theta^T, \quad x^t_{1:N} := [x^t_1, \cdots, x^t_N]^\top \in \calX^N, \\
 f(x^t_{1:N}, \theta_t) &:= [f(x^t_1,\theta_t), \cdots, f(x^t_N,\theta_t)]^\top \in \R^N,
\end{aligned}
\end{talign}
for all $t \in \{1,\cdots,T\}$, where we use square brackets to indicate vectors. This could straightforwardly be extended to allow a different number of samples $N_t$ per parameter value $\theta_t$, but we do not consider this case in order to simplify notation. In this section, we review existing methods for computing conditional expectations, followed by the core ingredient of our method: the Bayesian quadrature algorithm.

\subsection{Existing Methods for Computing Conditional Expectations}\label{sec:cond_exp}

Existing methods fall into two categories: sampling-based methods and regression-based methods. Throughout, we will assume that $x_{1:N}^t \sim \mathbb{P}_{\theta_t}$ for all $t \in \{1,\cdots,T\}$.

\vspace{-2mm}
\paragraph{Sampling-based Methods} 
We can construct a \emph{Monte Carlo} (MC) estimator \citep{Robert2004} for $I(\theta_t)$ through $\hat{I}_{\text{MC}}(\theta_t) := \frac{1}{N} \sum_{i=1}^N f(x_i^t,\theta_t)$. Unfortunately, we cannot estimate $I(\theta)$ for $\theta \notin \{\theta_{1}, \cdots,\theta_T \}$, and we can only use $N$ rather than $N T$ points to estimate each $I(\theta_t)$, making MC inappropriate for our task. A more suitable alternative is \emph{importance sampling} (IS).
Assume $\mathbb{P}_\theta$ has a Lebesgue density $p_\theta:\calX \rightarrow \R$ which has full support on $\calX$ for all $\theta \in \Theta$, and the integrand does not depend on $\theta$ (i.e. $f(x,\theta) = f(x)$). 
Then the IS estimator is able to make use of all $N T$ samples and can estimate $I(\theta)$ for any parameter $\theta \in \Theta$: $\hat{I}_{\text{IS}}(\theta) := \frac{1}{TN} \sum_{t=1}^T \sum_{i=1}^N \frac{p_{\theta}(x_i^t)}{p_{\theta_t}(x_i^t)} f(x_i^t)$. 
The choice of importance distributions  $\mathbb{P}_{\theta_1},\cdots,\mathbb{P}_{\theta_T}$ has been studied in \citep{Glynn1989,Tang2013}, but alternatives beyond this parametric family of distributions could also be used \citep{Demange-Chryst2022}.
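
As a brief check of why the importance weights allow samples to be pooled across parameter values, note that under the assumptions above (full support of $p_{\theta_t}$ on $\calX$ and an integrand that does not depend on $\theta$), for any $t \in \{1,\cdots,T\}$,
\begin{talign}
\begin{aligned}
\mathbb{E}_{X \sim \mathbb{P}_{\theta_t}}\left[\frac{p_{\theta}(X)}{p_{\theta_t}(X)} f(X)\right] = \int_{\calX} \frac{p_{\theta}(x)}{p_{\theta_t}(x)} f(x) p_{\theta_t}(x) \, dx = \int_{\calX} f(x) p_{\theta}(x) \, dx = I(\theta).
\end{aligned}
\end{talign}
Each of the $NT$ weighted evaluations therefore has expectation $I(\theta)$, regardless of which $\theta_t$ generated the sample. The variance of the weights can nevertheless be large when $p_{\theta}$ and $p_{\theta_t}$ differ substantially.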

 
\paragraph{Regression-based Methods}
The main regression-based method is least-squares Monte Carlo (LSMC) \citep{longstaff2001valuing}, which is a two-stage approach. Stage 1 consists of computing MC estimators  $\hat{I}_{\text{MC}}(\theta_1), \cdots, \hat{I}_{\text{MC}}(\theta_T)$, whilst stage 2 consists of estimating $I(\theta)$ through linear or polynomial regression based on the estimates from stage 1. Other regression methods could also be used; when stage 2 uses kernel ridge regression \citep{Han2009}, we will refer to the algorithm as kernelised least-squares Monte Carlo (KLSMC). KLSMC can be recognised as a generalisation of the kernel mean shrinkage estimators of \cite{muandet2016kernelmeanshrinkage,chau2021deconditional}. 
% The LSMC estimator $\hat{I}_{\text{LSMC}}(\theta)$ solves the problem for $\mathcal{F}(\Theta)$ being a space of order$-p$ polynomials, whereas the KLSMC estimator $\hat{I}_{\text{KLSMC}}(\theta)$ solves it for $\mathcal{F}(\Theta)$ being a ball in a reproducing kernel Hilbert space (RKHS) \citep{berlinet2011reproducing}\footnote{A more appropriate name for this algorithm would be ``kernel regression Monte Carlo", but we follow the terminology in \citep{chau2021deconditional} for simplicity and to avoid confusing readers familiar with this literature.}. 
Clearly, both the performance and computational cost of these estimators will depend on the regression method. 
LSMC costs $\calO(TN + p^3)$ with $p$ being the order of the polynomial, whereas KLSMC costs $\calO(TN + T^3)$. 
However, KLSMC will outperform LSMC when $I(\theta)$ cannot be approximated well by a low-order polynomial.
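
To make the two stages and their costs concrete, the following is a sketch of the KLSMC estimator in its standard kernel ridge regression form. It uses notation that is assumed here for illustration only: a kernel $k_\Theta:\Theta \times \Theta \rightarrow \R$, a regularisation parameter $\lambda_\Theta > 0$, and the vector of stage 1 estimates $\hat{I}_{\text{MC}}(\theta_{1:T}) := [\hat{I}_{\text{MC}}(\theta_1), \cdots, \hat{I}_{\text{MC}}(\theta_T)]^\top \in \R^T$:
\begin{talign}
\begin{aligned}
\hat{I}_{\text{KLSMC}}(\theta) = k_\Theta(\theta, \theta_{1:T})^\top \big(k_\Theta(\theta_{1:T}, \theta_{1:T}) + \lambda_\Theta \Id_T \big)^{-1} \hat{I}_{\text{MC}}(\theta_{1:T}),
\end{aligned}
\end{talign}
where $k_\Theta(\theta, \theta_{1:T}) \in \R^T$ has entries $k_\Theta(\theta, \theta_t)$. Computing the $T$ stage 1 averages accounts for the $\calO(TN)$ term, and solving the $T \times T$ linear system accounts for the $\calO(T^3)$ term. Some conventions scale the regulariser by $T$; this only changes the meaning of $\lambda_\Theta$.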



% \paragraph{Other Related Work} Alternative approaches for estimating $I(\theta)$ are based on multi-task or meta- learning \citep{xi2018bayesian,gessner2020active,Sun2021,Sun2023}. 
% This line of research tends to assume that several related expectations need to be computed, and the relationship between these expectations is encoded through a vector-valued RKHS, or that they are independent draws from a set of tasks. 
% Notably, they do not explicitly utilise the property that $I(\theta)$ is a smooth function of $\theta$, and will therefore be sub-optimal for our setting.
% Multilevel Monte Carlo methods are also popular in estimating expensive expectations, by combining samples from multiple levels of resolution~\citep{Giles2015}. However, they are not able to estimate new integrals $I(\theta^\ast)$ or $I(\theta)$ uniformly over $\theta \in \Theta$.


\subsection{Bayesian Quadrature}\label{sec:bayesian_quadrature}
In this section, we present Bayesian quadrature, the foundational component of our approach. Consider the expectation $I = \mathbb{E}_{X \sim \mathbb{P}} [f(X)]$ of some function $f:\calX \rightarrow \mathbb{R}$, where we emphasise that neither $f$ nor $\mathbb{P}$ depend on $\theta$ in this subsection. In Bayesian quadrature (BQ) \citep{Diaconis1988,OHagan1991BayesHermiteQ,Rasmussen2003,fx_quadrature}, we begin by positing a Gaussian process (GP) prior on $f$. We will denote this prior $\mathcal{GP}(m_{\calX},k_{\calX})$, where $m_\calX:\calX \rightarrow \mathbb{R}$ is the mean function and $k_{\calX}:\calX \times \calX \rightarrow \mathbb{R}$ is the covariance (or reproducing kernel) function. These two functions fully characterise the distribution, and can be used to encode prior knowledge about smoothness, periodicity, or sparsity of $f$.
Once a GP prior has been selected, we condition on noiseless function evaluations $f(x_{1:N}) = [f(x_1),\cdots,f(x_N)]^\top$ for $x_{1:N} \in \calX^N$. This leads to a posterior GP on $f$, which induces a univariate Gaussian posterior distribution $\mathcal{N}\big(\hat{I}_\mathrm{BQ},\sigma^2_\mathrm{BQ}\big)$ on $I$, where:
\vspace{-5pt}
\begin{talign}
\begin{aligned}
    \hat{I}_\mathrm{BQ} & = \mathbb{E}_{X \sim \mathbb{P}}[m_{\calX}(X)] + \mu(x_{1:N})^\top \big(k_{\calX}(x_{1:N}, x_{1:N}) + \lambda_{\calX} \Id_N \big)^{-1} \big(f(x_{1:N})-m_{\calX}(x_{1:N}) \big), \\
    \sigma^2_\mathrm{BQ} &= \mathbb{E}_{X,X'\sim \mathbb{P}}\left[k_{\calX}(X,X')\right] - \mu(x_{1:N})^\top \big(k_{\calX}(x_{1:N}, x_{1:N}) + \lambda_{\calX} \Id_N \big)^{-1} \mu(x_{1:N}).
\end{aligned}
\end{talign}

Here $\lambda_{\calX} \geq 0$ is a regularisation parameter, often called ``jitter'', which, although not essential from a statistical viewpoint, is often used to ensure the matrix can be numerically inverted \citep{Andrianakis2012}.
% and is known not to impact the asymptotic convergence rate of the GP \citep{Wendland2005}. 
The function  $\mu(x) = \mathbb{E}_{X \sim \mathbb{P}}[k_\calX(X,x)]$ is known as the kernel mean embedding~\cite{muandet2017kernel} of the distribution $\mathbb{P}$ and $\mathbb{E}_{X,X'\sim \mathbb{P}}\left[k_{\calX}(X,X')\right]$ is known as the initial error. 
These need to be available in closed-form, which is a rather strong requirement and does not hold for all pairs of kernel and distribution.
Fortunately, there are multiple solutions to this problem; see Table 1 in~\citep{fx_quadrature}, \citep{Nishiyama2016}, the \texttt{ProbNum} package~\citep{Wenger2021}, or Stein reproducing kernels \citep{anastasiou2023stein}. 
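
As one standard instance in which both quantities are available in closed form, consider the following illustrative setting, whose notation ($\ell$, $a$, $s$) is assumed here and not used elsewhere: $d=1$, the Gaussian kernel $k_{\calX}(x,x') = \exp(-(x-x')^2/(2\ell^2))$ with lengthscale $\ell > 0$, and $\mathbb{P} = \mathcal{N}(a, s^2)$. Standard Gaussian integrals then give
\begin{talign}
\begin{aligned}
\mu(x) &= \frac{\ell}{\sqrt{\ell^2 + s^2}} \exp\left(-\frac{(x-a)^2}{2(\ell^2 + s^2)}\right), \\
\mathbb{E}_{X,X'\sim \mathbb{P}}\left[k_{\calX}(X,X')\right] &= \frac{\ell}{\sqrt{\ell^2 + 2 s^2}}.
\end{aligned}
\end{talign}
The second line follows from $X - X' \sim \mathcal{N}(0, 2s^2)$ for independent $X, X' \sim \mathbb{P}$.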
% A discussion is provided in \Cref{appendix:tractable_kernel_means}. 




The posterior mean $\hat{I}_\mathrm{BQ}$ provides a point estimate for $I$ whilst the posterior variance $\sigma^2_\mathrm{BQ}$ gives a notion of uncertainty for $I$ which arises due to having only observed $f$ at $N$ points.
For BQ to be well-calibrated and the posterior variance $\sigma^2_\mathrm{BQ}$ to be meaningful, we need to select the GP prior and all associated hyperparameters carefully; this is usually achieved through empirical Bayes. 
% This consists of maximising the log-marginal likelihood over all hyperparameters $\gamma$ of the kernel $k_{\calX}$. 
% For example, kernels are often parametrised through an amplitude: $k_{\calX}(x,x') = \lambda \tilde{k}_{\calX}(x,x')$ with $\lambda>0$, in which case the optimum is known in closed-form. 
% For other parameters governing smoothness or lengthscale, the same does not hold and these must be optimised numerically; 
% A detailed discussion is provided in \Cref{appendix:hyperparameter_selection}.
It is noteworthy that BQ does not impose restrictions on how $x_{1:N}$ is selected, and as such does not require independent realisations from $\mathbb{P}$. 
In fact, a number of active learning approaches have proven popular; see, for example, \cite{gessner2020active}. 
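
To illustrate the roles of the posterior mean and variance, consider the simplest possible case, given here purely as an example: a single evaluation ($N=1$), a zero prior mean ($m_{\calX} = 0$), no jitter ($\lambda_{\calX} = 0$), and $k_{\calX}(x_1,x_1) > 0$. The expressions for $\hat{I}_\mathrm{BQ}$ and $\sigma^2_\mathrm{BQ}$ then reduce to
\begin{talign}
\begin{aligned}
\hat{I}_\mathrm{BQ} &= \frac{\mu(x_1)}{k_{\calX}(x_1,x_1)} f(x_1), \\
\sigma^2_\mathrm{BQ} &= \mathbb{E}_{X,X'\sim \mathbb{P}}\left[k_{\calX}(X,X')\right] - \frac{\mu(x_1)^2}{k_{\calX}(x_1,x_1)}.
\end{aligned}
\end{talign}
The point estimate is a rescaled function evaluation, and the posterior variance equals the initial error minus a non-negative term. This term depends on the location $x_1$ but not on the observed value $f(x_1)$, which is why the point set can be designed before any function evaluations are made.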

The convergence rate of the BQ estimator has been studied extensively \citep{fx_quadrature,wynne2021convergence} and is particularly fast for low- to mid-dimensional smooth integrands. This must be weighed against the computational cost, which is inherited from GP regression and is $\calO(N^3)$. For this reason, BQ has principally been applied to problems where sampling or evaluating the integrand is very expensive, so that only a small number of samples is available (small $N$).
Examples range from differential equation solvers \citep{Kersting2016}, variational inference \citep{Acerbi2018} and simulator-based inference \citep{Bharti2023} to applications in computer graphics \citep{Marques2013} and tsunami modelling \citep{li2022multilevel}. 
% For cheaper problems, \citep{Jagadeeswaran2018,Karvonen2017symmetric,Karvonen2019} propose BQ methods where the computational cost is much lower, but these are applicable only with specific point sets $x_{1:N}$ and distributions $\mathbb{P}$.




