
\begin{figure*}[t] % 't' to position the figure at the top of the page
\centering
\begin{subfigure}[b]{0.38\linewidth}
    \centering
\begin{tikzpicture}[scale=0.3, every node/.style={scale=0.8}]
  \node[latent] (x) at (0, 0) {$x_i^t$};
  \node[latent, right=0.7cm of x] (fx) {$f(x_i^t, \theta_t)$};
  \node[latent, above=0.5cm of fx] (theta) {$\theta_t$};
  \node[latent, right=0.7cm of fx] (IBQ) {$\hat{I}_{\text{BQ}}(\theta_t)$};
\node[latent, right=0.7cm of IBQ] (ICBQ) {$\hat{I}_{\mathrm{CBQ}}(\theta)$};
\node[latent, above=0.5cm of ICBQ] (thetanew) {$\theta$};
  % Edges
  \edge[->,>=stealth] {x} {fx};
  \edge[->,>=stealth] {theta} {fx};
  \edge[->,>=stealth] {fx} {IBQ};
  \edge[->,>=stealth] {IBQ} {ICBQ};
  \edge[->,>=stealth] {thetanew} {ICBQ};
  \edge[->,>=stealth] {theta} {ICBQ};
  
  % Plate
  % \plate [inner xsep=0.7cm, inner ysep=0.4cm, xshift=0.0cm, yshift=0.2cm, color=gray, label={[label distance=-0.7cm, yshift=0.0cm, xshift=0.4cm]above left:$i=1:N$}] {N} {(x)(fx)} {};
    \plate [inner xsep=0.8cm, inner ysep=0.5cm, xshift=0.0cm, yshift=0.2cm, color=black, rounded corners=10pt, label={[label distance=-0.7cm, yshift=0.0cm, xshift=0.4cm]above left:$i=1:N$}] {N} {(x)(fx)} {};

  \plate [inner xsep=1.4cm, inner ysep=0.7cm, xshift=-0.1cm, yshift=0.0cm, color=black, rounded corners=10pt, label={[label distance=-0.8cm, yshift=0.0cm, xshift=0.0cm]above left:$t=1:T$}] {T} {(x)(fx)(theta)(IBQ)}{};
\end{tikzpicture}
    % \caption{Directed acyclic graph representation of CBQ. Circle nodes indicate random variables and large rectangles correspond to  independent replications over indices.}
    % \label{fig:DAG} 
  \end{subfigure}
  \hspace{40pt}
\begin{subfigure}{0.45\textwidth}
\centering
\includegraphics[width=\linewidth]{figures/BQ_CBQ_contour.pdf}
\vspace{-20pt}
\end{subfigure}
    
\caption{\emph{Illustration of CBQ.} \textbf{Left:} Directed acyclic graph representation. Circle nodes indicate random variables and rectangles correspond to  independent replications over indices. \textbf{Right:} BQ and CBQ posteriors on $I(\theta_{1:2})=[I(\theta_1), I(\theta_2)]^\top$ for $\theta_1 \approx \theta_2$. Unlike BQ, the CBQ posterior accounts for the relation between the two quantities.}
\label{fig:DAG_and_CBQ_bivariate_posterior}
\vspace{-15pt}
\end{figure*}

\section{Methodology}\label{sec:cbq}

\vspace{-2mm}
\emph{Conditional Bayesian quadrature} (CBQ) provides a Bayesian hierarchical model for $I(\theta^*)$ for any $\theta^* \in \Theta$, and the posterior mean of this hierarchical model is called the CBQ estimator. The algorithm falls into the realm of regression-based methods and can therefore be expressed in two stages:
%
\begin{itemize}[topsep=0pt,leftmargin=*]
    \item \textbf{Stage 1:}  Compute $\hat{I}_\mathrm{BQ}(\theta_{1:T}), \sigma^2_\mathrm{BQ}(\theta_{1:T})$ to obtain the BQ posterior mean and variance on $I(\theta_1),\ldots,I(\theta_T)$. 
    \item \textbf{Stage 2:} Perform GP regression over $I(\theta)$ using the outputs of stage 1. The posterior mean $\hat{I}_\mathrm{CBQ}(\theta)$ is the CBQ estimator for $I(\theta)$, and the variance $k_{\mathrm{CBQ}}(\theta,\theta)$  quantifies uncertainty. 
\end{itemize}
%
An illustrative figure is provided in \Cref{fig:illustration}.
This two-stage algorithm can also be summarised using the directed acyclic graph in \Cref{fig:DAG_and_CBQ_bivariate_posterior}, where the first stage corresponds to the part of the model inside the largest plate, and the second stage corresponds to the remainder of the graph. 
The CBQ posterior mean and covariance 
are given by
\begin{align*}
\begin{aligned}
    & \hat{I}_{\mathrm{CBQ}}(\theta)  := m_\Theta(\theta)+k_\Theta(\theta, \theta_{1:T}) \big(k_\Theta(\theta_{1:T}, \theta_{1:T}) \\ & \quad + \mathrm{diag}(\lambda_{\Theta}+ \sigma^2_\mathrm{BQ}(\theta_{1:T}))\big)^{-1} \big(\hat{I}_\mathrm{BQ}(\theta_{1:T}) - m_\Theta(\theta_{1:T})\big), \\
    & k_{\mathrm{CBQ}}(\theta,\theta')  := k_{\Theta}(\theta,\theta') - k_\Theta(\theta,\theta_{1:T}) \big( k_{\Theta}(\theta_{1:T}, \theta_{1:T}) \\ &\quad + \mathrm{diag}(\lambda_{\Theta}+ \sigma^2_\mathrm{BQ}(\theta_{1:T})) \big)^{-1} k_\Theta(\theta_{1:T},\theta'),
\end{aligned}
\end{align*}
where the observations $\{x_{1:N}^t,f(x_{1:N}^t,\theta_t)\}_{t=1}^T$ enter implicitly through $\hat{I}_\mathrm{BQ}(\theta_{1:T})$.
The terms $\hat{I}_\mathrm{BQ}(\theta_t)$ and $\sigma^2_\mathrm{BQ}(\theta_t)$ are the BQ posterior mean and variance for $I(\theta_t)$, $\mathrm{diag}(\lambda_{\Theta}+ \sigma^2_\mathrm{BQ}(\theta_{1:T}))$ is the diagonal matrix with the vector $\lambda_{\Theta}+ \sigma^2_\mathrm{BQ}(\theta_{1:T})$ on its diagonal, and $\lambda_{\Theta} \geq 0$ acts as a regulariser. The functions $m_\Theta:\Theta \rightarrow \R$ and $k_{\Theta}:\Theta \times \Theta \rightarrow \R$ are the prior mean and covariance of the stage 2 GP. 
Similarly to BQ, the ``quadrature'' terminology is justified since  $\hat{I}_\mathrm{CBQ}(\theta) = \sum_{t=1}^T \sum_{i=1}^N w_{i,t}^{\mathrm{CBQ}} f(x_i^t,\theta_t)$ for some weights $w_{i,t}^{\mathrm{CBQ}} \in \R$ when $m_\Theta(\theta)=0$.
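
To make these weights explicit, the following is a sketch under two assumptions that are not stated in this section: the stage 1 prior means are zero, so that each BQ estimator is itself a quadrature rule $\hat{I}_\mathrm{BQ}(\theta_t) = \sum_{i=1}^N w_{i,t}^{\mathrm{BQ}} f(x_i^t,\theta_t)$, and the symbols $w_{i,t}^{\mathrm{BQ}}$ and $v(\theta)$ are notation introduced here purely for illustration. Setting $m_\Theta = 0$ in the expression for $\hat{I}_\mathrm{CBQ}(\theta)$ gives
\begin{align*}
\begin{aligned}
    & v(\theta)^\top := k_\Theta(\theta, \theta_{1:T}) \big(k_\Theta(\theta_{1:T}, \theta_{1:T}) \\ & \quad + \mathrm{diag}(\lambda_{\Theta}+ \sigma^2_\mathrm{BQ}(\theta_{1:T}))\big)^{-1}, \\
    & \hat{I}_\mathrm{CBQ}(\theta) = \sum_{t=1}^T v_t(\theta) \hat{I}_\mathrm{BQ}(\theta_t) = \sum_{t=1}^T \sum_{i=1}^N v_t(\theta) w_{i,t}^{\mathrm{BQ}} f(x_i^t,\theta_t),
\end{aligned}
\end{align*}
so that one valid choice is $w_{i,t}^{\mathrm{CBQ}} = v_t(\theta) w_{i,t}^{\mathrm{BQ}}$. Under these assumptions, each CBQ weight factorises into a stage 2 regression weight, which depends on the query point $\theta$, and a stage 1 BQ weight, which does not.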


The first stage corresponds to the BQ procedure highlighted in \Cref{sec:bayesian_quadrature}: we model $f(\cdot,\theta_t)$ with independent $\text{GP}(m^t_{\calX},k^t_{\calX})$ priors, condition on observations $f(x^t_{1:N},\theta_t)$, and consider the posterior distribution on $I(\theta_t)$ for all $t \in \{1,\ldots,T\}$. We therefore require access to closed-form expressions for each of the $T$ kernel mean embeddings and initial errors (see the discussion in \Cref{appendix:tractable_kernel_means} of the kernel and distribution pairs that have a closed-form kernel mean embedding). 
Note that at this stage, we do not share any samples across the estimators of $I(\theta_1), \ldots, I(\theta_T)$.

In the second stage, we place a $\text{GP}(m_\Theta,k_\Theta)$ prior on $I:\Theta \rightarrow \R$, and assume $\hat{I}_\mathrm{BQ}(\theta_t)$ are noisy evaluations of $I(\theta_t)$: $\hat{I}_\mathrm{BQ}(\theta_t) = I(\theta_t) +\varepsilon_t$, where the noise terms $\varepsilon_t$ are independent zero-mean Gaussian noise with variance $\sigma^2_\mathrm{BQ}(\theta_t)$ for all $t \in \{1, \dots, T\}$. 
Note that $\hat{I}_\mathrm{BQ}(\theta_t)$ is a deterministic function of the samples $\theta_t, x^t_1, \ldots, x^t_N$, which are independent across $t = 1, \ldots, T$, so $\hat{I}_\mathrm{BQ}(\theta_1), \ldots, \hat{I}_\mathrm{BQ}(\theta_T)$ are also independent. 
As the variance of $\varepsilon_t$ is input-dependent, this corresponds to heteroscedastic GP regression \citep{Le2005}. 
We now briefly comment on the choice of prior and likelihood in this second stage:
\begin{itemize}[topsep=0pt,leftmargin=*]
    \item The $\text{GP}(m_\Theta,k_\Theta)$ prior can be used to encode prior knowledge about how the expectation $I(\theta)$ varies with the parameter $\theta$. Typically, the stronger this prior information, the faster the CBQ estimator's convergence rate will be; this statement will be made formal in \Cref{sec:theory}.

    \item The likelihood for the heteroscedastic GP is directly inherited from the BQ posteriors in the first stage: the posterior on $I(\theta_t)$ is a univariate normal with mean $\hat{I}_\mathrm{BQ}(\theta_{t})$ and variance $\sigma^2_\mathrm{BQ}(\theta_{t})$. As expected, when the number of samples $N$ grows, the BQ variance $\sigma^2_\mathrm{BQ}(\theta_t)$ will decrease, indicating that we are more certain about $I(\theta_t)$. This is then directly taken into account in stage 2. Note that  heteroscedasticity has previously been shown to be common in practice for LSMC \citep{Fabozzi2017}.
\end{itemize}
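To see how the stage 1 variance enters stage 2, it can help to consider the simplest case $T=1$ evaluated at $\theta = \theta_1$. This is a worked special case added for illustration, obtained by direct substitution into the expressions for $\hat{I}_{\mathrm{CBQ}}$ and $k_{\mathrm{CBQ}}$; the symbol $\rho$ is notation introduced here, and we assume $k_\Theta(\theta_1,\theta_1) > 0$:
\begin{align*}
\begin{aligned}
    & \rho := \frac{k_\Theta(\theta_1,\theta_1)}{k_\Theta(\theta_1,\theta_1) + \lambda_\Theta + \sigma^2_\mathrm{BQ}(\theta_1)} \in (0,1], \\
    & \hat{I}_{\mathrm{CBQ}}(\theta_1) = m_\Theta(\theta_1) + \rho \big(\hat{I}_\mathrm{BQ}(\theta_1) - m_\Theta(\theta_1)\big), \\
    & k_{\mathrm{CBQ}}(\theta_1,\theta_1) = (1-\rho) k_\Theta(\theta_1,\theta_1).
\end{aligned}
\end{align*}
When $\lambda_\Theta = 0$ and $\sigma^2_\mathrm{BQ}(\theta_1) \rightarrow 0$, we have $\rho \rightarrow 1$, so the CBQ estimator reproduces the BQ estimator and the posterior variance vanishes. When $\sigma^2_\mathrm{BQ}(\theta_1)$ is large relative to $k_\Theta(\theta_1,\theta_1)$, $\rho$ is close to $0$ and the estimator shrinks towards the prior mean $m_\Theta(\theta_1)$.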



% Note that in the special case where $f$ does not depend on $\theta$ but $\mathbb{P}_{\theta}$ does, the stage 1 GP prior on $f$ implies directly a GP prior on $I(\theta)$. Such GP, called a conditional mean process in \cite{chau2021deconditional} (see Proposition 3.2), has mean $m_\Theta(\theta) = \mathbb{E}_{X \sim \mathbb{P}_\theta}[m_{\cal{X}}(X)]$ and covariance $k_{\Theta}(\theta,\theta') = \mathbb{E}_{X \sim \mathbb{P}_{\theta},X' \sim \mathbb{P}_{\theta'}}[k_{\calX}(X,X')]$ which could be used directly for the second stage of CBQ.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%


% \paragraph{Comparison with regression-based methods} 
CBQ is closely related to  LSMC and KLSMC as it simply corresponds to different choices for the two stages. 
The main difference is in stage 1, where we use BQ rather than MC. This is where we expect the greatest gains for our approach due to the fast convergence rate of BQ estimators (this will be confirmed in \Cref{sec:theory}). For stage 2, we use heteroscedastic GP regression rather than polynomial or kernel ridge regression. As such, the second stage of KLSMC and CBQ is identical up to a minor difference in the way in which the Gram matrix $k_{\Theta}(\theta_{1:T}, \theta_{1:T})$ is regularised before inversion. 
Finally, one significant advantage of CBQ over LSMC and KLSMC is that it is a fully Bayesian model, meaning that we obtain a posterior distribution on $I(\theta)$ for any $\theta \in \Theta$.

The total computational cost of our approach is $\calO(T N^3 + T^3)$ due to the need to compute $T$ BQ estimators in the first stage and heteroscedastic GP regression in the second stage. 
Approximate GP approaches such as~\cite{titsias2009variational} could \emph{not} be used to reduce the cost because they introduce an additional layer of approximation which will slow down the convergence rate of CBQ. 
The cost of CBQ is higher than the $\calO(TN+p^3)$ and $\calO(TN+T^3)$ costs of LSMC and KLSMC respectively, but this higher cost is offset by a faster convergence rate (derived in \Cref{thm:convergence}), which makes CBQ competitive with baseline methods in practice (see \Cref{sec:experiments}).  
Additionally, in many applications (such as the SIR model in \Cref{sec:experiments}), the cost of evaluating the integrand will be much larger than the cost of the estimation method, so data-efficient methods like CBQ will be more efficient overall. 

Interestingly, CBQ also provides us with a joint Gaussian posterior on the expectation at $\theta^\ast_1, \ldots, \theta^\ast_{T_{\text{Test}}} \in \Theta$, which has mean vector $\hat{I}_{\mathrm{CBQ}}(\theta^\ast_{1:T_{\text{Test}}})$ and covariance matrix $k_{\mathrm{CBQ}}(\theta^\ast_{1:T_{\text{Test}}},\theta^\ast_{1:T_{\text{Test}}})$. This can be computed at an $\calO(T^2 T_{\text{Test}})$ cost, and is illustrated in the right plot of~\Cref{fig:DAG_and_CBQ_bivariate_posterior} on a synthetic example from \Cref{sec:experiments}. As observed, CBQ takes into account covariances between test points, in that the integral values will be similar for similar parameter values, whereas standard BQ treats each integral value independently.



 % \paragraph{Comparison with other probabilistic numerical methods} 
A natural alternative would be to place a GP prior directly on $(x,\theta) \mapsto f(x,\theta)$ and condition on all 
$N \times T$ observations. 
The implied distribution on $I(\theta_1), \ldots, I(\theta_T)$ would also be a multivariate Gaussian distribution. 
This approach coincides with the multi-output Bayesian quadrature (MOBQ) approach of \cite{xi2018bayesian} where multiple integrals are considered simultaneously. 
However, the computational cost of MOBQ is $\calO(N^3 T^3)$, due to fitting a GP on $N T$ observations, and quickly becomes intractable as $N$ or $T$ grow. 
A further comparison of CBQ and MOBQ can be found in~\Cref{appendix:cbq_mobq}.
The same holds true if $f$ does not depend on $\theta$, in which case the task reduces to the conditional mean process studied in Proposition 3.2 of \cite{chau2021deconditional}, and when $T=1$, we recover standard Bayesian quadrature. 
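
For a rough sense of scale, the cost expressions quoted above for CBQ and MOBQ can be compared at the illustrative values $N=50$ and $T=100$. These values are hypothetical, chosen here for the arithmetic and not taken from the experiments, and constants are ignored:
\begin{align*}
\begin{aligned}
    & \text{CBQ:} \quad T N^3 + T^3 = 100 \cdot 50^3 + 100^3 = 1.25 \times 10^7 + 10^6 = 1.35 \times 10^7, \\
    & \text{MOBQ:} \quad N^3 T^3 = (50 \cdot 100)^3 = 1.25 \times 10^{11}.
\end{aligned}
\end{align*}
In this setting the leading-order operation count of MOBQ is roughly $10^4$ times that of CBQ, and the gap widens as either $N$ or $T$ grows, since CBQ never factorises a matrix larger than $\max(N,T) \times \max(N,T)$.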

\paragraph{Hyperparameters}
The hyperparameter selection for CBQ boils down to the choice of GP interpolation hyperparameters at stage 1 and the choice of GP regression hyperparameters at stage 2. 
To simplify this choice, we renormalise all our function values before performing GP regression and interpolation. 
This is done by first subtracting the empirical mean and then dividing by the empirical standard deviation. 
The choice of covariance functions $k_\calX$ and $k_\Theta$ is made on a case-by-case basis, both to encode properties we expect the target functions to have and to ensure that the corresponding kernel mean is available in closed form (see \Cref{appendix:tractable_kernel_means}). 
Once this is done, we typically still need to choose hyperparameters for both kernels: lengthscales $l_\calX$, $l_\Theta$ and amplitudes $A_\calX, A_\Theta$. 
We also need to select the regularisers $\lambda_\calX, \lambda_\Theta$. 
We fix $\lambda_\calX = 0$, as suggested by \Cref{thm:convergence}, and select the remaining hyperparameters through empirical Bayes, which consists of maximising the log-marginal likelihood.
For more details on hyperparameter selection, please refer to \Cref{appendix:hyperparameter_selection}.
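
As an illustration of the empirical Bayes step for stage 2, the following sketch assumes the standard Gaussian log-marginal likelihood of a GP regression model with the heteroscedastic noise described above; the exact objective used is the one given in \Cref{appendix:hyperparameter_selection}, and $y$, $\Sigma$ and $L$ are shorthand introduced here. Writing $y := \hat{I}_\mathrm{BQ}(\theta_{1:T}) - m_\Theta(\theta_{1:T})$ and $\Sigma := k_\Theta(\theta_{1:T},\theta_{1:T}) + \mathrm{diag}(\lambda_{\Theta}+ \sigma^2_\mathrm{BQ}(\theta_{1:T}))$, the stage 2 hyperparameters would be chosen to maximise
\begin{align*}
    L(l_\Theta, A_\Theta, \lambda_\Theta) = -\frac{1}{2} y^\top \Sigma^{-1} y - \frac{1}{2} \log \det \Sigma - \frac{T}{2} \log (2\pi).
\end{align*}
The first term rewards fit to the stage 1 estimates, while the second penalises overly flexible covariance choices. Both can be evaluated from a single Cholesky factorisation of $\Sigma$, which is consistent with the $\calO(T^3)$ stage 2 cost.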