
\begin{figure*}[t] % 't' to position the figure at the top of the page
\centering
\begin{subfigure}[b]{0.38\linewidth}
    \centering
\begin{tikzpicture}[scale=0.3, every node/.style={scale=0.8}]
  \node[latent] (x) at (0, 0) {$x_i^t$};
  \node[latent, right=0.7cm of x] (fx) {$f(x_i^t, \theta_t)$};
  \node[latent, above=0.5cm of fx] (theta) {$\theta_t$};
  \node[latent, right=0.7cm of fx] (IBQ) {$\hat{I}_{\mathrm{BQ}}(\theta_t)$};
\node[latent, right=0.7cm of IBQ] (ICBQ) {$\hat{I}_{\mathrm{CBQ}}(\theta)$};
\node[latent, above=0.5cm of ICBQ] (thetanew) {$\theta$};
  % Edges
  \edge[->,>=stealth] {x} {fx};
  \edge[->,>=stealth] {theta} {fx};
  \edge[->,>=stealth] {fx} {IBQ};
  \edge[->,>=stealth] {IBQ} {ICBQ};
  \edge[->,>=stealth] {thetanew} {ICBQ};
  \edge[->,>=stealth] {theta} {ICBQ};
  
  % Plate
  % \plate [inner xsep=0.7cm, inner ysep=0.4cm, xshift=0.0cm, yshift=0.2cm, color=gray, label={[label distance=-0.7cm, yshift=0.0cm, xshift=0.4cm]above left:$i=1:N$}] {N} {(x)(fx)} {};
    \plate [inner xsep=0.8cm, inner ysep=0.5cm, xshift=0.0cm, yshift=0.2cm, color=black, rounded corners=10pt, label={[label distance=-0.7cm, yshift=0.0cm, xshift=0.4cm]above left:$i=1:N$}] {N} {(x)(fx)} {};

  \plate [inner xsep=1.4cm, inner ysep=0.7cm, xshift=-0.1cm, yshift=0.0cm, color=black, rounded corners=10pt, label={[label distance=-0.8cm, yshift=0.0cm, xshift=0.0cm]above left:$t=1:T$}] {T} {(x)(fx)(theta)(IBQ)}{};
\end{tikzpicture}
    % \caption{Directed acyclic graph representation of CBQ. Circle nodes indicate random variables and large rectangles correspond to  independent replications over indices.}
    % \label{fig:DAG} 
  \end{subfigure}
  \hspace{40pt}
\begin{subfigure}{0.45\textwidth}
\centering
\includegraphics[width=\linewidth]{figures/BQ_CBQ_contour.pdf}
\vspace{-20pt}
\end{subfigure}
    
\caption{\emph{Illustration of CBQ.} \textbf{Left:} Directed acyclic graph representation. Circular nodes indicate random variables, and rectangles (plates) indicate independent replications over the corresponding indices. \textbf{Right:} Posteriors on $I(\theta_{1:2})=[I(\theta_1), I(\theta_2)]^\top$ for $\theta_1 \approx \theta_2$. Unlike BQ, the CBQ posterior accounts for the dependence between the two quantities.}
\label{fig:DAG_and_CBQ_bivariate_posterior}
\end{figure*}

\section{Methodology}\label{sec:cbq}

\vspace{-2mm}
\emph{Conditional Bayesian quadrature} (CBQ) provides a Bayesian hierarchical model for $I(\theta^*)$ for any $\theta^* \in \Theta$; we call the posterior mean of this model the CBQ estimator. The algorithm is a regression-based method and can therefore be expressed in two stages:
%
\begin{itemize}[topsep=0pt,leftmargin=*]
    \item \textbf{Stage 1:}  Compute $\hat{I}_\mathrm{BQ}(\theta_{1:T}), \sigma^2_\mathrm{BQ}(\theta_{1:T})$ to obtain the BQ posteriors on $I(\theta_1),\ldots,I(\theta_T)$. 
    \item \textbf{Stage 2:} Perform GP regression over $I(\theta)$ using the outputs of stage 1. The posterior mean $\hat{I}_\mathrm{CBQ}(\theta)$ is the CBQ estimator for $I(\theta)$, and the variance $k_{\mathrm{CBQ}}(\theta,\theta)$  quantifies uncertainty. 
\end{itemize}
%
This can be summarised using the directed acyclic graph in \Cref{fig:DAG_and_CBQ_bivariate_posterior}, where the first stage corresponds to the part of the model inside the largest plate, and the second stage corresponds to the remainder of the graph. The CBQ posterior mean and covariance 
are given by
\begin{align*}
\begin{aligned}
    & \hat{I}_{\mathrm{CBQ}}(\theta)  := m_\Theta(\theta)+k_\Theta(\theta, \theta_{1:T}) \big(k_\Theta(\theta_{1:T}, \theta_{1:T}) \\ & \quad + \mathrm{diag}(\lambda_{\Theta}+ \sigma^2_\mathrm{BQ}(\theta_{1:T}))\big)^{-1} \big(\hat{I}_\mathrm{BQ}(\theta_{1:T}) - m_\Theta(\theta_{1:T})\big), \\
    & k_{\mathrm{CBQ}}(\theta,\theta')  := k_{\Theta}(\theta,\theta') - k_\Theta(\theta,\theta_{1:T}) \big( k_{\Theta}(\theta_{1:T}, \theta_{1:T}) \\ &\quad + \mathrm{diag}(\lambda_{\Theta}+ \sigma^2_\mathrm{BQ}(\theta_{1:T})) \big)^{-1} k_\Theta(\theta_{1:T},\theta'),
\end{aligned}
\end{align*}
where $\lambda_{\Theta} \geq 0$ acts as a regulariser, $\hat{I}_\mathrm{BQ}(\theta_t)$ and $\sigma^2_\mathrm{BQ}(\theta_t)$ are the BQ posterior mean and variance for $I(\theta_t)$, $\mathrm{diag}(\lambda_{\Theta}+ \sigma^2_\mathrm{BQ}(\theta_{1:T}))$ is the diagonal matrix with the vector $\lambda_{\Theta}+ \sigma^2_\mathrm{BQ}(\theta_{1:T})$ on its diagonal, and $m_\Theta:\Theta \rightarrow \R$ and $k_{\Theta}:\Theta \times \Theta \rightarrow \R$ are the prior mean and covariance for the stage 2 GP. 
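
As a quick sanity check on these expressions (a worked special case for illustration, assuming $k_\Theta(\theta_1,\theta_1)>0$), suppose $T=1$. All matrices then reduce to scalars, and evaluating at $\theta=\theta'=\theta_1$ gives
\begin{align*}
\hat{I}_{\mathrm{CBQ}}(\theta_1) &= m_\Theta(\theta_1) + \frac{k_\Theta(\theta_1,\theta_1)}{k_\Theta(\theta_1,\theta_1) + \lambda_\Theta + \sigma^2_\mathrm{BQ}(\theta_1)} \big(\hat{I}_\mathrm{BQ}(\theta_1) - m_\Theta(\theta_1)\big), \\
k_{\mathrm{CBQ}}(\theta_1,\theta_1) &= \frac{k_\Theta(\theta_1,\theta_1)\,\big(\lambda_\Theta + \sigma^2_\mathrm{BQ}(\theta_1)\big)}{k_\Theta(\theta_1,\theta_1) + \lambda_\Theta + \sigma^2_\mathrm{BQ}(\theta_1)}.
\end{align*}
In this case the CBQ estimator shrinks the BQ estimate towards the prior mean, with a shrinkage factor that approaches one as $\lambda_\Theta + \sigma^2_\mathrm{BQ}(\theta_1) \rightarrow 0$, at which point the posterior variance at $\theta_1$ vanishes.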
% Similarly to BQ, the ``quadrature" terminology is justified since  $\hat{I}_\mathrm{CBQ}(\theta) := \sum_{t=1}^T \sum_{i=1}^N w_{ti}^{\mathrm{CBQ}} f(x_i^t,\theta_t)$ for some weights $w_{it}^{\mathrm{CBQ}} \in \R$ when $m_\Theta(\theta)=0$.



The first stage corresponds to the BQ procedure described in \Cref{sec:bayesian_quadrature}: we model $f(\cdot,\theta_t)$ with independent $\text{GP}(m^t_{\calX},k^t_{\calX})$ priors, condition on the observations $f(x^t_{1:N},\theta_t)$, and consider the resulting posterior distribution on $I(\theta_t)$ for each $t \in \{1,\ldots,T\}$. We therefore require closed-form expressions for each of the $T$ kernel mean embeddings and initial errors (see discussion in \Cref{appendix:tractable_kernel_means}). 
Note that at this stage, we do not share any samples across the estimators of $I(\theta_1), \ldots, I(\theta_T)$.
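
To make the link with quadrature explicit, the following sketch assumes that $m_\Theta \equiv 0$ and that each stage 1 estimator is a weighted sum $\hat{I}_\mathrm{BQ}(\theta_t) = \sum_{i=1}^N w^{\mathrm{BQ}}_{ti} f(x_i^t,\theta_t)$, as is the case for BQ with a zero prior mean $m^t_{\calX}$. The symbols $w^{\mathrm{BQ}}_{ti}$, $w^{\mathrm{CBQ}}_{ti}$ and $a(\theta)$ are notation introduced only for this illustration. Writing $a(\theta)^\top := k_\Theta(\theta, \theta_{1:T}) \big(k_\Theta(\theta_{1:T}, \theta_{1:T}) + \mathrm{diag}(\lambda_{\Theta}+ \sigma^2_\mathrm{BQ}(\theta_{1:T}))\big)^{-1}$, the expression for the CBQ estimator above becomes
\begin{align*}
\hat{I}_{\mathrm{CBQ}}(\theta) = \sum_{t=1}^T a_t(\theta)\, \hat{I}_\mathrm{BQ}(\theta_t) = \sum_{t=1}^T \sum_{i=1}^N w^{\mathrm{CBQ}}_{ti}(\theta)\, f(x_i^t,\theta_t), \qquad w^{\mathrm{CBQ}}_{ti}(\theta) := a_t(\theta)\, w^{\mathrm{BQ}}_{ti}.
\end{align*}
Under these assumptions, the CBQ estimator is itself a quadrature rule over all $TN$ function evaluations, with weights that depend on the target parameter $\theta$.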

In the second stage, we place a $\text{GP}(m_\Theta,k_\Theta)$ prior on $I:\Theta \rightarrow \R$ and treat the $\hat{I}_\mathrm{BQ}(\theta_t)$ as noisy evaluations of $I(\theta_t)$: $\hat{I}_\mathrm{BQ}(\theta_t) = I(\theta_t) +\varepsilon_t$, where the noise terms $\varepsilon_t$ are independent zero-mean Gaussian random variables with variance $\sigma^2_\mathrm{BQ}(\theta_t)$ for all $t \in \{1, \dots, T\}$. Since the noise variance is input-dependent, this corresponds to heteroscedastic GP regression \citep{Le2005}. We now briefly comment on the choice of prior and likelihood in this second stage:
\begin{itemize}[topsep=0pt,leftmargin=*]
    \item The $\text{GP}(m_\Theta,k_\Theta)$ prior can be used to encode prior knowledge about how the expectation $I(\theta)$ varies with the parameter $\theta$. Typically, the stronger this prior information, the faster the CBQ estimator's convergence rate will be; this statement will be made formal in \Cref{sec:theory}.

    \item The likelihood for the heteroscedastic GP is directly inherited from the BQ posteriors in the first stage: the posterior on $I(\theta_t)$ is a univariate normal with mean $\hat{I}_\mathrm{BQ}(\theta_{t})$ and variance $\sigma^2_\mathrm{BQ}(\theta_{t})$. As expected, when the number of samples $N$ grows, the BQ variance $\sigma^2_\mathrm{BQ}(\theta_t)$ will decrease, indicating that we are more certain about $I(\theta_t)$. This is then directly taken into account in stage 2. Note that  heteroscedasticity has previously been shown to be common in practice for LSMC \citep{Fabozzi2017}.
\end{itemize}
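
The closed-form CBQ posterior follows from standard Gaussian conditioning. As a sketch, take $\lambda_\Theta = 0$ and assume that the noise terms $\varepsilon_t$ are independent of $I$. The prior and likelihood above then imply the joint distribution
\begin{align*}
\begin{bmatrix} I(\theta) \\ \hat{I}_\mathrm{BQ}(\theta_{1:T}) \end{bmatrix}
\sim \mathcal{N}\left(
\begin{bmatrix} m_\Theta(\theta) \\ m_\Theta(\theta_{1:T}) \end{bmatrix},
\begin{bmatrix}
k_\Theta(\theta,\theta) & k_\Theta(\theta,\theta_{1:T}) \\
k_\Theta(\theta_{1:T},\theta) & k_\Theta(\theta_{1:T},\theta_{1:T}) + \mathrm{diag}(\sigma^2_\mathrm{BQ}(\theta_{1:T}))
\end{bmatrix}
\right),
\end{align*}
and conditioning the first block on the second recovers $\hat{I}_{\mathrm{CBQ}}(\theta)$ and $k_{\mathrm{CBQ}}(\theta,\theta)$ with $\lambda_\Theta = 0$. A positive regulariser $\lambda_\Theta$ adds $\lambda_\Theta$ to each diagonal entry of the lower-right block.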



% Note that in the special case where $f$ does not depend on $\theta$ but $\mathbb{P}_{\theta}$ does, the stage 1 GP prior on $f$ implies directly a GP prior on $I(\theta)$. Such GP, called a conditional mean process in \cite{chau2021deconditional} (see Proposition 3.2), has mean $m_\Theta(\theta) = \mathbb{E}_{X \sim \mathbb{P}_\theta}[m_{\cal{X}}(X)]$ and covariance $k_{\Theta}(\theta,\theta') = \mathbb{E}_{X \sim \mathbb{P}_{\theta},X' \sim \mathbb{P}_{\theta'}}[k_{\calX}(X,X')]$ which could be used directly for the second stage of CBQ.


% \paragraph{Comparison with regression-based methods} 
CBQ is closely related to LSMC and KLSMC: all three methods share the same two-stage structure and differ only in the choices made at each stage. 
The main difference is in stage 1, where we use BQ rather than MC. This is where we expect the greatest gains for our approach due to the fast convergence rate of BQ estimators (this will be confirmed in \Cref{sec:theory}). For stage 2, we use heteroscedastic GP regression rather than polynomial or kernel ridge regression. The second stages of KLSMC and CBQ are therefore identical up to a minor difference in the way the Gram matrix $k_{\Theta}(\theta_{1:T}, \theta_{1:T})$ is regularised before inversion. 
Finally, one significant advantage of CBQ over LSMC and KLSMC is that it is a fully Bayesian model, meaning that we obtain a posterior distribution on $I(\theta)$ for any $\theta \in \Theta$.
The total computational cost of our approach is $\calO(T N^3 + T^3)$: the first term comes from computing $T$ BQ estimators in the first stage, and the second from heteroscedastic GP regression in the second stage. This is higher than the $\calO(TN+p^3)$ and $\calO(TN+T^3)$ costs of LSMC and KLSMC, respectively, but as we will see in \Cref{sec:experiments}, CBQ will usually be competitive with these due to its faster convergence rate (derived in \Cref{thm:convergence}).
% Approximate GP approaches could be used to significantly reduce this cost; see e.g. \cite{titsias2009variational}. 

Interestingly, CBQ also provides us with a joint Gaussian posterior on the expectation at $\theta^\ast_1, \ldots, \theta^\ast_{T_{\text{Test}}} \in \Theta$ which has mean vector $\hat{I}_{\mathrm{CBQ}}(\theta^\ast_{1:T_{\text{Test}}})$ and covariance matrix $k_{\mathrm{CBQ}}(\theta^\ast_{1:T_{\text{Test}}},\theta^\ast_{1:T_{\text{Test}}})$. This can be computed at an $\calO(T^2 T_{\text{Test}})$ cost, and is illustrated in the right plot of~\Cref{fig:DAG_and_CBQ_bivariate_posterior} on a synthetic example from \Cref{sec:experiments}; as observed, CBQ takes into account that the expectation will be similar for similar parameter values, whereas standard BQ treats each expectation independently.
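
For instance, with two test parameters ($T_{\text{Test}}=2$, written out here purely as an illustration of the notation), the joint CBQ posterior reads
\begin{align*}
\begin{bmatrix} I(\theta^\ast_1) \\ I(\theta^\ast_2) \end{bmatrix}
\sim \mathcal{N}\left(
\begin{bmatrix} \hat{I}_{\mathrm{CBQ}}(\theta^\ast_1) \\ \hat{I}_{\mathrm{CBQ}}(\theta^\ast_2) \end{bmatrix},
\begin{bmatrix}
k_{\mathrm{CBQ}}(\theta^\ast_1,\theta^\ast_1) & k_{\mathrm{CBQ}}(\theta^\ast_1,\theta^\ast_2) \\
k_{\mathrm{CBQ}}(\theta^\ast_2,\theta^\ast_1) & k_{\mathrm{CBQ}}(\theta^\ast_2,\theta^\ast_2)
\end{bmatrix}
\right),
\end{align*}
where the off-diagonal entry $k_{\mathrm{CBQ}}(\theta^\ast_1,\theta^\ast_2)$ encodes the posterior dependence between the two expectations. If BQ were instead applied separately at the two parameter values, the corresponding covariance matrix would be diagonal, with entries $\sigma^2_\mathrm{BQ}(\theta^\ast_1)$ and $\sigma^2_\mathrm{BQ}(\theta^\ast_2)$.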



 % \paragraph{Comparison with other probabilistic numerical methods} 
A natural alternative would be to place a GP prior directly on $(x,\theta) \mapsto f(x,\theta)$ and condition on observations. The implied distribution on $I(\theta_1), \ldots, I(\theta_T)$ would also be a multivariate Gaussian distribution. 
This approach coincides with the multi-output Bayesian quadrature (MOBQ) approach of \cite{xi2018bayesian}. 
However, the computational cost is $\calO(N^3 T^3)$, due to fitting a GP on $N T$ observations, and quickly becomes intractable as $N$ or $T$ grow. 
% A further comparison of BQ and MOBQ can be found in~\Cref{appendix:cbq_mobq}.
The same holds true if $f$ does not depend on $\theta$, in which case the task reduces to the conditional mean process studied in Proposition 3.2 of \cite{chau2021deconditional}. When $T=1$, we recover BQ. 
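
To give a sense of scale for the costs quoted in this section, consider the purely illustrative values $T=100$ and $N=50$ (hypothetical numbers, not taken from the experiments). Ignoring constants and lower-order terms, the operation counts are
\begin{align*}
\text{CBQ:}\quad & TN^3 + T^3 = 100 \cdot 50^3 + 100^3 = 1.25 \times 10^7 + 10^6 = 1.35 \times 10^7, \\
\text{KLSMC:}\quad & TN + T^3 = 5 \times 10^3 + 10^6 \approx 1.0 \times 10^6, \\
\text{MOBQ:}\quad & N^3 T^3 = (5 \times 10^3)^3 = 1.25 \times 10^{11}.
\end{align*}
In this setting, the MOBQ count is roughly four orders of magnitude larger than that of CBQ, whereas CBQ is about one order of magnitude more expensive than KLSMC.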
