\section{Conditional Bayesian Quadrature}\label{sec:cbq}
In this section, we formally introduce our method. 

In the first stage, we estimate the conditional expectation $\E[f(Y)\mid X=x_i]$ for every $i$. For a fixed $x_i$, this is a standard Bayesian quadrature problem. For every $x_i$, we place a GP prior $\GP(0, {k_\calY}_i)$ on $f$ and condition on the noiseless observations $\{f(y_{i,j})\}_{j=1}^{s_i}$ to obtain the posterior $\bar{f}_i$, which is again a Gaussian process, with mean and covariance
\begin{align}
    \begin{aligned}
        \bar{m}_i(y) &= {k_\calY}_i(y, \bY^i) {k_\calY}_i(\bY^i, \bY^i)^{-1} f(\bY^i) \\ 
        \bar{k}_i(y, y') &=  {k_\calY}_i(y, y') -  {k_\calY}_i(y, \bY^i) {k_\calY}_i(\bY^i, \bY^i)^{-1} {k_\calY}_i(\bY^i, y')
    \end{aligned}
\end{align}

As in standard Bayesian quadrature (Equation \eqref{eq:bq}), the integral of the posterior $\bar{f}_i$ with respect to $p(y\mid x_i)$, written $\Pi\left[\bar{f}_i\right](x_i)$, is Gaussian with mean and variance
\begin{align}\label{eq:stage_one}
    \begin{aligned}
    \E\left[\Pi\left[\bar{f}_i\right](x_i)\right] &= \Phi_i {k_\calY}_i(\bY^i, \bY^i)^{-1} f(\bY^i) \\ 
    \Var\left[\Pi\left[\bar{f}_i\right](x_i)\right] &=  \varphi_i -  \Phi_i {k_\calY}_i(\bY^i, \bY^i)^{-1} \Phi_i^\top
    \end{aligned}
\end{align}
where $\Phi_i = \int {k_\calY}_i(y, \bY^i) p(y\mid x_i) dy \in \R^{s_i}$ and $\varphi_i = \int {k_\calY}_i(y, y') p(y\mid x_i) p(y'\mid x_i) dydy' \in \R$.
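
To see where Equation \eqref{eq:stage_one} comes from, one can integrate the posterior mean $\bar{m}_i$ and covariance $\bar{k}_i$ against $p(y\mid x_i)$. The following is only a sketch, assuming that all integrals exist and that integration can be exchanged with the expectation over $\bar{f}_i$:
\begin{align*}
    \E\left[\Pi\left[\bar{f}_i\right](x_i)\right] &= \int \bar{m}_i(y) p(y\mid x_i) dy = \left( \int {k_\calY}_i(y, \bY^i) p(y\mid x_i) dy \right) {k_\calY}_i(\bY^i, \bY^i)^{-1} f(\bY^i) = \Phi_i {k_\calY}_i(\bY^i, \bY^i)^{-1} f(\bY^i), \\
    \Var\left[\Pi\left[\bar{f}_i\right](x_i)\right] &= \int \int \bar{k}_i(y, y') p(y\mid x_i) p(y'\mid x_i) dy dy' = \varphi_i - \Phi_i {k_\calY}_i(\bY^i, \bY^i)^{-1} \Phi_i^\top .
\end{align*}
The second line uses the symmetry of ${k_\calY}_i$, so that integrating ${k_\calY}_i(\bY^i, y')$ against $p(y'\mid x_i)$ gives $\Phi_i^\top$.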

In effect, the first stage yields a mean $\E[\Pi\left[\bar{f}_i\right](x_i)]$ and an uncertainty $\Var[\Pi\left[\bar{f}_i\right](x_i)]$ for the conditional expectation $\E[f(Y)\mid X=x_i]$ at each observed value $x_i$. Since our target is the mean and variance of the conditional expectation at an arbitrary value $x$, we introduce a second Gaussian process.

In the second stage, we place a Gaussian process prior on $g: \calX \to \R$ with mean zero and covariance function $k_\calX: \calX \times \calX \to \R$, and assume that the mean estimates derived in the first stage are Gaussian-distributed around the true value $g(x_i)$ with heteroskedastic noise:
\begin{align}
    \E\left[\Pi\left[\bar{f}_i\right](x_i)\right] = g(x_i) + \epsilon_i, \quad \epsilon_i \sim \calN \left(0, \sigma_i^2\right), \quad \sigma_i = \sqrt{ \Var \left[\Pi\left[\bar{f}_i\right]\left(x_i\right)\right]}, \quad i=1, \ldots, n
\end{align}
With the GP prior and heteroskedastic Gaussian likelihood, we can obtain the posterior $\bar{g}$ with mean and covariance
\begin{align}\label{eq:stage_two}
\begin{aligned}
    \nu(x) &= k_\calX(x, \bX) (k_\calX(\bX, \bX) + \Sigma)^{-1} \left[\E\left[\Pi \left[\bar{f}_1 \right](x_1)\right], \cdots, \E\left[\Pi \left[\bar{f}_n \right](x_n)\right] \right]^\top \\
    q(x, x') &= k_\calX(x, x') -k_\calX(x, \bX) (k_\calX(\bX, \bX) + \Sigma)^{-1} k_\calX(\bX, x')
\end{aligned}
\end{align}
where $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2) \in \R^{n \times n}$ collects the first-stage variances and $\E\left[\Pi \left[\bar{f}_i \right](x_i)\right]$ is obtained from Equation \eqref{eq:stage_one}.

We thus obtain the mean $\nu(x)$ and the uncertainty estimate $q(x, x)$ for the conditional expectation $\E[f(Y)\mid X=x]$ at any given value of $x$.
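
As a sanity check on Equation \eqref{eq:stage_two}, consider the special case $n = 1$, which is chosen here purely for illustration. Writing $\hat{\mu}_1 = \E\left[\Pi\left[\bar{f}_1\right](x_1)\right]$ (a shorthand introduced only for this example) and $\sigma_1^2$ for the variance of $\epsilon_1$, the posterior mean and variance reduce to
\begin{align*}
    \nu(x) &= \frac{k_\calX(x, x_1)}{k_\calX(x_1, x_1) + \sigma_1^2} \hat{\mu}_1, \\
    q(x, x) &= k_\calX(x, x) - \frac{k_\calX(x, x_1)^2}{k_\calX(x_1, x_1) + \sigma_1^2} .
\end{align*}
At $x = x_1$, the weight on $\hat{\mu}_1$ is $k_\calX(x_1, x_1) / (k_\calX(x_1, x_1) + \sigma_1^2)$, which tends to one as $\sigma_1^2 \to 0$ and shrinks towards zero as the first-stage variance grows. Uncertain first-stage estimates therefore have less influence on the second-stage posterior.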

\paragraph{Complexity}
% We have shown in section 2 that both standard Bayesian quadrature and kernel ridge regression requires the complexity of $\calO(n^3 s_i^3)$. 
For each $i$, the first stage inverts an $s_i \times s_i$ Gram matrix at a cost of $\calO(s_i^3)$, and the second stage inverts an $n \times n$ matrix at a cost of $\calO(n^3)$, giving $\calO(\sum_{i=1}^n s_i^3 + n^3)$ in total. This is substantially smaller than the complexity of Bayesian quadrature, which is $\calO(s_i^3 n^3)$.
Scalable approaches from the Gaussian process literature, such as sparse variational GPs \citep{titsias2009variational}, can be used to further improve the scalability of our approach.
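
To give a rough sense of scale, take the hypothetical sizes $n = 100$ and $s_i = 50$ for all $i$ (numbers chosen only for illustration, not taken from our experiments). Counting the first stage once per $i$ and ignoring constants and the cost of kernel evaluations,
\begin{align*}
    \sum_{i=1}^{n} s_i^3 + n^3 &= 100 \cdot 50^3 + 100^3 = 1.25 \times 10^7 + 10^6 = 1.35 \times 10^7, \\
    s_i^3 n^3 &= 50^3 \cdot 100^3 = 1.25 \times 10^{11},
\end{align*}
which is a reduction of roughly four orders of magnitude in this illustrative setting.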

% \fxb{Open question: Is it possible for use to share information for different conditional expectations. Example:   Suppose we are interested in $E[f_1(Y)\mid X=x_1]$ and $E[f_2(Y)\mid X=x_2]$. Can we do anything for these kinds of cases?}


\subsection{Stein conditional Bayesian quadrature}

The two-stage Gaussian process approach requires knowledge of the kernel mean embedding $\Phi_i$ in Equation \eqref{eq:stage_one} for all $i$, which is a restrictive assumption in practice.
A similar constraint appears in standard Bayesian quadrature. In Section 2, we introduced the Stein kernel: if the covariance kernel is chosen to be a Stein kernel, then the mean embedding $\Phi_i$ is forced to be zero. We cannot afford $\Phi_i$ to be zero for all $i$, since the resulting estimate of the conditional expectation would then be identically zero. Instead, we propose to add a small constant kernel $d_i$, so that the kernel ${k_\calY}_i$ becomes
\begin{align}
\begin{aligned}
    {k_\calY}_i(y, y') &= k_p(y, y') + d_i = T_p^y[T_p^{y'}[k_\calY(y, y')]] + d_i \\
    &= \nabla_y \log p(y\mid x_i) k_\calY(y, y') \nabla_{y'} \log p(y'\mid x_i)
    + \nabla_{y} \log p(y\mid x_i) \nabla_{y'} k_\calY(y, y') \\
    &\quad + \nabla_{y'} \log p(y'\mid x_i) \nabla_y k_\calY(y, y') + \nabla_y \nabla_{y'} k_\calY(y, y') + d_i
\end{aligned}
\end{align}
where $k_\calY$ is a base kernel.

Therefore, the kernel mean embedding under the new kernel ${k_\calY}_i$ is
\begin{align}
    \Phi_i = \int {k_\calY}_i(y, \bY^i) p(y\mid x_i)dy = [d_i, \cdots, d_i] \in \R^{s_i}, \quad \varphi_i =  \int {k_\calY}_i(y, y') p(y\mid x_i) p(y'\mid x_i) dy dy' = d_i
\end{align}
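
Substituting these expressions into Equation \eqref{eq:stage_one} shows what the first stage computes under the Stein kernel. As a sketch, writing $K_i = {k_\calY}_i(\bY^i, \bY^i)$ and $\mathbf{1} \in \R^{s_i}$ for the vector of ones (both shorthands introduced only here), we get
\begin{align*}
    \E\left[\Pi\left[\bar{f}_i\right](x_i)\right] &= d_i \mathbf{1}^\top K_i^{-1} f(\bY^i), \\
    \Var\left[\Pi\left[\bar{f}_i\right](x_i)\right] &= d_i - d_i^2 \mathbf{1}^\top K_i^{-1} \mathbf{1} .
\end{align*}
No integral against $p(y\mid x_i)$ remains. The density enters only through the score $\nabla_y \log p(y\mid x_i)$ inside $k_p$.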

The constant $d_i$, along with the other kernel hyperparameters, including the lengthscale $l$ and the amplitude $a$, is learnt jointly by maximizing the marginal likelihood \citep{GPML}.

