\section{Conditional Bayesian Quadrature}\label{sec:cbq}
Let $(\Omega, \calF, \mathbb{P})$ be a probability space, and let $X: \Omega \to \calX$ and $Y: \Omega \to \calY$ be two random variables. Let $\calH_\calY$ be an RKHS with inner product $\langle \cdot, \cdot \rangle_{\calH_\calY}$ and kernel $k_\calY: \calY \times \calY \to \R$, and let $\calL(\calH_\calY)$ denote the set of all bounded linear operators from $\calH_\calY$ to $\calH_\calY$. We use $\x, \y$ to denote elements of $\calX$ and $\calY$, and $g_x: \calX \to \R$ and $g_y: \calY \to \R$ to denote functions in $\R^\calX$ and $\R^\calY$; specifically, $g_y \in \calH_\calY \subset \R^\calY$. Finally, we denote our observations by $\bX = \{\x_i\}_{i=1}^{n}, \bY = \{\y_i\}_{i=1}^{m}$.

We now consider the problem of integrating a function $g \in \calH_\calY$ with respect to a conditional measure, which we call ``Conditional Bayesian Quadrature''.

The integral of interest is
\begin{align}
    \Pi[g](\x) = \int g d\pi_{Y|X=\x}
\end{align}

The challenge is that we only observe finitely many sample points $\{\x_i\}_{i=1}^{n}$ and $\{\y_i, g(\y_i)\}_{i=1}^{m}$, and the value $\x$ on which we want to condition is very likely to be unobserved, in which case naive Monte Carlo sampling is no longer applicable.

Therefore, instead of estimating the integral for a single conditioning value $\x_i$, we estimate a function $\Pi[g]: \calX \to \R$ that takes any value $\x$ and returns the integral of $g$ with respect to the conditional measure given $X = \x$.

\begin{align}
    \Pi[g](\cdot) = \int g d\pi_{Y|X=\cdot}
\end{align}

Similar to Bayesian Quadrature above, we also propose our estimator from three different perspectives.

\subsection{Ridge Regression Solution.}
Suppose that $\calH_\calX$ is an RKHS with reproducing kernel $k_\calX: \calX \times \calX \to \R$, and that $\calH_\Gamma$ is a vector-valued RKHS with operator-valued reproducing kernel $k_\Gamma: \calX \times \calX \to \calL(\calH_\calY)$. The elements of $\calH_\Gamma$ are functions $F:\calX \to \calH_\calY$. In conditional Bayesian quadrature, we only consider the case where $k_\Gamma(\x,\x')=k_\calX(\x,\x')Id_{\calH_\calY}$.

The reproducing property of \eqref{eq:reproducing_1} becomes
\begin{align*}
\begin{split}
    \PSi{K_{\Gamma x}(g_y), K_{\Gamma x'} (g_{y'})}{\calH_\Gamma} &= \PSi{g_y, K_{\Gamma x'} (g_{y'})(x)}{\calH_\calY} \\
    &= \PSi{g_y, k_\Gamma(x, x')(g_{y'})}{\calH_\calY} \\
    &= \PSi{g_y, k_\calX(x, x')g_{y'}}{\calH_\calY} \\
    &= \PSi{k_\calX(x, \cdot), k_\calX(x', \cdot)}{\calH_\calX} \PSi{g_y, g_{y'}}{\calH_\calY}
\end{split}
\end{align*}

Similar to kernel mean embedding, we have the conditional kernel mean embedding.
\begin{align}
\begin{split}
    \Pi[g](\x) 
    &= \int g d\pi_{Y|X=\x} \\
    &= \PSi{g, \int k_\calY(y, \cdot) d\pi_{Y|X=\x}}{\calH_\calY} \\
    &\coloneqq \PSi{g, \calU_{Y|X=\x}}{\calH_\calY}
\end{split}
\end{align}
where $\calU_{Y|X=\x} \in \calH_\calY$ is the conditional kernel mean embedding. 
\begin{align}
    \calU_{Y|X=\x} = \int k_\calY(y, \cdot) d\pi_{Y|X=\x}
\end{align}

Since we are now estimating the function $\Pi[g](\cdot): \calX \to \R$, we hope to have a conditional kernel mean embedding $\calU_{Y|X}$ for any given value $\x$ \citep{measure_ckme}.
\begin{align}
    \calU_{Y|X} = \E_{Y|X}[k_\calY(Y, \cdot) | X]
\end{align}

$\calU_{Y|X}$ is an $X$-measurable random variable taking values in $\calH_\calY$, in the sense that $\calU_{Y|X}^{-1}(B) \in \sigma(X)$ for all $B \in \calB(\calH_\calY)$. It can also be regarded as an element of $\calH_\Gamma$ through the map $\x \mapsto \calU_{Y|X}(\x) = \calU_{Y|X=\x}$.

\paragraph{Remark:} It is more common to view $\calU_{Y|X}$ as an operator from $\calH_\calX$ to $\calH_\calY$. To remain consistent, we do not discuss operators here.

The target is now to estimate the conditional kernel mean embedding $\calU_{Y|X}$, which leads to the objective
\begin{align*}
    \argmin_{F \in \calH_\Gamma} \calE(F), \quad \calE(F) = \E_X[\norm{\calU_{Y|X} - F(X)}{\calH_\calY}^2]
\end{align*}

The empirical version is 

\begin{align}
    \argmin_{F \in \calH_\Gamma} \widehat{\calE(F)}, \quad \widehat{\calE(F)} = \sum_{p=1}^{n} \norm{\calU_{Y|X=\x_p} - F(\x_p)}{\calH_\calY}^2
\end{align}
% In the Bayesian Quadrature literature, the tractable form of kernel mean embedding on a given observation $\calU_{Y|X=\x_p}$ is assumed to be known. 

By the representer theorem, the optimal function $F^\dagger \in \calH_\Gamma$ is a linear combination of feature maps.

\begin{align}\label{eq:linear_comb_3}
\begin{split}
    F^\dagger(\cdot) = \sum_{p=1}^{n} K_{\Gamma \x_p} (f_p)(\cdot)
    = \sum_{p=1}^{n} k_\calX(\x_p, \cdot) f_p
\end{split}
\end{align}
where the coefficients $f_p \in \calH_\calY$ satisfy the linear equations

\begin{align}
    \sum_{p=1}^{n} k_\calX(\x_j, \x_p) f_p = \calU_{Y|X=\x_j}, \quad \forall j \in \{1,2, \cdots, n\}
\end{align}

The coefficients $\bff = [f_1, f_2, \cdots, f_n]^\top$ have the closed-form expression $\bff = k_\calX(\bX, \bX)^{-1} \Upsilon$, and the solution \eqref{eq:linear_comb_3} becomes

\begin{align}
    F^\dagger(\cdot) = k_\calX(\cdot, \bX) k_\calX(\bX, \bX)^{-1} \Upsilon
\end{align}
where $\Upsilon = [\calU_{Y|X=\x_1}, \calU_{Y|X=\x_2}, \cdots, \calU_{Y|X=\x_n}]^\top$ is a vector of $n$ functions and $\calU_{Y|X=\x_i} \in \calH_\calY$ for each $i \in \{1,2, \cdots, n\}$.
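
To see where the closed form comes from, the $n$ interpolation conditions can be stacked into a single linear system. The following is only a sketch, under the additional assumption (not stated above) that the Gram matrix $k_\calX(\bX, \bX)$ is invertible, e.g.\ for distinct $\x_1, \cdots, \x_n$ and a strictly positive definite kernel.
\begin{align*}
    \left[\begin{array}{ccc} k_\calX(\x_1, \x_1) & \cdots & k_\calX(\x_1, \x_n) \\ \vdots & \ddots & \vdots \\ k_\calX(\x_n, \x_1) & \cdots & k_\calX(\x_n, \x_n) \end{array}\right] \left[\begin{array}{c} f_1 \\ \vdots \\ f_n \end{array}\right] = \left[\begin{array}{c} \calU_{Y|X=\x_1} \\ \vdots \\ \calU_{Y|X=\x_n} \end{array}\right] \quad \Longrightarrow \quad \bff = k_\calX(\bX, \bX)^{-1} \Upsilon
\end{align*}
In particular, $F^\dagger(\x_j) = k_\calX(\x_j, \bX) k_\calX(\bX, \bX)^{-1} \Upsilon = \calU_{Y|X=\x_j}$ for every $j$, so the empirical objective vanishes at $F^\dagger$.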

As a sanity check, evaluate $F^\dagger$ at a point $\x' \in \calX$. Then $k_\calX(\x', \bX) k_\calX(\bX, \bX)^{-1}$ is a $1 \times n$ vector, so $F^\dagger(\x')$ is a linear combination of $\calU_{Y|X=\x_1}, \cdots, \calU_{Y|X=\x_n}$ and hence $F^\dagger(\x') \in \calH_\calY$. This is consistent with the requirement that $F^\dagger: \calX \to \calH_\calY$ is an element of $\calH_\Gamma$.

The final estimate of the conditional expectation given a value $\x'$ is
\begin{align}
\begin{split}
    \Pi[g](\x') 
    &= \PSi{g, \calU_{Y|X=\x'}}{\calH_\calY} \\
    &\approx \PSi{g, F^\dagger(\x')}{\calH_\calY} \\
    &= \PSi{g, k_\calX(\x', \bX) k_\calX(\bX, \bX)^{-1} \Upsilon}{\calH_\calY} \\
    &= k_\calX(\x', \bX) k_\calX(\bX, \bX)^{-1} \Upsilon_g
\end{split}
\end{align}

where
\begin{align*}
    \Upsilon_g
    &= [\PSi{g, \calU_{Y|X=\x_1}}{\calH_\calY}, \PSi{g, \calU_{Y|X=\x_2}}{\calH_\calY}, \cdots, \PSi{g, \calU_{Y|X=\x_n}}{\calH_\calY}]^\top \\
    &= [\int g d\pi_{Y|X=\x_1}, \int g d\pi_{Y|X=\x_2}, \cdots, \int g d\pi_{Y|X=\x_n}]^\top
\end{align*}
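
The last equality in the derivation of $\Pi[g](\x')$ uses only the linearity of the inner product. As a short illustration, write $[v_1, \cdots, v_n] = k_\calX(\x', \bX) k_\calX(\bX, \bX)^{-1}$, where the shorthand $v_p$ is introduced here purely for readability. The step can then be spelled out as
\begin{align*}
    \PSi{g, \sum_{p=1}^{n} v_p \, \calU_{Y|X=\x_p}}{\calH_\calY}
    = \sum_{p=1}^{n} v_p \PSi{g, \calU_{Y|X=\x_p}}{\calH_\calY}
    = \sum_{p=1}^{n} v_p \int g d\pi_{Y|X=\x_p}
\end{align*}
so the estimate at $\x'$ is a weighted sum of the conditional integrals at the observed conditioning values.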

If we only observe one sample $\y_1^i$ for each conditioning value $\x_i$, i.e., $m_i = 1$ for all $i$ and $n = m$, then approximating each integral in $\Upsilon_g$ by its single sample gives

\begin{align}
\begin{split}
    \Upsilon_g \approx [g(\y_1^1), g(\y_1^2), \cdots, g(\y_1^{n})]^\top = g(\bY) \quad \text{and} \quad 
    \Pi[g](\x') \approx k_\calX(\x', \bX) k_\calX(\bX, \bX)^{-1} g(\bY)
\end{split}
\end{align}

We now consider the more general case in which the observations $\bX = \{\x_i\}_{i=1}^{n}, \bY = \{\y_i\}_{i=1}^{m}$ contain several samples per conditioning value. Suppose that for a given $\x_i$, we have $\y_1^{i}, \cdots, \y_{m_i}^{i} \sim \pi_{Y|X=\x_i}$ and $\bY^i = [\y_1^{i}, \cdots, \y_{m_i}^{i}]^\top$. Estimating the integral $\int g d\pi_{Y|X=\x_i}$ from the $m_i$ function values $g(\bY^i)$ is then a standard Bayesian quadrature problem, as discussed in Section \ref{sec:bq}.

\begin{align*}
    \widehat{\int g d\pi_{Y|X=\x_i}} = \int k_\calY(\y, \bY^i) d\pi_{Y|X=\x_i} k_\calY(\bY^i, \bY^i)^{-1} g(\bY^i) 
\end{align*}
To summarize, the estimate of the conditional Bayesian quadrature is
\begin{align}\label{eq:CBQ_1}
    \Pi[g](\x') = k_\calX(\x', \bX) k_\calX(\bX, \bX)^{-1} [\widehat{\int g d\pi_{Y|X=\x_1}}, \widehat{\int g d\pi_{Y|X=\x_2}}, \cdots, \widehat{\int g d\pi_{Y|X=\x_n}}]^\top
\end{align}


To express everything in the form of BQ weights, we rewrite \eqref{eq:CBQ_1} in the following form.
\begin{align}
    \Pi[g](\x') = \sum_{i=1}^{n} v_i \sum_{j=1}^{m_i} w_{i, j} g(\y^i_j)
\end{align}
where 
\begin{align*}
    \bv &= [v_1, \cdots, v_{n}] \in \R^{1 \times n}, \quad \bv = k_\calX(\x',  \bX)k_\calX(\bX, \bX)^{-1} \\ 
    \bw_i &= [w_{i, 1}, \cdots, w_{i, m_i}] \in \R^{1 \times m_i}, \quad \bw_i = \int k_\calY(\y, \bY^i) d\pi_{Y|X=\x_i} k_\calY(\bY^i, \bY^i)^{-1}
\end{align*}

\subsubsection{A Complete Regression Perspective}
There are two regression problems in total. To simplify the formula, we denote $\int k_\calY(\bY^i, \y) d\pi_{Y|X=\x_i}$ as $\Phi_i \in \R^{m_i \times 1}$.

The first regression problem is 

\begin{align}
    \argmin_{F_i \in \calH_\calY} \sum_{j=1}^{m_i} |F_i(\y_j^i) - \calU_{Y|X=\x_i}(\y_j^i)|^2, \quad \forall i \in \{1,2,\cdots,n\}
\end{align}
We have the optimal solution
\begin{align}
    F_i^\dagger(\cdot) = \Phi_i^\top k_\calY(\bY^i, \bY^i)^{-1} k_\calY(\bY^i, \cdot) 
\end{align}

The second regression problem is 
\begin{align}
    \argmin_{F \in \calH_\Gamma} \sum_{i=1}^n \norm{F(\x_i) - F_i^\dagger}{\calH_\calY}^2
\end{align}
We have the optimal solution 
\begin{align}\label{eq:sol_1}
    F^\dagger(\cdot) = k_\calX(\cdot, \bX) k_\calX(\bX, \bX)^{-1} \left[F_1^\dagger, \cdots, F_n^\dagger\right]^\top
\end{align}
The estimate of the conditional expectation becomes
\begin{align}\label{eq:sol_2}
\begin{split}
    \Pi[g](\x') &= \PSi{g, F^\dagger(\x')}{\calH_\calY} \\
    &= k_\calX(\x', \bX) k_\calX(\bX, \bX)^{-1} \left[\begin{array}{c} \int k_\calY(\y, \bY^1) d\pi_{Y|X=\x_1} k_\calY(\bY^1, \bY^1)^{-1} g(\bY^1) \\ \vdots \\ \int k_\calY(\y, \bY^n) d\pi_{Y|X=\x_n} k_\calY(\bY^n, \bY^n)^{-1} g(\bY^n)
    \end{array}\right]_{n \times 1}
\end{split}
\end{align}

\subsection{A GP Perspective.}
As before, let $(\Omega, \calF, \mathbb{P})$ be a probability space with two random variables $X: \Omega \to \calX$ and $Y: \Omega \to \calY$. 
The integral of interest is
\begin{align}
    \Pi[g](\x) = \int g d\pi_{Y|X=\x}
\end{align}
The observed samples are $\bX = \{\x_i\}_{i=1}^{n}$, and for each $\x_i$ we draw $m_i$ samples from the conditional distribution $\pi_{Y|X=\x_i}$. That is, $\y_1^{i}, \cdots, \y_{m_i}^{i} \sim \pi_{Y|X=\x_i}$, and we denote $\bY^i = [\y_1^{i}, \cdots, \y_{m_i}^{i}]^\top$.

For every $\x_i$, we place a GP prior $\GP(0, k_\calY)$ on $g: \calY \to \R$ and condition on the observations $(\y_1^i, g(\y_1^i)), \cdots, (\y_{m_i}^i, g(\y_{m_i}^i))$. We then obtain the posterior of $g$, denoted $g_i^{posterior}$, with mean and covariance
\begin{align*}
    \mu[g_i^{posterior}](\y) &= k_\calY(\y, \bY^i)  k_\calY(\bY^i, \bY^i)^{-1} g(\bY^i), \\
    \V[g_i^{posterior}](\y, \y') &= k_\calY(\y, \y') - k_\calY(\y, \bY^i) k_\calY(\bY^i, \bY^i)^{-1} k_\calY(\bY^i, \y')
\end{align*}
Here we assume a noiseless GP, although in practice a small perturbation may be added to the diagonal of $k_\calY(\bY^i, \bY^i)$ for numerical stability. 

The integral of the posterior $g_i^{posterior}$ against the conditional distribution $\pi_{Y|X=\x_i}$ then follows a Gaussian distribution.
We denote this integral by $\Pi[g_i^{posterior}] = \int g_i^{posterior} d\pi_{Y|X=\x_i}$; its mean and variance are

\begin{align*}
    \mu[\Pi[g_i^{posterior}]] &= \Phi_i^\top k_\calY(\bY^i, \bY^i)^{-1} g(\bY^i), \\
    \V[\Pi[g_i^{posterior}]] &= \varphi_i - \Phi_i^\top k_\calY(\bY^i, \bY^i)^{-1} \Phi_i
\end{align*}
where $\Phi_i = \int k_\calY(\bY^i, \y) d\pi_{Y|X=\x_i}$ and $\varphi_i = \int \int k_\calY(\y', \y) d\pi_{Y|X=\x_i}(\y') d\pi_{Y|X=\x_i}(\y)$.
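
Both $\Phi_i$ and $\varphi_i$ require integrals of the kernel in closed form. As an illustration only, suppose $\calY = \R$, the kernel is Gaussian, $k_\calY(\y, \y') = \exp(-(\y - \y')^2 / (2\ell^2))$, and the conditional distribution is Gaussian, $\pi_{Y|X=\x_i} = \calN(\mu_i, s_i^2)$. These choices, and the symbols $\ell$, $\mu_i$ and $s_i$, are assumptions made for this example and are not part of the setup above. In that case
\begin{align*}
    [\Phi_i]_j &= \int k_\calY(\y_j^i, \y) d\pi_{Y|X=\x_i}(\y) = \sqrt{\frac{\ell^2}{\ell^2 + s_i^2}} \exp\left(-\frac{(\y_j^i - \mu_i)^2}{2(\ell^2 + s_i^2)}\right), \\
    \varphi_i &= \sqrt{\frac{\ell^2}{\ell^2 + 2 s_i^2}}
\end{align*}
so that the mean and variance of $\Pi[g_i^{posterior}]$ can be evaluated from the samples $\bY^i$ and the function values $g(\bY^i)$ alone.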

\fxb{Actually, if we only have one function $g$, why are we using different GPs? Surely we can just combine the function evaluations into one GP, then integrate that one GP under the distribution we actually want to integrate against? This is interesting because it shows a difference between the kernel ridge regression perspective and this GP perspective. I think the kernel-ridge regression perspective could give you an optimal quadrature rule given we want to integrate any function $g \in \mathcal{H}_y$, rather than a fixed function $g$. } \masha{nice! I think just replacing evaluations $\bY^i$ above with $\bY=\{y_1^1, \dots y_{m_1}^1, y_1^2, \dots y_{m_2}^2, \dots y_1^n, \dots y_{m_n}^n\}$ should do it, no other modifications needed. One catch is that $\bY$ might be very large, so we might want to consider sparse approximations of the inverse Gram matrix $k_\calY(\bY, \bY)^{-1}$, eg the Nyström approximation.}

Note that for every $\x_i$, we have a Gaussian distribution over $\Pi[g_i^{posterior}]$ with mean $\mu[\Pi[g_i^{posterior}]]$ and variance $\V[\Pi[g_i^{posterior}]]$. 
We place a GP prior $f_x \sim \GP(0, k_\calX)$ on $f_x: \calX \to \R$ and condition on the observations $(\x_1, \mu[\Pi[g_1^{posterior}]]), \cdots, (\x_n, \mu[\Pi[g_n^{posterior}]])$ with heteroskedastic noise standard deviations $\sigma_1 = \sqrt{\V[\Pi[g_1^{posterior}]]}, \cdots, \sigma_n = \sqrt{\V[\Pi[g_n^{posterior}]]}$. Equivalently, we assume $\mu[\Pi[g_i^{posterior}]] = f_x(\x_i) + \epsilon_i$ with Gaussian noise $\epsilon_i \sim \calN(0, \sigma_i^2)$, where $\sigma_i^2 = \V[\Pi[g_i^{posterior}]]$. We then obtain the posterior distribution $f_x^{posterior}$ with mean and covariance
\begin{align}
    \mu[f_x^{posterior}](\x) &= k_\calX(\x, \bX) (k_\calX(\bX, \bX) + \Sigma)^{-1} [\mu[\Pi[g_1^{posterior}]], \cdots, \mu[\Pi[g_n^{posterior}]]]^\top, \\
    \V[f_x^{posterior}](\x, \x') &= k_\calX(\x, \x') - k_\calX(\x, \bX) (k_\calX(\bX, \bX) + \Sigma)^{-1} k_\calX(\bX, \x')
\end{align}
where $\Sigma = \operatorname{diag}(\sigma_1^2, \cdots, \sigma_n^2) \in \R^{n \times n}$.
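
As a consistency check against the ridge regression perspective, consider the limiting case $\Sigma = 0$, i.e.\ every inner Bayesian quadrature estimate has zero posterior variance. This limit is considered here only for illustration. Since $\mu[\Pi[g_i^{posterior}]] = \Phi_i^\top k_\calY(\bY^i, \bY^i)^{-1} g(\bY^i)$, the posterior mean reduces to
\begin{align*}
    \mu[f_x^{posterior}](\x) = k_\calX(\x, \bX) k_\calX(\bX, \bX)^{-1} \left[\Phi_1^\top k_\calY(\bY^1, \bY^1)^{-1} g(\bY^1), \cdots, \Phi_n^\top k_\calY(\bY^n, \bY^n)^{-1} g(\bY^n)\right]^\top
\end{align*}
which coincides with the estimator in \eqref{eq:CBQ_1}. For $\Sigma \neq 0$, the noise term shrinks the weight given to conditioning values $\x_i$ whose inner estimates are more uncertain.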

The estimate of the conditional expectation at $\x'$ is a Gaussian distribution with mean $\mu[f_x^{posterior}](\x')$ and variance (uncertainty estimate) $\V[f_x^{posterior}](\x', \x')$.


% We want to minimize the error
% \begin{align}
%     \norm{C_{YX} - \widehat{C_{YX}}}{\calH_\calY \otimes \calH_\calX}^2
% \end{align}
% where 
% \begin{align*}
%     C_{YX} &= \E_{YX}[k_\calY(Y, \cdot) \otimes k_\calX(X, \cdot)], \\
%     \widehat{C_{YX}} &= \sum_{i=1}^n \sum_{j=1}^{m_i} w_{ij} k_\calY(y_j, \cdot) \otimes k_\calX(x_i, \cdot)
% \end{align*}

\subsection{Worst case error}
Recall that we have defined an RKHS $\calH_\Gamma$ of functions $F: \calX \to \calH_\calY$ with reproducing kernel $k_\Gamma: \calX \times \calX \to \calL(\calH_\calY)$. First, we prove a lemma.

\begin{lemma}
    For any $\x_1, \cdots, \x_m \in \calX$ and any scalars $c_1, \cdots, c_m \in \R$, we have, for all $g \in \calH_\calY$,
    \begin{align*}
        \norm{\sum_{i=1}^m c_i k_\Gamma(\cdot, \x_i)(g)}{\calH_\Gamma} = \sup_{F \in \calH_\Gamma, \norm{F}{} \leq 1} \PSi{\sum_{i=1}^m c_i F(\x_i), g}{\calH_\calY}
    \end{align*}
\end{lemma}
\textbf{Proof:} By the reproducing property, the right hand side can be written as
\begin{align*}
    \sup_{F \in \calH_\Gamma, \norm{F}{} \leq 1} \PSi{\sum_{i=1}^m c_i F(\x_i), g}{\calH_\calY} = \sup_{F \in \calH_\Gamma, \norm{F}{} \leq 1} \PSi{F, \sum_{i=1}^m c_i k_\Gamma(\cdot, \x_i)(g)}{\calH_\Gamma}
\end{align*}
And with the Cauchy-Schwarz inequality, we have
\begin{align*}
    \sup_{F \in \calH_\Gamma, \norm{F}{} \leq 1} \PSi{F, \sum_{i=1}^m c_i k_\Gamma(\cdot, \x_i)(g)}{\calH_\Gamma} \leq \norm{\sum_{i=1}^m c_i k_\Gamma(\cdot, \x_i)(g)}{\calH_\Gamma}
\end{align*}
On the other hand, we choose $F$ to be $F=\frac{1}{C} \sum_{i=1}^m c_i k_\Gamma(\cdot, \x_i)(g)$ with $C$ a normalizing constant such that $\norm{F}{\calH_\Gamma}=1$. Then we have
\begin{align*}
    \sup_{F \in \calH_\Gamma, \norm{F}{} \leq 1} \PSi{F, \sum_{i=1}^m c_i k_\Gamma(\cdot, \x_i)(g)}{\calH_\Gamma} &\geq \PSi{\frac{1}{C} \sum_{i=1}^m c_i k_\Gamma(\cdot, \x_i)(g), \sum_{i=1}^m c_i k_\Gamma(\cdot, \x_i)(g)}{\calH_\Gamma} \\ &= \norm{\sum_{i=1}^m c_i k_\Gamma(\cdot, \x_i)(g)}{\calH_\Gamma}
\end{align*}
Combining the above two inequalities proves the lemma.

We can see from \eqref{eq:sol_1} that our estimator is a weighted combination of the $F_i^\dagger \in \calH_\calY$ with weights $\w(\x) = \left[\w_1(\x), \cdots, \w_n(\x) \right] = k_\calX(\x, \bX) k_\calX(\bX, \bX)^{-1} \in \R^{1 \times n}$, so the lemma gives
\begin{align*}
    \norm{k_\Gamma(\x, \cdot)(g) - \sum_{i=1}^n \w_i(\x) k_\Gamma(\x_i, \cdot)(g)}{\calH_\Gamma} = \sup_{F \in \calH_\Gamma, \norm{F}{} \leq 1} \PSi{F(\x) - \sum_{i=1}^n \w_i(\x) F(\x_i), g}{\calH_\calY}
\end{align*}
Then we have
\begin{align}
    \sup_{g \in \calH_\calY, \norm{g}{}\leq 1}\norm{k_\Gamma(\x, \cdot)(g) - \sum_{i=1}^n \w_i(\x) k_\Gamma(\x_i, \cdot)(g)}{\calH_\Gamma} = \sup_{\substack{g \in \calH_\calY, \norm{g}{}\leq 1, \\ F \in \calH_\Gamma, \norm{F}{} \leq 1}} \PSi{F(\x) - \sum_{i=1}^n \w_i(\x) F(\x_i), g}{\calH_\calY}
\end{align}
Taking the supremum over $g$ first, the right hand side is equal to
\begin{align*}
    RHS = \sup_{F \in \calH_\Gamma, \norm{F}{} \leq 1} \norm{F(\x) - \sum_{i=1}^n \w_i(\x) F(\x_i)}{\calH_\calY}
\end{align*}
Note that $\sum_{i=1}^n \w_i(\x) F(\x_i)$ has exactly the form of our predictor in \eqref{eq:sol_1}, and the norm measures the largest error between the value $F(\x)$ of any function $F$ in the unit ball of $\calH_\Gamma$ and this prediction, i.e., the worst case error.

The left hand side is computable. Recall that $\left[\w_1(\x), \cdots, \w_n(\x) \right] = k_\calX(\x, \bX) k_\calX(\bX, \bX)^{-1}$. Expanding the squared norm with the reproducing property and $k_\Gamma(\x, \x') = k_\calX(\x, \x') Id_{\calH_\calY}$ gives
\begin{align*}
    LHS^2 &= \sup_{g \in \calH_\calY, \norm{g}{}\leq 1} \left( k_\calX(\x, \x) - 2 \sum_{i=1}^n \w_i(\x) k_\calX(\x, \x_i) + \sum_{i, j =1}^n \w_i(\x) \w_j(\x) k_\calX(\x_i, \x_j) \right) \PSi{g, g}{\calH_\calY} \\ &= k_\calX(\x, \x) - 2 \w(\x) k_\calX(\bX, \x) + \w(\x) k_\calX(\bX, \bX) \w(\x)^\top \\
    &= k_\calX(\x, \x) - k_\calX(\x, \bX) k_\calX(\bX, \bX)^{-1} k_\calX(\bX, \x) \in \R
\end{align*}
We see that the final worst case error depends only on $\x$ and $\bX$, and is independent of the samples $\bY^i$ and of $F_i^\dagger$.
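
As an illustration, consider a single conditioning point, $n = 1$. The expression $k_\calX(\x, \x) - k_\calX(\x, \bX) k_\calX(\bX, \bX)^{-1} k_\calX(\bX, \x)$ then becomes
\begin{align*}
    k_\calX(\x, \x) - \frac{k_\calX(\x, \x_1)^2}{k_\calX(\x_1, \x_1)}
\end{align*}
which is nonnegative by the Cauchy-Schwarz inequality in $\calH_\calX$ and vanishes at $\x = \x_1$. If we additionally assume a Gaussian kernel $k_\calX(\x, \x') = \exp(-\norm{\x - \x'}{}^2 / (2\ell^2))$ (an assumption made only for this example, with $\ell$ a hypothetical lengthscale), this equals $1 - \exp(-\norm{\x - \x_1}{}^2 / \ell^2)$, which grows from $0$ towards $1$ as $\x$ moves away from the observed conditioning value.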
