

\section{Background}\label{sec:background}

Need to be explicit that $f(x,\theta)$ is in $L^2$ for all $\theta$.



\fxb{Here is a list of applications: Quantiles of conditional expectations: \cite{Lee1998}, conditional Monte Carlo \cite{Lindqvist2022}}



Let $X$ and $Y$ be random variables taking values in the measurable spaces $\calX$ and $\calY$, respectively, and let $f: \calY \to \R$ be a measurable function with finite second moment, i.e. $\E[f(Y)^2] < \infty$.
We assume that the joint distribution of $(X, Y)$ has a known density $p(x, y)$, with marginal densities $p(x)$ and $p(y)$.
We also assume that the conditional density $p(y|x)$ exists for all $x\in \calX$.
We are interested in computing the conditional expectation of $f(Y)$ at any given value $X = x$:
\begin{align}
    \Pi[f](x) = \E[f(Y)|X = x] = \int_\calY f(y) p(y|x) dy, \quad g(x) \equiv \Pi[f](x)
\end{align}
In what follows, we write $\Pi[f](x)$ to emphasize that the conditional expectation is an integral operator applied to $f$, and $g(x)$ to emphasize that it is a function $g:\calX \to \R$.

In our setting, we consider the more general case where multiple samples of $Y$ are observed for each $X=x_i$. The observations are denoted $\{(x_i, y_{i, 1}, \cdots, y_{i, s_i})\}_{i=1}^n$ with $x_1, \cdots, x_n \sim p(x)$ and $y_{i,1}, \cdots, y_{i,s_i} \sim p(y|x_i)$ for each $i$, and $s = \sum_{i=1}^n s_i$ is the total number of samples.
We introduce vectorized notation by writing $\bX = [x_1, \cdots, x_n]^\top \in \R^n$, $\bY^i = [y_{i, 1}, \cdots, y_{i, s_i}]^\top \in \R^{s_i}$ and $\bY = [{\bY^1}^\top, \cdots, {\bY^n}^\top]^\top \in \R^s$.
The vectors $f(\bY^i) \in \R^{s_i}$ and $f(\bY) \in \R^s$ are obtained by applying $f$ elementwise.

Since standard Monte Carlo is not applicable in our setting (we have no samples from $p(y|x)$ at an arbitrary query point $x$), importance sampling is probably the first method that comes to mind for computing conditional expectations.
Least-squares Monte Carlo is a widely used approach for estimating conditional expectations in option pricing~\cite{longstaff2001valuing}, and regression and quadrature methods are also widely used.
In the remainder of this section, we introduce these methods and then motivate our approach.


\cite{Sun2023}

\subsection{Importance sampling}
In importance sampling, the conditional expectation is rewritten as an expectation under $p(y|x_i)$ and approximated with the samples drawn at $x_i$:
\begin{align}
\begin{aligned}
    \E[f(Y)|X=x] = \int_\calY f(y)p(y|x)dy = \int_\calY p(y|x_i) f(y)\frac{p(y|x)}{p(y|x_i)} dy \approx \frac{1}{s_i} \sum_{j=1}^{s_i} f(y_{i,j})\frac{p(y_{i,j}|x)}{p(y_{i,j}|x_i)}
\end{aligned}
\end{align}
Although importance sampling can reuse the samples $y_{i,j}$ drawn from another conditional distribution $p(y|x_i)$ and yields an unbiased estimator, it has several drawbacks. First, it does not exploit structural information about the space $\calX$, such as the similarity between $p(y|x)$ and $p(y|x_i)$ when $x$ and $x_i$ are close.
Second, it requires evaluating the density $p(y|x)$, whose exact form is unknown in most cases, so importance sampling is not widely applicable.
Third, importance sampling estimators often have very high variance.
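
To illustrate the variance issue, consider a toy example that is not part of our setting: assume $\calY = \R$ and $p(y|x) = \mathcal{N}(y; x, 1)$. Under this assumption, the importance weight $w(y) = p(y|x) / p(y|x_i)$ satisfies
\begin{align}
    w(y) = \exp\left((x - x_i) y - \frac{x^2 - x_i^2}{2}\right), \quad \E_{y \sim p(\cdot|x_i)}[w(y)] = 1, \quad \mathbb{V}_{y \sim p(\cdot|x_i)}[w(y)] = \exp\left((x - x_i)^2\right) - 1,
\end{align}
so in this example the variance of the weights grows exponentially in the squared distance between $x$ and $x_i$.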
% Pareto smoothed importance sampling~\cite{vehtari2015pareto} stabilizes the estimate by modelling the tail of importance ratios.
% Black box importance sampling~\cite{liu2017black} uses a stein kernelized approach so it doesn't require the density form of conditional distribution $p(y|x)$.



\subsection{Bayesian Quadrature}


\cite{Sun2021}

Bayesian quadrature (BQ) has a long history: it was originally introduced in \citep{OHagan1991BayesHermiteQ} and was recently revisited by \cite{fx_quadrature}.
BQ aims to find the optimal weight $w_{i,j}^{BQ}$ for every function evaluation $f(y_{i,j})$, so that
$\Pi[f](x)$ can be approximated by
\begin{align}\label{eq:weights}
    \widehat{\Pi[f]}(x) = \sum_{i=1}^n \sum_{j=1} ^{s_i} w_{i, j}^{BQ} f(y_{i, j})
\end{align}
Note that, in our setting, choosing uniform weights $w_{i, j} = \frac{1}{s}$ does not recover the standard Monte Carlo estimator, because the samples $y_{i, j}$ are not drawn from the conditional distribution $p(y|x)$.
As a result, extensions of Monte Carlo such as quasi-Monte Carlo (QMC) and randomized QMC~\citep{owen2000monte} are also not applicable.
We introduce BQ from the Gaussian process regression perspective, following \citep{fx_quadrature} with slight modifications to adapt it to our setting of computing conditional expectations.

We consider a Gaussian process $\mathfrak{f}:\calY \times \Omega \to \R$, which is characterized by its mean function $m(y)=\E_\omega[\mathfrak{f}(y, \omega)]$ and its covariance function $k_\calY(y, y') = \E_\omega[(\mathfrak{f}(y, \omega) - m(y))(\mathfrak{f}(y', \omega) - m(y'))]$. We assume that a Gaussian process prior is placed on the integrand $f$, i.e. $f \sim \GP(0, k_\calY)$. Throughout this paper, we assume without loss of generality that all Gaussian process priors have zero mean. We can then obtain the posterior $\bar{f}$ conditioned on all $s$ observations $\{f(y_{i, j})\}$, which is also a Gaussian process, with mean $\bar{m}$ and covariance $\bar{c}$ of the form~\citep{GPML}:
\begin{align}
\begin{aligned}
\bar{m}(y) & = k_\calY(y, \bY) k_\calY(\bY, \bY)^{-1} f(\bY) \\
\bar{c}(y, y') & = k_\calY(y, y') -k_\calY(y, \bY) k_\calY(\bY, \bY)^{-1} k_\calY(\bY, y')
\end{aligned}
\end{align}

Since linear functionals of a Gaussian process are Gaussian, the integral of the posterior $\bar{f}$ with respect to $p(y|x)$ follows a Gaussian distribution with mean and variance:
\begin{align}\label{eq:bq}
\begin{aligned}
\widehat{\Pi[f]}(x) & = \E_\omega \left[ \Pi[\bar{f}(\cdot, \omega)] (x) \right] = \int_\calY \bar{m}(y) p(y|x) dy = \Phi^\top k_\calY(\bY, \bY)^{-1} f(\bY) \\
\mathbb{V}_\omega \left[ \Pi[\bar{f}(\cdot, \omega)] (x) \right] & = \int_\calY \int_\calY \bar{c}(y, y') p(y|x)p(y'|x) dydy' = \varphi - \Phi^\top k_\calY(\bY, \bY)^{-1} \Phi
\end{aligned}
\end{align}
where $\Phi = \int_\calY k_\calY(\bY, y) p(y|x) dy \in \R^s$ and $\varphi = \int_\calY \int_\calY k_\calY(y, y') p(y|x) p(y'|x) dydy' \in \R$. In the remainder of the paper, we abbreviate $\E_\omega \left[ \Pi[\bar{f}(\cdot, \omega)] (x) \right]$ as $\E \left[ \Pi[\bar{f}] (x) \right]$.
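
As an illustration of when $\Phi$ and $\varphi$ are available in closed form, suppose (purely for the sake of example) that $\calY = \R$, that $k_\calY(y, y') = \exp\left(-(y - y')^2 / (2\ell^2)\right)$ is a Gaussian kernel, and that $p(y|x) = \mathcal{N}(y; \mu_x, \sigma_x^2)$. The lengthscale $\ell > 0$, the mean $\mu_x$ and the variance $\sigma_x^2$ are symbols introduced here for this example only. Under these assumptions, standard Gaussian integrals give the entry of $\Phi$ associated with the sample $y_{i,j}$, together with $\varphi$, as
\begin{align}
    \Phi_{i,j} = \frac{\ell}{\sqrt{\ell^2 + \sigma_x^2}} \exp\left(-\frac{(y_{i,j} - \mu_x)^2}{2(\ell^2 + \sigma_x^2)}\right), \quad \varphi = \frac{\ell}{\sqrt{\ell^2 + 2\sigma_x^2}}.
\end{align}
Outside of such kernel and distribution pairs, these integrals are in general not analytically tractable.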

The mean in Equation \eqref{eq:bq} is the BQ estimate of the conditional expectation, so the optimal BQ weights in Equation \eqref{eq:weights} are $\left[w^{BQ}_{1,1}, \cdots, w^{BQ}_{1, {s_1}}, \cdots, w^{BQ}_{n, {1}}, \cdots, w^{BQ}_{n, {s_n}}\right]^\top = k_\calY(\bY, \bY)^{-1} \Phi \in \R^s$. 

BQ provides a simple closed-form expression for both the estimate and its uncertainty, so it has been widely used in various scenarios.
In reinforcement learning, BQ can provide uncertainty estimates that can be incorporated into the acquisition function to enable more efficient exploration~\citep{paul2018alternating}.
BQ has also been extended to the multi-output case to compute multiple related integrals~\citep{xi2018bayesian, gessner2020active}, to multi-fidelity models~\citep{li2022multilevel} and to Riemannian data manifolds~\citep{frohlich2021bayesian_riemann}.


However, BQ has several practical limitations that restrict its applicability. Firstly, BQ requires the inversion of $k_\calY(\bY, \bY) \in \R^{s \times s}$ with $s = \sum_{i=1}^n s_i$. If all $s_i$ are of similar size, the resulting complexity $\calO(s^3) = \calO(n^3 s_i^3)$ is prohibitive. Secondly, BQ requires analytic forms of $\Phi$ and $\varphi$ (the kernel mean embedding~\citep{muandet2017kernel} of $p(y|x)$ evaluated at the samples, and its integral against $p(y|x)$), which is a very restrictive requirement in practice. Thirdly, for large models such as energy-based models, the conditional density $p(y|x)$ is only known up to its normalizing constant, which makes an analytic form of $\Phi$ generally unavailable.
In Section 3, we propose conditional Bayesian quadrature, which addresses these drawbacks of standard Bayesian quadrature.

\subsection{Regression}
From a regression perspective, the conditional expectation $\E[f(Y)|X]$ is the optimal predictor, in the sense of minimizing the mean squared error, among all square-integrable functions of $X$ (denoted $L^2(X)$)~\citep{granger2014forecasting}:
\begin{align}
    \E[f(Y)|X] = \argmin_{h \in L^2(X)} \E\left[(f(Y) - h(X))^2\right]
\end{align}
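
A short derivation makes this optimality explicit. Writing $g(X) = \E[f(Y)|X]$, any $h \in L^2(X)$ satisfies
\begin{align}
    \E\left[(f(Y) - h(X))^2\right] = \E\left[(f(Y) - g(X))^2\right] + \E\left[(g(X) - h(X))^2\right],
\end{align}
because the cross term $2\E\left[(f(Y) - g(X))(g(X) - h(X))\right]$ vanishes by the tower property: conditionally on $X$, the factor $g(X) - h(X)$ is fixed and $\E[f(Y) - g(X)|X] = 0$. The first term does not depend on $h$, so the mean squared error is minimized exactly when $h(X) = g(X)$ almost surely.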

Therefore, the predictor $h^\ast$ that minimizes the empirical mean squared error can be regarded as an approximation of the conditional expectation. In our setting, we observe multiple samples $\{y_{i,j}\}_{j=1}^{s_i}$ from the conditional distribution $p(y|x_i)$, which differs from the usual regression setting where pairs $(x_i, y_i)$ are drawn from the joint distribution (i.e. $s_i=1$ for all $i$).
A straightforward approach is to average the function evaluations $\{f(y_{i, j})\}_{j=1}^{s_i}$ for each $x_i$ and to regress these averages on the $x_i$; this approach is widely used in option pricing under the name least-squares Monte Carlo~\citep{Alfonsi2022, alfonsi2021multilevel, longstaff2001valuing}:
\begin{align}
     h^\ast = \argmin_{h \in \calH} \sum_{i=1}^n  \left(\frac{1}{s_i} \sum_{j=1}^{s_i} f(y_{i, j}) - h(x_i)\right)^2 + \lambda \norm{h}{\calH}^2, \quad \widehat{\Pi[f]}(x) = h^\ast(x)
\end{align}
The regularization term, weighted by the constant $\lambda$, controls the complexity of the fitted function.

For parametric regression, $h$ is parameterized by a vector $\theta$ and $\calH$ is the family of such functions $h$. For polynomial functions of order $t$, $\theta$ is a $t$-dimensional vector. For deep neural networks, $\theta$ collects the weights and biases of all network layers. For parametric models, the function norm $\norm{h}{\calH}$ in the complexity penalty is usually replaced by the parameter norm $\norm{\theta}{}$. The optimal $\theta^\ast$ is typically found by applying gradient-based methods such as stochastic gradient descent (SGD) to minimize the empirical mean squared error.

For non-parametric regression such as kernel ridge regression (KRR), the family of functions $\calH$ is a reproducing kernel Hilbert space (RKHS) associated with a reproducing kernel $k_\calX: \calX \times \calX \to \R$. By the representer theorem, the optimal $h^\ast$ can be represented as a finite linear combination of the feature maps $k_\calX(x_i, \cdot)$. As a result, the optimal solution $h^\ast$ has an analytic expression~\cite{berlinet2011reproducing}.
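
For concreteness, a standard form of this solution can be sketched as follows, assuming $\lambda > 0$ and using the averaged targets from the least-squares objective above. The vector $\mathbf{z} \in \R^n$ and the identity matrix $I_n \in \R^{n \times n}$ are notation introduced here for illustration only:
\begin{align}
    h^\ast(x) = k_\calX(x, \bX) \left(k_\calX(\bX, \bX) + \lambda I_n\right)^{-1} \mathbf{z}, \quad z_i = \frac{1}{s_i} \sum_{j=1}^{s_i} f(y_{i, j}).
\end{align}
Evaluating this expression requires solving an $n \times n$ linear system, whose size depends on the number of locations $n$ rather than on the total number of samples $s$.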

% Under Riesz representer theorem, $\widehat{\Pi_{f, reg}}$ has a closed-form solution
% \begin{align}
%     \widehat{\Pi_{f, reg}}(x) = h^\ast(x) = k_\calX(x, \bar{\bX}) (k_\calX(\bar{\bX}, \bar{\bX}) + \lambda \calI)^{-1} f(\bY)
% \end{align}
% where $\bar{\bX} = [\underbrace{x_1, \cdots, x_1}_{m_1}, \cdots, \underbrace{x_n, \cdots, x_n}_{m_n}] \in \R^s$.

Regression methods typically have lower time complexity than Bayesian quadrature. Polynomial regression has complexity $\calO(t^3)$ and kernel ridge regression has complexity $\calO(n^3)$. If gradient-based methods are used, the time complexity can be as low as $\calO(n)$ thanks to automatic differentiation. However, regression methods have several drawbacks. Firstly, they typically do not quantify the uncertainty of the estimate. Secondly, the regression literature pays little attention to the setting where multiple $y_{i,j}$ are observed for a single $x_i$. Thirdly, simply averaging the evaluations $f(y_{i,j})$ is less data-efficient than Bayesian quadrature, as we will show in Section \ref{sec:cbq}.

We note that the quadrature literature and the regression literature have approached the same target, the estimation of conditional expectations, with very different ideas. The Bayesian quadrature community exploits prior information on $\calY$ by introducing a covariance function $k_\calY$, while the (kernel) regression community exploits prior information on $\calX$ by introducing a reproducing kernel $k_\calX$ or by using sufficiently flexible models such as deep neural networks. In the next section, we show that a new estimator that exploits prior beliefs on both $\calX$ and $\calY$ greatly improves performance in terms of convergence speed.

\subsection{Stein kernel}

The Stein operator was first introduced in \cite{stein1972bound} and has been widely used in machine learning \cite{liu2020finding, chwialkowski2016kernel, anastasiou2023stein, liu2016stein}. 
Suppose we have a distribution with density $p(x)$ and a differentiable function $f(x)$ such that $\lim_{\|x\| \to \infty} p(x)f(x) = 0$.
We can then define the Stein operator $T_p$ acting on $f$ and obtain the Stein identity:
\begin{align}
    T_p[f](x) = f(x) \nabla_x \log p(x)+\nabla_x f(x), \quad \E_p[T_p[f](x)] = 0
\end{align}
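
The identity follows from integration by parts. As a sketch in the scalar case $\calX = \R$ (an assumption made here only to keep the notation light),
\begin{align}
    \E_p[T_p[f](x)] = \int_\R \left( f(x) \frac{p'(x)}{p(x)} + f'(x) \right) p(x) dx = \int_\R \frac{d}{dx}\left[ p(x) f(x) \right] dx = 0,
\end{align}
where the last equality uses the boundary condition on $p(x)f(x)$. For example, with $p = \mathcal{N}(0, 1)$ and $f(x) = x$, we get $T_p[f](x) = 1 - x^2$, whose expectation under $p$ is indeed zero.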

As a result, for any positive definite kernel $k: \calX \times \calX \to \R$, we can obtain a Stein kernel $k_p$ by applying the Stein operator to both arguments of $k$:
\begin{align}
\begin{aligned}
    k_p(x, x') = T_p^x[T_p^{x'}[k(x, x')]] = &\nabla_x \log p(x) k(x, x' ) \nabla_{x'} \log p(x') + \nabla_x \log p(x) \nabla_{x'} k(x, x')\\ 
    &+ \nabla_{x'} \log p(x') \nabla_x k(x, x') + \nabla_x \nabla_{x'} k(x, x')
\end{aligned}
\end{align}

It has been shown that the Stein kernel is positive definite and universal. The Stein identity implies that the kernel mean embedding of $p$ under the Stein kernel is zero, i.e. $\int_\calX k_p(x, x')p(x) dx = 0$ for all $x'$, which is very useful for Bayesian quadrature in Equation \eqref{eq:bq}, as we will show next. Similar ideas have also been used in \cite{oates2017control, liu2016stein, liu2017black}.