
\section{Introduction}\label{sec:introduction}


\fxb{I think we might want to be even more explicit about what we mean by a conditional expectation. Here we mean that it is an expectation which depends on a conditioning variable, and we want to do a good job at approximating this expectation for any value of the conditioning variable. We should also explain that in those situations, we often only observe data for a few values of the variables on which we condition, but we would ideally want to be able to approximate it for any value of this conditioning variable. }

Computing conditional expectations is a common task in many machine learning applications. In reinforcement learning, the action-value function is the expected accumulated reward given a state and an action~\cite{sutton2018reinforcement}. In causal inference, the conditional average treatment effect is the expected difference between potential outcomes given some covariates~\citep{hernan2010causal}. We focus on computing the conditional expectation of an integrand with respect to a conditional distribution when the number of available evaluations of the integrand is limited, either because (1) each function evaluation requires an expensive computer simulation or (2) observations are rare.
As a result, data-efficient numerical algorithms for computing conditional expectations of expensive integrands from a limited number of samples are increasingly important. \fxb{It is not clear if the examples you gave actually satisfy these conditions. Would be good to comment on that, and to focus solely on examples where this will hold. We also need to convince the reader that this is an important problem to consider.}


\fxb{The paragraph below is quite long - and I am not sure it is really relevant for an introduction on conditional expectations given you say it is not really applicable for this problem. I would suggest you shorten to one sentence, and move the rest to the background. More generally, it is important to keep the introduction short and very much to the point.} \fxb{Overall I would suggest only having a single paragraph for existing methods}
Monte Carlo (MC), the standard approach to approximating integrals, is not well suited to approximating conditional expectations.
First, whereas a standard expectation is a single real number, a conditional expectation is a function of the value taken by the conditioning variable.
Second, even if we are only interested in the conditional expectation at a single value, that value of the conditioning variable is often not among those observed, so naively averaging the available samples leads to a biased estimate.
Third, even when samples are observed at the value of interest, standard MC is inefficient because it ignores samples from the other conditional distributions and therefore uses only a small fraction of the available samples.
These failures of standard MC illustrate why computing a conditional expectation is more challenging than computing a standard expectation.
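
To make these three points concrete, the following sketch uses illustrative notation that is an assumption here rather than notation fixed by the paper: $f$ denotes the integrand, $\theta$ the conditioning variable, $P_\theta$ the conditional distribution with density $p_\theta$, and $x_{t,1},\dots,x_{t,N}$ samples from $P_{\theta_t}$ for each observed value $\theta_1,\dots,\theta_T$.
% Illustrative notation (assumed); \mathbb requires amssymb or amsfonts.
\[
  I(\theta) = \mathbb{E}_{X \sim P_\theta}[f(X)] = \int f(x)\, p_\theta(x)\, \mathrm{d}x,
  \qquad
  \hat{I}_{\mathrm{MC}}(\theta_t) = \frac{1}{N} \sum_{i=1}^{N} f(x_{t,i}).
\]
Under this notation, $\hat{I}_{\mathrm{MC}}(\theta_t)$ uses only $N$ of the $TN$ available evaluations, and it is not defined at any $\theta \notin \{\theta_1,\dots,\theta_T\}$.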

% \fxb{When you make a claim, you should always try to justify it. Why is it a challenging task? Are you saying this is harder than standard expectations? If so, why?} \fxb{Try also to be more precise about your statements: why do we need to compute conditional expectations in reinforcement learning etc? Give concrete examples. (I know you might not know these concrete examples yet, but in that case spend some time doing a literature review to understand the lay of the land)}

% \fxb{I think it would be worth saying that integration methods for standard integrals can be used, but that they might be sub-optimal, and we should explain why. For example, why not use standard MC or importance sampling here? What do we miss out on by using that, and why did people feel the need to develop more advanced methods?}

A natural first alternative is importance sampling~\citep{tokdar2010importance}, in which the other conditional distributions are treated as proposal distributions, so that samples observed under them can be reweighted to correct for the bias.
The drawbacks of importance sampling are its potentially high variance and the fact that it ignores how the other conditional distributions relate to the target distribution.
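
As a sketch under assumed notation (integrand $f$, target conditional density $p_\theta$, and samples $x_{t,1},\dots,x_{t,N}$ drawn from another conditional density $p_{\theta_t}$; none of these symbols are fixed by the text above), the importance sampling estimator of the conditional expectation at $\theta$ can be written as
% Illustrative notation (assumed).
\[
  \hat{I}_{\mathrm{IS}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \frac{p_\theta(x_{t,i})}{p_{\theta_t}(x_{t,i})}\, f(x_{t,i}).
\]
This estimator is unbiased provided $p_{\theta_t}(x) > 0$ wherever $p_\theta(x) f(x) \neq 0$, but its variance can be large when the density ratio takes very large values, which can happen when $\theta_t$ is far from $\theta$.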

\fxb{Again, the paragraph below looks good for a background but is too long and not enough to the point. I would suggest having only one sentence or two on regression methods, and also having a clear explanation of why these are helpful and what their limitations are. Currently the advantages/limitations are not clearly highlighted.}
Regression is another widely used approach to approximating conditional expectations, with a flexible choice of models.
It is well known that the predictor minimizing the mean squared error is the conditional expectation~\citep{mendenhall2012introduction_prob}, so the minimizer of the empirical mean squared error can serve as an estimate of the conditional expectation.
For example, kernel ridge regression~\citep{berlinet2011reproducing}
assumes that the conditional expectation belongs to a reproducing kernel Hilbert space and minimizes a regularized empirical mean squared error.
Polynomial regression~\citep{Alfonsi2022, alfonsi2021multilevel} assumes that the conditional expectation belongs to a family of polynomials.
Over-parameterized models such as deep neural networks are increasingly popular for modeling conditional expectations because of their flexibility, but they only work well when the sample size is large~\citep{hartford2017deepiv}.
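
In symbols, and with notation assumed purely for illustration ($f$ the integrand, $\theta$ the conditioning variable, $(\theta_i, x_i)_{i=1}^{n}$ the observed pairs with $x_i \sim P_{\theta_i}$, $\mathcal{H}$ a reproducing kernel Hilbert space, and $\lambda > 0$ a regularization parameter), the mean squared error characterization of the conditional expectation reads
% Illustrative notation (assumed); \mathbb requires amssymb or amsfonts.
\[
  \mathbb{E}[f(X) \mid \theta] = g^\star(\theta),
  \qquad
  g^\star \in \mathrm{arg\,min}_{g}\; \mathbb{E}\big[ (f(X) - g(\theta))^2 \big],
\]
where the minimization is over square-integrable functions $g$ of $\theta$, and its kernel ridge regression counterpart reads
\[
  \hat{g} \in \mathrm{arg\,min}_{g \in \mathcal{H}}\; \frac{1}{n} \sum_{i=1}^{n} \big( f(x_i) - g(\theta_i) \big)^2 + \lambda \| g \|_{\mathcal{H}}^2 .
\]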
% Regression methods fail to generalize to the setting where multiple samples are observed in one conditional distribution.
% \fxb{Explain at a high-level how these methods use either deep networks or polynomials. What are these models doing? Why is it helping when it comes to estimating the expectations? The reader will probably not know any of these methods so you need a clear high-level description.} 

\fxb{Discussion of traditional quadrature is not needed for an introduction} \fxb{I would suggest starting the paragraph that saying very clearly that we will propose a new Bayesian quadrature algorithm for conditional expectations. Otherwise the reader is not super clear on why you are discussing these.} Quadrature methods are a family of numerical techniques for estimating integrals that have no analytic expression.
Traditional quadrature methods, such as the trapezoidal rule and Bayes--Hermite quadrature~\citep{bayes_hermite1991bayes}, have been studied for decades.
More recently, several works have given classical quadrature methods a probabilistic interpretation under the name Bayesian quadrature~\citep{hennig2015probabilistic}.
Bayesian quadrature first places a prior on the integrand, then conditions on evaluations of the integrand to obtain a posterior, and finally propagates this posterior through the integral to obtain a full distribution over the value of the integral.
For example, Bayesian quadrature with a linear spline prior is equivalent to the trapezoidal rule~\citep{dragomir1999some}.
Bayesian quadrature has been shown to attain faster convergence rates than Monte Carlo when the integrand is sufficiently smooth~\citep{fx_quadrature, kanagawa2020convergence}, and it provides uncertainty quantification.
A frequentist perspective on Bayesian quadrature can be found in \cite{kanagawa2018gaussian}.
Unfortunately, the applicability of Bayesian quadrature is limited because it requires a closed-form expression for the kernel mean embedding, which severely restricts the integrands and conditional distributions that can be handled. \fxb{I would keep this for the limitations section of the paper, or for the bit where you introduce Stein kernels}
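
For concreteness, here is a minimal sketch of the standard Bayesian quadrature estimator, assuming a zero-mean Gaussian process prior with kernel $k$ on the integrand $f$, noise-free evaluations $f(x_1),\dots,f(x_N)$, and a fixed distribution with density $p$. The notation is illustrative and not taken from the text above. Writing $I = \int f(x)\, p(x)\, \mathrm{d}x$, $K$ for the Gram matrix with entries $K_{ij} = k(x_i, x_j)$, and $\mu$ for the kernel mean embedding,
% Illustrative notation (assumed); \mathbb requires amssymb or amsfonts.
\[
  \mu(x) = \int k(x, x')\, p(x')\, \mathrm{d}x',
  \qquad
  \mathbb{E}[I \mid f(x_{1:N})] = \mu(x_{1:N})^\top K^{-1} f(x_{1:N}),
\]
\[
  \mathrm{Var}[I \mid f(x_{1:N})] = \int\!\!\int k(x, x')\, p(x)\, p(x')\, \mathrm{d}x\, \mathrm{d}x' - \mu(x_{1:N})^\top K^{-1} \mu(x_{1:N}).
\]
Both expressions require $\mu$, and the double integral of $k$, in closed form, which is the source of the restriction described above.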

% It has also been extended to the case where multiple related integrals are computed together~\cite{xi2018bayesian}.

We observe that regression methods exploit the structure across values of the conditioning variable, whereas Bayesian quadrature exploits prior information about the integrand.
In this paper, we combine the advantages of these two lines of work and propose a novel conditional Bayesian quadrature approach that demonstrates a faster convergence rate to the true value than baseline methods for computing conditional expectations.
We also propose to use Stein kernels, for which the kernel mean embedding is a constant function; this greatly improves the applicability of Bayesian quadrature.
Our approach also returns a full distribution over the conditional expectation, which quantifies the uncertainty arising from the computation~\cite{hennig2015probabilistic} and is crucial for the robustness and reliability of numerical integration methods in applications.
\fxb{I would add a sentence or two highlighting why your proposed method works well to tackle the two challenges highlighted at the very top of the paper. This would 'close the loop' on the story; i.e. you highlight a problem, explain why existing methods are not good enough, then explain very explicitly why your method is better.}
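
One common construction of a kernel with a constant kernel mean embedding, given here as an illustrative assumption rather than as the exact kernel used later in the paper, is the Langevin Stein kernel built from a base kernel $k$ and a density $p$ on $\mathbb{R}^d$ with score $s(x) = \nabla_x \log p(x)$:
% Illustrative construction (assumed); k and its derivatives are evaluated at (x, x').
\[
  k_0(x, x') = \nabla_x \cdot \nabla_{x'} k + (\nabla_x k)^\top s(x') + (\nabla_{x'} k)^\top s(x) + k\, s(x)^\top s(x').
\]
Under suitable regularity and tail conditions on $k$ and $p$, $\int k_0(x, x')\, p(x)\, \mathrm{d}x = 0$ for all $x'$, so the kernel $c + k_0$ with a constant $c > 0$ has the constant kernel mean embedding $c$. Only the score $s$ is needed, so $p$ may be known only up to a normalizing constant.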

We summarize our contributions as follows:
\begin{itemize}
    \item We propose conditional Bayesian quadrature, which shows a faster convergence rate than baseline methods.
    \item Conditional Bayesian quadrature provides a quantification of epistemic uncertainty.
    \item We use Stein kernels to allow for a richer class of integrands and conditional distributions, which improves the applicability of all Bayesian quadrature methods.
\end{itemize}


