\section{Theoretical Results}\label{sec:theory}


Our main result, \Cref{thm:convergence} below, guarantees that the CBQ estimator converges to $I(\theta)$ as $T$ grows. To derive it, we combine existing results on the convergence of GP interpolation \citep{wynne2021convergence} with results on importance-weighted kernel ridge regression \citep{gogolashvili2023importance}; see~\Cref{appendix:convergence_rate} for the proof.

The result of this theorem depends on the smoothness of the problem. We say that a function has smoothness $s$ if it belongs to the Sobolev space of functions with at least $s$ (weak) derivatives that are square Lebesgue-integrable \citep{adams2003sobolev}. Similarly, we say that a kernel has smoothness $s$ whenever its corresponding RKHS is a space of functions of smoothness $s$. This is, for example, the case for the Mat\'ern-$\nu$ kernel in dimension $d$ with $s= \nu +d/2$, where the kernel is defined as $k_\nu(x,y) = \frac{\eta}{\Gamma(\nu)2^{\nu - 1}} \left(\frac{\sqrt{2 \nu}}{l} \| x - y \|_2 \right)^\nu K_\nu\left(\frac{\sqrt{2 \nu}}{l} \| x-y \|_2\right)$, $K_\nu$ is the modified Bessel function of the second kind, and $\eta,l >0$ are hyperparameters.
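
As a quick sanity check of the Mat\'ern definition above, the following special cases follow from the closed forms of $K_\nu$ at half-integer orders. The shorthand $z_\nu = \frac{\sqrt{2\nu}}{l} \| x - y \|_2$ is introduced here only for this illustration.
\begin{talign*}
    K_{1/2}(z) = \sqrt{\frac{\pi}{2z}} e^{-z}
    \quad &\Rightarrow \quad
    k_{1/2}(x,y) = \eta \exp\left(-z_{1/2}\right), \\
    K_{3/2}(z) = \sqrt{\frac{\pi}{2z}} e^{-z} \left(1 + \frac{1}{z}\right)
    \quad &\Rightarrow \quad
    k_{3/2}(x,y) = \eta \left(1 + z_{3/2}\right) \exp\left(-z_{3/2}\right).
\end{talign*}
With $s = \nu + d/2$, these two kernels have smoothness $s = (1+d)/2$ and $s = (3+d)/2$ respectively.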


% Furthermore, to quantify smoothness in a way specific to the task, we use the \emph{source condition} of~\citep{gogolashvili2023importance} for $r \in [1/2, 1]$. This is a standard condition in the kernel methods literature that compares the smoothness of $I(\theta)$ to the least smooth function in the RKHS of $k_\Theta$: if $k_\Theta$ is not smoother than $I(\theta)$, the condition holds for $r=1/2$, and a larger $r$ implies smoother $I(\theta)$; see~\Cref{appendix:convergence_rate}. 

% \vspace{-2pt}
\begin{theorem}\label{thm:convergence}
Let $x \mapsto f(x, \theta)$ be a function of smoothness $s_f > d/2$, and $\theta \mapsto I(\theta)$ be a function of smoothness $s_I > p/2$. Suppose the following assumptions hold:
\vspace{-2pt}
\begin{enumerate}[itemsep=0.1pt,topsep=0pt,leftmargin=*]
\item [A1] The domains $\calX \subset \R^d$ and $\Theta\subset \R^p$ are open, convex, and bounded.
~\label{as:domains}
\item [A2] The parameters and samples satisfy: 
$\theta_{1:T} \sim \mathbb{Q}$, and $x_{1:N}^t \sim \Pb_{\theta_t}$ for all $t$. 
~\label{as:pars_and_samples}
\item [A3] $\mathbb{Q}$ has density $q$ such that $\inf_{\theta \in \Theta} q(\theta)>0$ and $\sup_{\theta \in \Theta} q(\theta) < \infty$, and each $\Pb_\theta$, $\theta \in \Theta$, has density $p_\theta$, with $\inf_{\theta \in \Theta, x \in \calX} p_{\theta}(x)>0$ and $\sup_{\theta \in \Theta}\|p_{\theta}\|_{\calL^2(\calX)}<\infty$.
~\label{as:densities}
\item [A4] The kernels $k_\calX$ and $k_\Theta$ are 
%Mat\'ern 
of smoothness $s_\calX \in (d/2, s_f]$ and $s_\Theta \in (p/2, s_I]$ respectively.~\label{as:kernels} 
\item [A5] The regularisers satisfy $\lambda_{\calX}=0$ and $\lambda_{\Theta} = \calO(T^{\frac{1}{2}})$.
\end{enumerate}
\vspace{-2pt}
Then, for any $\delta \in (0, 1)$, there exist $T_0(\delta)>0$ and $N_0>0$ such that, for any $N \geq N_0$ and $T \geq T_0(\delta)$, with probability at least $1-\delta$ it holds that
\vspace{-3pt}
\begin{talign*}
        \| \hat I_\mathrm{CBQ}(\theta) - I(\theta) \|_{\calL^2(\Theta)} \leq  C_1(\delta) T^{-\frac{1}{4}} + C_2(\delta) T^{-\frac{3}{4}} N^{-\frac{2s_\calX}{d} + \varepsilon} ,
\end{talign*}
% \begin{talign*}
%         \| \hat I_\mathrm{CBQ}(\theta) - I(\theta) \|_{\calL^2(\Theta)} \leq  C_1(\delta) T^{-\frac{r}{2r+1}} + C_2(\delta) T^{-\frac{r+1}{2r+1}} N^{-\frac{2s_\calX}{d} + \varepsilon} ,
% \end{talign*}
%
for any arbitrarily small $\varepsilon>0$, where the constants $C_1(\delta)=\calO(\log(1/\delta))$ and $C_2(\delta)=\calO((1/\delta^2)\log(1/\delta))$ are independent of $N, T, \varepsilon$.
\end{theorem}
\vspace{-4pt}

We now briefly discuss our assumptions. Many of these were simplified to improve readability, in which case we highlight possible generalisations. A1 is used to guarantee that the points eventually cover the domain, and could straightforwardly be generalised to any open and bounded domain with Lipschitz boundary satisfying an interior cone condition; see \citep{kanagawa2020convergence,wynne2021convergence}. A2 ensures that $\theta_{1:T}$ and $x_{1:N}^t$ cover $\Theta$ and $\calX$, respectively, sufficiently fast in probability as $T$ and $N$ grow. The assumption on the point sets could also be straightforwardly generalised to active learning designs or grids following existing work on BQ convergence \citep{Kanagawa2019,kanagawa2020convergence,wynne2021convergence}. A3 is very weak and ensures that the points will fill $\calX$. A4 guarantees that our first and second stage GPs have the right level of regularity for the problem, although the range of smoothness values could be significantly extended following the approach of \cite{kanagawa2020convergence}. For simplicity, we also implicitly assume that the kernel hyperparameters (such as lengthscales and amplitudes) are known, but this could be extended to estimation in bounded sets; see \citep{Teckentrup2020}. Finally, A5 requires $\lambda_{\calX}=0$, but this could be relaxed at the cost of slowing down convergence (see \Cref{appendix:convergence_rate}). In contrast, letting $\lambda_{\Theta}>0$ grow with $T$ is natural since we work in a bounded domain and we expect the conditioning of the Gram matrix to worsen as $T \rightarrow \infty$.
% As previously discussed, we will take $N$ large but finite, and drive $T \rightarrow \infty$. For this reason, we only need $\lambda_{\Theta}$ to grow whilst $\lambda_{\calX}$ may remain constant or be taken to be zero.

We are now ready to discuss the implications of the theorem.
Firstly, the result holds with high probability, to account for randomness in $\theta_{1:T}$ and $x_{1:N}^t$, and provides a rate of $\calO(T^{-1/4}+ T^{-3/4} N^{- 2 s_\calX/d + \varepsilon})$. Increasing $N$ helps only up to a point (by driving the second term to zero), whereas increasing $T$ is essential to ensure convergence. This is intuitive since we cannot expect to approximate $I(\theta)$ over the whole of $\Theta$ simply by increasing $N$ at a fixed set of points in $\Theta$. Despite this, we will see in \Cref{sec:experiments} that increasing $N$ is essential to improving performance in practice. Unfortunately, the rate in $N$ slows down significantly for large $d$, which indicates that our method is mostly suitable for low- to mid-dimensional problems.
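
As a rough illustration of these two points, write $a = 2 s_\calX/d - \varepsilon$ for the exponent of $N$ in \Cref{thm:convergence} (the symbol $a$ is introduced here only for this illustration). The ratio of the second term of the bound to the first is
\begin{talign*}
    \frac{C_2(\delta) T^{-\frac{3}{4}} N^{-a}}{C_1(\delta) T^{-\frac{1}{4}}} = \frac{C_2(\delta)}{C_1(\delta)} T^{-\frac{1}{2}} N^{-a},
\end{talign*}
which vanishes as $T \rightarrow \infty$ for any fixed $N$, so the first term eventually dominates. For the dependence on $d$, suppose as an example that $k_\calX$ is a Mat\'ern-$\nu$ kernel with $\nu = 3/2$ and that $f$ is smooth enough for A4 to hold, so that $s_\calX = 3/2 + d/2$ and
\begin{talign*}
    \frac{2 s_\calX}{d} = 1 + \frac{3}{d},
\end{talign*}
which equals $4$ for $d=1$, $2$ for $d=3$, and $1.3$ for $d=10$.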

Although the bound is dominated by a term $\calO(T^{-1/4})$ in $T$, we provide a more general result with rate up to $\calO(T^{-1/3})$ under an additional ``source condition'' which requires stronger regularity from $f$; see \Cref{appendix:convergence_rate}. The latter rate is minimax optimal for any nonparametric regression-based method \citep{Stone1982}. We note that we cannot expect a similar result for IS since that method does not apply when $f$ depends on $\theta$. For LSMC, we also cannot guarantee consistency of the algorithm when $I(\theta)$ is not a polynomial (unless $p \rightarrow \infty$; see \cite{stentoft2004convergence}). We are not aware of any results for KLSMC, so we derived a bound for it which is similar to \Cref{thm:convergence}. This bound, presented in \fxb{Appendix XXX}, is of the form $\calO(T^{-\frac{1}{4}}+ T^{-\frac{3}{4}} N^{-\fxb{??}})$, and hence significantly slower than CBQ in $N$.
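
Returning to the source condition, the following sketch shows how it could enter the rate. It assumes that the general bound is parametrised by a source-condition parameter $r \in [1/2, 1]$ in the form below; this form is an assumption made for illustration, and \Cref{appendix:convergence_rate} gives the precise statement.
\begin{talign*}
    \calO\left(T^{-\frac{r}{2r+1}} + T^{-\frac{r+1}{2r+1}} N^{-\frac{2s_\calX}{d} + \varepsilon}\right).
\end{talign*}
Under this assumption, taking $r = 1/2$ gives exponents $1/4$ and $3/4$ in $T$, which recovers \Cref{thm:convergence}, whereas $r = 1$ gives exponents $1/3$ and $2/3$, which matches the rate $\calO(T^{-1/3})$ mentioned above.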
% Although we are not aware of any formal results, we can expect a similar bound for LSMC and KLSMC but with $N^{-1}$ instead of $N^{-\frac{2s_\calX}{d}+\varepsilon}$ (due to the MC integration rate); this explains why our method will outperform these approaches when $s_\calX$ is large relative to $d$. Since the cost of CBQ is $\calO(TN^3+T^3)$, we note that we can take $N=\calO(T^{\frac{2}{3}})$ without increasing the overall cost.

% The rate in $T$ will depend on the smoothness of $I(\theta)$ through the smoothness parameter $r \in [1/2, 1]$, which in turns depends on the smoothenss of the integrand $f$ and the density $p_\theta$ in $\theta$. The smoother these are, the faster the convergence rate will be.  
% Although we are not aware of such a result, we can expect the same rate in $T$ to hold for KLSMC since it is based on kernel ridge regression. On the other hand, LSMC will be inherently limited due to the use of linear or polynomial regression, and we expect it may not be possible to show consistency when $I(\theta)$ is not a polynomial in $\theta$.



