\section{Theoretical Results}\label{sec:theory}

Our main theoretical result, \Cref{thm:convergence} below, guarantees that CBQ recovers the true value of $I(\theta)$ as $N$ and $T$ grow. The rate obtained depends on the smoothness of the problem. We say that a function has smoothness $s$ if it belongs to the Sobolev space $\calW^{s, 2}$ of functions with at least $s$ (weak) derivatives that are square Lebesgue-integrable \citep{adams2003sobolev}. For a multi-index $\alpha = (\alpha_1, \dots, \alpha_p) \in \mathbb{N}^p$, we denote by $D_\theta^\alpha f$ the weak derivative of order $|\alpha|=\sum_{i=1}^p \alpha_i$ of a function $f$ on $\Theta$. Similarly, we say that a kernel has smoothness $s$ whenever its corresponding RKHS is a space of functions of smoothness $s$. This is for example the case of the Mat\'ern-$\nu$ kernel in dimension $d$, for which $s= \nu +d/2$, defined as $k_\nu(x,y) = \frac{\eta}{\Gamma(\nu)2^{\nu - 1}} \left(\frac{\sqrt{2 \nu}}{l} \| x - y \|_2 \right)^\nu K_\nu\left(\frac{\sqrt{2 \nu}}{l} \| x-y \|_2\right)$, where $K_\nu$ is the modified Bessel function of the second kind and $\eta,l >0$ are hyperparameters. 
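
To make the kernel definition concrete, the following special cases may help; they are a worked illustration obtained from the standard closed forms of $K_\nu$ at half-integer orders, and are not specific to our setting. Writing $r = \| x - y \|_2$ and using $K_{1/2}(z) = \sqrt{\pi/(2z)}\, e^{-z}$ and $K_{3/2}(z) = \sqrt{\pi/(2z)}\, e^{-z} (1 + 1/z)$, the definition of $k_\nu$ reduces to
\begin{align*}
    k_{1/2}(x,y) &= \eta \exp\left(-\frac{r}{l}\right), & s &= \frac{1}{2} + \frac{d}{2}, \\
    k_{3/2}(x,y) &= \eta \left(1 + \frac{\sqrt{3}\, r}{l}\right) \exp\left(-\frac{\sqrt{3}\, r}{l}\right), & s &= \frac{3}{2} + \frac{d}{2}.
\end{align*}
For instance, in dimension $d=2$ these two kernels have smoothness $s = 3/2$ and $s = 5/2$ respectively.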


% Furthermore, to quantify smoothness in a way specific to the task, we use the \emph{source condition} of~\citep{gogolashvili2023importance} for $r \in [1/2, 1]$. This is a standard condition in the kernel methods literature that compares the smoothness of $I(\theta)$ to the least smooth function in the RKHS of $k_\Theta$: if $k_\Theta$ is not smoother than $I(\theta)$, the condition holds for $r=1/2$, and a larger $r$ implies smoother $I(\theta)$; see~\Cref{appendix:convergence_rate}. 

% \vspace{-2pt}
\begin{theorem}\label{thm:convergence}
Let $x \mapsto f(x, \theta)$ be a function of smoothness $s_f > d/2$, and $\theta \mapsto f(x, \theta)$ be a function of smoothness $s_I > p/2$ such that $\sup_{\theta \in \Theta} \max_{|\alpha|\leq s_I} \| D_\theta^\alpha f(\cdot, \theta) \|_{\calW^{s_f, 2}(\calX)}<\infty$. Suppose the following assumptions hold:
\vspace{-2pt}
\begin{enumerate}[itemsep=0.1pt,topsep=0pt,leftmargin=*]
\item [A1] The domains $\calX \subset \R^d$ and $\Theta\subset \R^p$ are open, convex, and bounded.
~\label{as:domains}
\item [A2] The parameters and samples satisfy: 
$\theta_{1:T} \sim \mathbb{Q}$, and $x_{1:N}^t \sim \Pb_{\theta_t}$ for all $t \in \{1,\ldots, T\}$. 
~\label{as:pars_and_samples}
\item [A3] $\mathbb{Q}$ has density $q$ such that $\inf_{\theta \in \Theta} q(\theta)>0$ and $\sup_{\theta \in \Theta} q(\theta) < \infty$, and $\Pb_\theta$ has density $p_\theta$ such that $\theta \mapsto p_\theta(x)$ is of smoothness $s_I > p/2$, $\inf_{\theta \in \Theta, x \in \calX} p_{\theta}(x)>0$, and $\sup_{\theta \in \Theta}\max_{|\alpha|\leq s_I} \|D_\theta^\alpha p_\theta(\cdot)\|_{\calL^\infty(\calX)}<\infty$.
~\label{as:densities}
\item [A4] The kernels $k_\calX$ and $k_\Theta$ are 
%Mat\'ern 
of smoothness $s_\calX \in (d/2, s_f]$ and $s_\Theta \in (p/2, s_I]$ respectively.~\label{as:kernels} 
\item [A5] The regularisers satisfy $\lambda_{\calX}=0$ and $\lambda_{\Theta} = \calO(T^{\frac{1}{2}})$.
\end{enumerate}
\vspace{-2pt}
Then, for any $\delta \in (0, 1)$, there is an $N_0>0$ such that, for any $N \geq N_0$, with probability at least $1-\delta$ it holds that
\begin{align*}
    \left\| \hat I_\mathrm{CBQ} - I \right\|_{\calL^2(\Theta)}
    \leq  C_0(\delta) N^{-\frac{s_\calX}{d} + \varepsilon} + C_1(\delta) T^{-\frac{1}{4}}  ,
\end{align*}
% \begin{align*}
%         \| \hat I_\mathrm{CBQ}(\theta) - I(\theta) \|_{\calL^2(\Theta)} \leq  C_1(\delta) T^{-\frac{r}{2r+1}} + C_2(\delta) T^{-\frac{r+1}{2r+1}} N^{-\frac{2s_\calX}{d} + \varepsilon} ,
% \end{align*}
%
for any arbitrarily small $\varepsilon>0$, where the constants $C_0(\delta)=\calO(1/\delta)$ and $C_1(\delta)=\calO(\log(1/\delta))$ are independent of $N$, $T$ and $\varepsilon$.
\end{theorem}
To prove the result, we represent the CBQ estimator as a \emph{noisy importance-weighted kernel ridge regression} (NIW-KRR) estimator. We then extend the convergence results for the \emph{noise-free} IW-KRR estimator established in~\citet[Theorem 4]{gogolashvili2023importance} to bound the Stage 2 error in terms of the Stage 1 error, which we in turn bound via results on the convergence of GP interpolation from \cite{wynne2021convergence}. See~\Cref{appendix:convergence_rate} for the detailed proof.

We now briefly discuss our assumptions. Many of these were simplified to improve readability, in which case we highlight possible generalisations. A1 is used to guarantee that the points eventually cover the domain, and could straightforwardly be generalised to any open and bounded domain with Lipschitz boundary satisfying an interior cone condition; see \cite{kanagawa2020convergence,wynne2021convergence}. A2 ensures that $\theta_{1:T}$ and $x_{1:N}^t$ cover $\Theta$ and $\calX$ respectively sufficiently fast in probability as $T$ and $N$ grow. The assumption on the point sets could also be straightforwardly generalised to active learning designs or grids following existing work on BQ convergence \citep{Kanagawa2019,kanagawa2020convergence,wynne2021convergence}. A3 ensures that the points will fill $\calX$. A4 guarantees that our first and second stage GPs have the right level of regularity for the problem, although the range of smoothness values could be significantly extended following the approach of \cite{kanagawa2020convergence}. For simplicity, we also implicitly assume that the kernel hyperparameters (such as lengthscales and amplitudes) are known, but this could be extended to estimation in bounded sets; see \cite{Teckentrup2020}. Finally, A5 requires $\lambda_{\calX}=0$, but this could be relaxed at the cost of slowing down convergence (see \Cref{appendix:convergence_rate}). In contrast, letting $\lambda_{\Theta}>0$ grow with $T$ is natural since we work in a bounded domain and we expect the conditioning of the Gram matrix to become worse as $T \rightarrow \infty$. 
% As previously discussed, we will take $N$ large but finite, and drive $T \rightarrow \infty$. For this reason, we only need $\lambda_{\Theta}$ to grow whilst $\lambda_{\calX}$ may remain constant or be taken to be zero.


We are now ready to discuss the implications of the theorem.
The result is expressed in probability to account for randomness in $\theta_{1:T}$ and $x_{1:N}^t$, and provides a rate of $\calO(T^{-1/4}+ N^{- s_\calX/d + \varepsilon})$. We can see that growing $N$ will only help up to some extent (as the second term approaches zero fast), but that growing $T$ is essential to ensure convergence. This is intuitive since we cannot expect to approximate $I(\theta)$ uniformly simply by increasing $N$ at some fixed points in $\Theta$. 
Despite this, we will see in \Cref{sec:experiments} that increasing $N$ is essential to improving performance in practice. The rate in $N$ will typically be very fast for smooth targets, but is significantly slowed down for large $d$, demonstrating that our method is mostly suitable for low- to mid-dimensional problems, a feature shared by Bayesian quadrature-based algorithms~\citep{fx_quadrature, frazier2018bayesian}.
There have been some attempts to scale BQ/CBQ to high dimensions, for example in Section 5.4 of \cite{fx_quadrature}, where the integrand is decomposed into a sum of low-dimensional functions. However, this is only possible in limited settings where the integrand has certain forms of sparsity. 
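
As a rough illustration of the trade-off between $N$ and $T$ (a heuristic calculation that ignores the constants $C_0(\delta)$ and $C_1(\delta)$, the threshold $N_0$, and the arbitrarily small $\varepsilon$), the two terms of the bound in \Cref{thm:convergence} are of the same order when
\begin{align*}
    N^{-\frac{s_\calX}{d}} \asymp T^{-\frac{1}{4}}
    \quad \Longleftrightarrow \quad
    N \asymp T^{\frac{d}{4 s_\calX}} .
\end{align*}
Since A4 requires $s_\calX > d/2$, the exponent satisfies $d/(4 s_\calX) < 1/2$, so under this heuristic $N$ growing like $T^{1/2}$ already suffices for the $T^{-1/4}$ term to dominate. For example, with $d=2$ and $s_\calX = 5/2$, the balance is reached at $N \asymp T^{1/5}$. This concerns the asymptotic bound only and does not account for constants.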

\begin{figure*}[t]
\vspace{-10pt}
    \centering
    \begin{minipage}{\textwidth}
    \centering
    \includegraphics[width=320pt]{figures/legend.pdf}
    \vspace{-7pt}
    \end{minipage}
    
    \centering
    \begin{subfigure}{0.33\textwidth}
        \centering
        \hspace{-10pt}
        \includegraphics[width=1.0\linewidth]{figures/bayes_sensitivity_N_50.pdf}
        % \caption{RMSE with fixed $N$}
        \label{fig:bayes_sensitivity_1}
    \end{subfigure}%
    \hfill % Add horizontal space between the subfigures
    % Second plot
    \begin{subfigure}{0.33\textwidth}
        \centering
        \hspace{-10pt}
        \includegraphics[width=1.0\linewidth]{figures/bayes_sensitivity_T_50.pdf}
        % \caption{RMSE with fixed $T$}
        \label{fig:bayes_sensitivity_2}
    \end{subfigure}%
    \hfill % Add horizontal space between the subfigures
    % Third plot
    \begin{subfigure}{0.33\textwidth}
        \centering
        \hspace{-10pt}
        \includegraphics[width=1.0\linewidth]{figures/bayes_sensitivity_dimensions.pdf}
        % \caption{RMSE with increasing $D$}
        \label{fig:bayes_sensitivity_3}
    \end{subfigure}
    \vspace{-3pt}
    \caption{\emph{Bayesian sensitivity analysis for linear models.} \textbf{Left:} RMSE of all methods when $d=2$ and $N=50$. \textbf{Middle:} RMSE of all methods when $d=2$ and $T=50$. \textbf{Right:} RMSE of all methods when $N=T=100$.}
    \label{fig:bayes_sensitivity}
\end{figure*}


Although the bound is dominated by a term $\calO(T^{-1/4})$ in $T$, the proof can be extended to provide a more general result with rate up to $\calO(T^{-1/3})$ under an additional ``source condition'' which requires stronger regularity from $f$; this is further discussed in~\Cref{appendix:convergence_rate}. 
The latter rate is minimax optimal for any nonparametric regression-based method \citep{Stone1982}. Compared to baselines, we note that we cannot expect a similar result for IS since IS does not apply when $f$ depends on $\theta$. 
For LSMC, we also cannot guarantee consistency of the algorithm when $I(\theta)$ is not a polynomial (unless the polynomial order tends to infinity; see \cite{stentoft2004convergence}). 
Although we are not aware of any such result, we expect KLSMC to have the same rate in $T$ as CBQ, and CBQ to be significantly faster than KLSMC in $N$. This is because the second stage of KLSMC is essentially the same as that of CBQ, whereas KLSMC uses MC rather than BQ in the first stage: by~\cite{novak1988deterministic}, the convergence rate of BQ, $N^{-s_\calX/d}$, is faster than that of MC, $N^{-1/2}$, whenever the function $x \mapsto f(x, \theta)$ is of smoothness at least $s_\calX > d/2$.
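
To illustrate the gap between these two rates, consider as a hypothetical example (ignoring constants and the arbitrarily small $\varepsilon$) the case $d=2$ with $s_\calX = 5/2$:
\begin{align*}
    \text{BQ: } N^{-\frac{s_\calX}{d}} = N^{-\frac{5}{4}},
    \qquad
    \text{MC: } N^{-\frac{1}{2}} .
\end{align*}
Halving the first-stage error then requires multiplying $N$ by $2^{4/5} \approx 1.74$ under the BQ rate, against $2^{2} = 4$ under the MC rate. The advantage shrinks as $d$ grows with $s_\calX$ fixed, and vanishes as $s_\calX/d$ approaches $1/2$.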





%We are not aware of any results for KLSMC, so derived a bound for it which is similar to \Cref{thm:convergence}. This bound, presented in \fxb{Appendix XXX}, is of the form $\calO(T^{-\frac{1}{4}}+ T^{-\frac{3}{4}} N^{-\fxb{??}})$, and hence significantly slower than CBQ in $N$. 

% Although we are not aware of any formal results, we can expect a similar bound for LSMC and KLSMC but with $N^{-1}$ instead of $N^{-\frac{2s_\calX}{d}+\varepsilon}$ (due to the MC integration rate); this explains why our method will outperform these approaches when $s_\calX$ is large relative to $d$. Since the cost of CBQ is $\calO(TN^3+T^3)$, we note that we can take $N=\calO(T^{\frac{2}{3}})$ without increasing the overall cost.

% The rate in $T$ will depend on the smoothness of $I(\theta)$ through the smoothness parameter $r \in [1/2, 1]$, which in turns depends on the smoothenss of the integrand $f$ and the density $p_\theta$ in $\theta$. The smoother these are, the faster the convergence rate will be.  
% Although we are not aware of such a result, we can expect the same rate in $T$ to hold for KLSMC since it is based on kernel ridge regression. On the other hand, LSMC will be inherently limited due to the use of linear or polynomial regression, and we expect it may not be possible to show consistency when $I(\theta)$ is not a polynomial in $\theta$.



