\section{Generalized function-space VI with the regularized KL divergence}
\label{sec:methods}
%
This section presents our proposed generalized function-space variational inference (GFSVI) method. It addresses the problem of the infinite KL divergence discussed in \cref{sec:fsvi_in_bnns}, which we take as an indication that VI is too restrictive if one wants to use genuine function-space priors.
We instead consider generalized variational inference \citep{knoblauch2019generalized}, which reinterprets the ELBO in Eq.~\ref{eq:elbo-weight space} as a regularized expected log-likelihood and explores alternative divergences for the regularizer.
Specifically, we propose to use the regularized KL divergence.
This section builds heavily on tools introduced in \cref{sec:background}, which turn out to fit together perfectly: the pushforward of a Gaussian variational distribution in weight-space through the linearized neural network (Eq.~\ref{eq:linearized_nn}) induces a GP variational posterior (Eq.~\ref{eq:linearized-q}) that admits a Gaussian measure on $\Ltwo{\mathcal{X}}{\rho}$. % (\cref{def:gm}).
Further, selecting a GP prior which has an associated Gaussian measure on $\Ltwo{\mathcal{X}}{\rho}$ allows us to use the regularized KL divergence (Eq.~\ref{eq:regkl-definition}). 
We present GFSVI in \cref{sec:gfsvi} and compare it to prior work in \cref{sec:gfsvi-comparison}.
%
\subsection{Generalized function-space VI}
\label{sec:gfsvi}
%
We present a well-defined objective for function-space inference, and a simple algorithm for its optimization.
%
\paragraph{Objective function.}
%
We start from the ELBO in Eq.~\ref{eq:elbo-function space}, where we use the Gaussian variational measure~$\measureQ_\phi^F$ induced by the pushforward of a Gaussian variational distribution $q_\phi(\vw) = \gaussianx{\vw}{\vm}{\mS}$ along the linearized network~$f_L$ (Eq.~\ref{eq:linearized_nn}). 
The function-space prior may be any GP that has an associated Gaussian measure~$\measureP^F$ on $\Ltwo{\mathcal{X}}{\rho}$.
We now replace the KL divergence in the ELBO with the regularized KL divergence $D_\text{KL}^\gamma$ (Eq.~\ref{eq:regkl-definition}), which is well-defined and finite for any pair of Gaussian measures.
For a likelihood function $\prob{\mathcal{D} \g \vw} = \prod_{i=1}^N \prob{y_i \g f_L(\vx_i; \vw)}$, we obtain
\begin{equation}\label{eq:objective}
    \mathcal{L}(\phi) \!:=\!\! \sum_{i=1}^N \expect{\log \prob{y_i \g f_L(\vx_i; \vw)}}{q_\phi(\vw)} \!-\! \regkl{\measureQ_\phi^F}{\measureP^F}.
\end{equation}
%
\paragraph{Estimation and optimization.}
The expected log-likelihood (first term in Eq.~\ref{eq:objective}) can be estimated by sampling from $q_\phi(\vw)$.
For a Gaussian likelihood, it can also be computed in closed form because, unlike \citet{rudner2022fsvi}, we use the linearized network~$f_L$, which made training more stable in our experiments.
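As a worked illustration, assume a scalar output, a Gaussian likelihood $\prob{y_i \g f_L(\vx_i; \vw)} = \gaussianx{y_i}{f_L(\vx_i; \vw)}{\sigma^2}$ with noise variance~$\sigma^2$ (notation introduced only for this example), and a linearization around the variational mean, $f_L(\vx; \vw) = f(\vx; \vm) + J(\vx; \vm)(\vw - \vm)$.
Under these assumptions, $f_L(\vx_i; \vw)$ is Gaussian under $q_\phi(\vw)$ with mean $f(\vx_i; \vm)$ and variance $J(\vx_i; \vm)\, \mS\, J(\vx_i; \vm)\tp$, so that each term of the expected log-likelihood evaluates to
\begin{equation*}
    \expect{\log \prob{y_i \g f_L(\vx_i; \vw)}}{q_\phi(\vw)}
    = -\frac{1}{2} \log\big(2\pi\sigma^2\big)
    - \frac{\big(y_i - f(\vx_i; \vm)\big)^2 + J(\vx_i; \vm)\, \mS\, J(\vx_i; \vm)\tp}{2\sigma^2}.
\end{equation*}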
We estimate the regularized KL divergence (second term in Eq.~\ref{eq:objective}) using its consistent estimator (see Eq.~\ref{eq:reg-kl-estimator}), with
$\vm_1 = f(\vx; \vm)$,
$\mSigma_1^{(\gamma)} = J(\vx; \vm) \mS J(\vx; \vm)\tp+\gamma M \eye_M$,
$\vm_2 = \mu(\vx)$, and
$\mSigma_2^{(\gamma)} = K(\vx, \vx)+\gamma M \eye_M$,
where $\mu$ and~$K$ are the mean and covariance functions of the GP prior, and $\vx = \{\vx^{(i)}\}_{i=1}^M \oset[.40ex]{\textup{\tiny i.i.d.}}{\sim} \rho(\vx)$ are measurement points.
We maximize the estimated objective over the mean~$\vm$ and covariance~$\mS$ of the Gaussian variational distribution $q_\phi(\vw)$, and over any likelihood parameter (e.g., the variance of a Gaussian likelihood), see Algorithm~\ref{alg:fsvi}.
\cref{app:sec_gfsvi_estimator} provides expressions for the estimator with Gaussian and Categorical likelihoods as well as an analysis of their computational complexity.
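For concreteness, assuming that Eq.~\ref{eq:reg-kl-estimator} takes the form of the KL divergence between the $M$-dimensional Gaussians $\mathcal{N}\big(\vm_1, \mSigma_1^{(\gamma)}\big)$ and $\mathcal{N}\big(\vm_2, \mSigma_2^{(\gamma)}\big)$ (the definition in Eq.~\ref{eq:reg-kl-estimator} remains the authoritative one), the standard closed form for Gaussians would give
\begin{equation*}
    \hat D_\text{KL}^\gamma \big(\measureQ_\phi^F \,\big|\!\big|\, {\measureP^F}\big)
    = \frac{1}{2} \bigg[
        \operatorname{tr}\Big( \big(\mSigma_2^{(\gamma)}\big)^{-1} \mSigma_1^{(\gamma)} \Big) - M
        + (\vm_1 - \vm_2)\tp \big(\mSigma_2^{(\gamma)}\big)^{-1} (\vm_1 - \vm_2)
        + \log \frac{\det \mSigma_2^{(\gamma)}}{\det \mSigma_1^{(\gamma)}}
    \bigg].
\end{equation*}
In this form, every term involves only $M \times M$ matrices evaluated at the measurement points~$\vx$.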
%
\paragraph{Technical details ($\gamma$ and $\rho$).}
%
It turns out that increasing~$\gamma$ reduces the influence of the prior on inference (see \cref{fig:regkl_vs_gamma}).
At the same time, $\gamma$~acts as jitter that prevents numerical errors (see \cref{sec:gfsvi-comparison}).
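Both effects can be made explicit under the assumption that the estimator in Eq.~\ref{eq:reg-kl-estimator} is the Gaussian KL divergence between $\mathcal{N}\big(\vm_1, \mSigma_1^{(\gamma)}\big)$ and $\mathcal{N}\big(\vm_2, \mSigma_2^{(\gamma)}\big)$.
Since the unregularized covariances $J(\vx; \vm)\, \mS\, J(\vx; \vm)\tp$ and $K(\vx, \vx)$ are positive semi-definite, all eigenvalues of $\mSigma_1^{(\gamma)}$ and $\mSigma_2^{(\gamma)}$ are at least $\gamma M > 0$, so the inverse and log-determinant exist even when the unregularized matrices are singular.
Moreover, the quadratic term of the Gaussian KL divergence is bounded as
\begin{equation*}
    (\vm_1 - \vm_2)\tp \big(\mSigma_2^{(\gamma)}\big)^{-1} (\vm_1 - \vm_2)
    \le \frac{\lVert \vm_1 - \vm_2 \rVert_2^2}{\gamma M},
\end{equation*}
and the remaining terms vanish as $\gamma \to \infty$ because $\big(\mSigma_2^{(\gamma)}\big)^{-1} \mSigma_1^{(\gamma)} \to \eye_M$, so that the estimated divergence tends to zero for large~$\gamma$.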
We recommend setting~$\gamma$ large enough to avoid numerical errors but sufficiently small to strongly regularize the objective in Eq.~\ref{eq:objective} (see \cref{fig:fsvi_influence_gamma} in the appendix) and setting $M$ to the largest value allowed by the computational budget.
We found that the estimator $\hat D_\text{KL}^\gamma \big(\measureQ_\phi^F \,\big|\!\big|\, {\measureP^F}\big)$ converges quickly to a finite value (especially for smooth kernels, see \cref{fig:regkl_vs_gamma} in the appendix), and that GFSVI is robust to a wide range of values of~$\gamma$ (we fixed $\gamma=10^{-10}$).
The probability measure~$\rho$ for $\Ltwo{\mathcal{X}}{\rho}$ has to assign non-zero probability to any open set of $\mathcal{X}$ to regularize the BNN on all of its support.
Following \citet{rudner2022fsvi}, we draw measurement points from a uniform distribution over~$\mathcal{X}$ when using tabular data and explore different configurations (samples from other data sets) for high-dimensional image data (see \cref{app:sec_classification_details}).
%
\begin{figure*}[t]
    \centering
    \resizebox{\linewidth}{!}{
    \includegraphics[width=\linewidth]{plots/fsvi_Matern12_vs_baselines.pdf}
    }
    \caption{Inference on synthetic data (gray circles) using a Matérn-1/2 prior for function-space methods GFSVI and FVI.
    The proposed GFSVI provides the best approximation of the exact GP posterior.}
    \label{fig:fsvi_matern_vs_baselines}
\end{figure*}
%
\begin{algorithm}[t]
    \caption{Generalized function-space variational inference (GFSVI)}
    \label{alg:fsvi}
    \begin{algorithmic}[1]
    \Require Linearized BNN $f_L$ with measure $\measureQ_\phi^F$, GP prior $\mathcal{GP}(\mu, K)$ with measure $\measureP^F$, measurement point distribution~$\rho(\vx)$, data $\mathcal{D} = \{(\vx_i, y_i)\}_{i=1}^N$, batch size~$B$, $\gamma > 0$.
    \ForAll{\textnormal{minibatch }$(\vx_\mathcal{B}, y_\mathcal{B}) \sim \mathcal{D}$}
        \State \hbox{Calculate $\hat\ell_1 = \frac{N}{B}\expect{\log \prob{y_\mathcal{B} \g f_L(\vx_\mathcal{B}; \vw)}}{q_{\phi}(\vw)}$;}
        \State \hbox{Draw measurement points $\vx = \{\vx^{(i)}\}_{i=1}^M \stackrel{\!\!\text{i.i.d.}\!\!}{\sim} \rho(\vx)$;}
        \State \hbox{Calculate $\hat\ell_2 = \hat D_\text{KL}^\gamma \big(\measureQ_\phi^F \,\big|\!\big|\, {\measureP^F}\big)$ using $\vx$ (Eq.~\ref{eq:reg-kl-estimator});}
        \State Calculate $\hat{\mathcal{L}}(\phi) = \hat\ell_1 - \hat\ell_2$;
        \State Update $\phi$ using a step in the direction $\nabla_{\!\phi} \hat{\mathcal{L}}(\phi)$;
    \EndFor
\end{algorithmic}
\end{algorithm}
\subsection{Connections to prior work}
\label{sec:gfsvi-comparison}
%
TFSVI \citep{rudner2022fsvi} and FVI \citep{sun2018functional} solve stability issues by introducing jitter or white noise, which has an effect similar to that of the regularization in Eq.~\ref{eq:regkl-definition}.
However, TFSVI introduces jitter only to overcome numerical issues and is fundamentally restricted to prior specification in weight space since its function-space prior is the pushforward of a weight-space prior.
Conversely, FVI adds white noise to prevent the KL divergence (Eq.~\ref{eq:kl-function space}) from blowing up as $M$ increases.
However, FVI does not linearize the BNN, and hence does not have access to an explicit variational measure in function space.
This severely complicates the estimation of (gradients of) the KL divergence in FVI, and the authors resort to implicit score function estimators, which make their method difficult to use in practice \citep{ma2021funcVIspg}.
Our proposed GFSVI does not suffer from these difficulties as the variational posterior is an explicit Gaussian measure.
This allows us to estimate the regularized KL divergence without sampling any noise or using implicit score function estimators.