\section{Background}
\label{sec:background}
In this section, we provide background on function-space variational inference in BNNs and discuss the fundamental issue of an infinite KL divergence.
We then introduce the regularized KL divergence, which is the basis for our solution presented in Section~\ref{sec:methods}. 
%
\subsection{Function-space VI in BNNs}
\label{sec:fsvi_in_bnns}
%
We consider a neural network $f({\,\cdot\,; \vw})$ with weights $\vw \in \real^p$, and a data set $\mathcal{D} = \{(\vx_i, y_i)\}_{i=1}^N$ with features~$\vx_i \in \mathcal{X} \subset \real^d$ and values~$y_i \in \mathcal{Y}$.
Bayesian Neural Networks are specified further by a likelihood function $\prob{\mathcal D \g \vw} = \prod_{i=1}^N \prob{y_i \g f(\vx_i;\vw)}$ and---traditionally---a prior $\prob{\vw}$ on the weights, and one seeks the posterior distribution $\prob{\vw \g \mathcal{D}} \propto \prob{\mathcal D \g \vw} \, \prob{\vw}$.
The method proposed in this paper builds on variational inference, which approximates $\prob{\vw \g \mathcal D}$ with a variational distribution $q_\phi\br{\vw}$, whose variational parameters~$\phi$ maximize the evidence lower bound (ELBO),
\begin{align}\label{eq:elbo-weight space}
    \mathcal L(\phi) := \expect{\log \prob{\mathcal D \g \vw}}{q_\phi\br{\vw}} - \kl{q_\phi}{p}
\end{align}
where $D_\text{KL}$ is the Kullback-Leibler divergence,
\begin{equation}\label{eq:kl-weight space}
    \kl{q_\phi}{p} := \expect{\log\br{q_\phi\br{\vw} \big/ \prob{\vw}}}{q_\phi\br{\vw}}.
\end{equation}
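The term ``lower bound'' refers to the standard decomposition of the log evidence, which we spell out here for reference (a textbook identity, written in the notation above),
\begin{equation*}
    \log \prob{\mathcal D} = \mathcal L(\phi) + \kl{q_\phi\br{\vw}}{\prob{\vw \g \mathcal D}} \geq \mathcal L(\phi),
\end{equation*}
so that maximizing $\mathcal L(\phi)$ over~$\phi$ is equivalent to minimizing the KL divergence between $q_\phi$ and the exact posterior.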
At test time, we approximate the predictive distribution for given features~$\vx^*$ as $\prob{y^* \g \vx^*} = \expect{\big.\prob{y^* \g f(\vx^*;\vw)}}{\prob{\vw \g \mathcal{D}}} \approx \expect{\big.\prob{y^* \g f(\vx^*;\vw)}}{q_\phi(\vw)}$.
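The expectation over $q_\phi$ can be estimated by Monte Carlo sampling. As a sketch, with a number of weight samples~$S$ that we introduce here purely for illustration,
\begin{equation*}
    \prob{y^* \g \vx^*} \approx \frac{1}{S} \sum_{s=1}^S \prob{y^* \g f(\vx^*;\vw^{(s)})}, \qquad \vw^{(s)} \sim q_\phi\br{\vw}.
\end{equation*}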
%
\paragraph{Function-space variational inference.}
%
Since neural network weights are not interpretable, we replace the weight-space prior $p(\vw)$ with a prior~$\measureP$ directly on the function $f\br{\,\cdot\,;\vw}$, which we denote simply as~$f$ when there is no ambiguity.
Here, the symbol~$\measureP$ denotes a probability measure that does not admit a density since the function space is infinite-dimensional.
We compute the expected log-likelihood as in the first term of Eq.~\ref{eq:elbo-weight space}.
For the KL-term (Eq.~\ref{eq:kl-weight space}), a naive VI-method would use the pushforward of $q_\phi(\vw)$ along $\vw\mapsto f\br{\,\cdot\,;\vw}$, which defines the variational measure~$\measureQ_\phi$, resulting in the ELBO in function space,
\begin{equation}\label{eq:elbo-function space}
    \mathcal L(\phi) := \expect{\log \prob{\mathcal D \g \vw}}{q_\phi\br{\vw}} - \kl{\measureQ_\phi}{\measureP}
\end{equation}
with $D_\text{KL}$ the KL divergence between measures
\begin{equation}\label{eq:kl-function space}
    \kl{\measureQ_\phi}{\measureP} := \int\! \log \br{\!\frac{d\measureQ_\phi}{d\measureP}(f)\!} d\measureQ_\phi.
\end{equation}
Here, the Radon-Nikodym derivative $d\measureQ_\phi / d\measureP$ generalizes the density ratio from Eq.~\ref{eq:kl-weight space}.
Like Eq.~\ref{eq:elbo-weight space}, the ELBO in Eq.~\ref{eq:elbo-function space} is a lower bound on the evidence \citep{burt2020understanding}.
In fact, if $\measureP$ is the pushforward of $p(\vw)$, then Eq.~\ref{eq:elbo-function space} is a tighter bound than Eq.~\ref{eq:elbo-weight space} by the data processing inequality, $\kl{\measureQ_\phi}{\measureP} \leq \kl{q_\phi}{p}$.
However, we motivated function-space VI to avoid weight-space priors, and in this case the bound in Eq.~\ref{eq:elbo-function space} can be looser.
We will indeed see below that the bound becomes infinitely loose in practice, and we thus propose a different objective in Section~\ref{sec:methods}.

Two intractabilities prevent directly maximizing the ELBO in function space (Eq.~\ref{eq:elbo-function space}).
First, it is not obvious how to evaluate or estimate the KL divergence between two measures in Eq.~\ref{eq:kl-function space}.
\citet{sun2018functional} showed that it can be expressed as a supremum of KL divergences between finite-dimensional distributions,
\begin{equation} \label{eq:sun_kl}
    \kl{\measureQ_\phi}{\measureP} = \sup_{\vx \in \mathcal{X}^M, M \in \mathbb{N}} \kl{q_\phi(f(\vx))}{p(f(\vx))}.
\end{equation}
Here, $\vx = \{\vx^{(i)}\}_{i=1}^M \in \mathcal{X}^M$ is a set of $M$ points in feature space~$\mathcal X$, and $q_\phi(f(\vx))$ and $p(f(\vx))$ are densities of the marginals of $\measureQ_\phi$ and~$\measureP$ on~$\{f(\vx^{(i)})\}_{i=1}^M$ respectively.
\citet{sun2018functional} approximate the supremum over infinitely many sets by an expectation, and \citet{rudner2022fsvi} estimate it from samples.

Second, we cannot express the pushforward measure~$\measureQ_\phi$ in closed form because the neural network is nonlinear. 
Previous work has proposed to mitigate this issue using implicit score function estimators \citep{sun2018functional} or a linearized BNN~$f_L$ to obtain a closed-form Gaussian variational measure \citep{rudner2022ContinualLearningFSVI, rudner2022fsvi}.
Our proposal in Section~\ref{sec:methods} follows the linearized BNN approach as it only minimally modifies the BNN, preserving most of its inductive bias \citep{maddox2021FastAdapt} while considerably simplifying the problem by turning the pushforward of $q_\phi(\vw)$ into a GP.
More specifically, we consider a Gaussian variational distribution $q_{\phi}\br{\vw} = \gaussian{\vm}{\mS}$ with parameters $\phi = \set{\vm, \mS}$, 
and we define a linearized BNN~$f_L$ by linearizing~$f$ as a function of the weights around~$\vw=\vm$,
\begin{equation}\label{eq:linearized_nn}
    f_L(\vx; \vw) := f(\vx; \vm) + J(\vx; \vm)(\vw - \vm)
\end{equation}
with $J(\vx; \vm) = \grad{\!\vw}{f({\vx; \vw})}|_{\vw=\vm}$. Thus, $\vw\sim q_\phi(\vw)$ implies $f_L(\vx;\vw) \sim \gaussian{f(\vx; \vm)}{J(\vx;\vm) \mS J(\vx;\vm)\tp}$ for all~$\vx \in \mathcal{X}$, and so the function $f_L(\,\cdot\,; \vw)$ is a degenerate GP (as $\operatorname{rank}(J(\,\cdot\,;\vm) \mS J(\,\cdot\,;\vm)\tp) \leq p <\infty$),
\begin{equation}\label{eq:linearized-q}
    f_L \sim \mathcal{GP}\br{f(\,\cdot\,; \vm), J(\,\cdot\,;\vm) \mS J(\,\cdot\,;\vm)\tp}.
\end{equation}
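To make Eq.~\ref{eq:linearized-q} concrete, consider its marginal on a finite set of $M$ points $\vx = \{\vx^{(i)}\}_{i=1}^M$. As a sketch, assume a scalar-valued network output, let $f(\vx;\vm) \in \real^M$ collect the values $f(\vx^{(i)};\vm)$, and let $J_{\vx} \in \real^{M \times p}$ (a symbol introduced here only for illustration) stack the row vectors $J(\vx^{(i)};\vm)$. Then
\begin{equation*}
    f_L(\vx;\vw) \sim \gaussian{f(\vx;\vm)}{J_{\vx}\, \mS\, J_{\vx}\tp}, \qquad \operatorname{rank}\br{J_{\vx}\, \mS\, J_{\vx}\tp} \leq \min(M, p),
\end{equation*}
so the $M \times M$ covariance matrix is necessarily singular once $M > p$.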
%
\paragraph{$\kl{\measureQ_\phi}{\measureP}$ is infinite in most relevant cases.}
%
\citet{burt2020understanding} point out an even more severe issue of function-space VI in BNNs: $\kl{\measureQ_\phi}{\measureP}$ (Eq.~\ref{eq:kl-function space}) is in fact infinite in most relevant cases, in particular for non-degenerate GP-priors.
Thus, approximating $\kl{\measureQ_\phi}{\measureP}$ in these settings is futile.
Their proof is somewhat involved, but the fundamental reason for $\kl{\measureQ_\phi}{\measureP}=\infty$ is that $\measureQ_\phi$~has support on a finite-dimensional submanifold of the infinite-dimensional function space, while the measure~$\measureP$ induced by a (non-degenerate) GP prior has support on the entire function space.
That such a dimensionality mismatch can lead to infinite KL divergence can already be seen in a finite-dimensional example: consider the KL-divergence between two Gaussians in~$\mathbb R^n$ for~$n\geq 2$, one of which has support on the entire~$\mathbb R^n$ (i.e., its covariance matrix~$\mSigma_1$ has full rank) while the other one has support only on a proper subspace of~$\mathbb R^n$ (i.e., its covariance matrix~$\mSigma_2$ is singular).
The KL divergence between multivariate Gaussians has a closed form expression (Eq.~\ref{eq:reg-kl-estimator} with ${\gamma=0}$) that contains $\log\det\br{\mSigma_2^{-1}\mSigma_1}$, which is infinite for singular~$\mSigma_2$.
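As a concrete instance (a toy example with hypothetical values, for $n=2$ and zero means), take $\mSigma_1 = \eye_2$ and $\mSigma_2 = \operatorname{diag}(1, \epsilon)$ with $\epsilon > 0$, so that $\mSigma_2$ becomes singular as $\epsilon \to 0$. The closed-form expression then gives
\begin{align*}
    \kl{\mathcal N(\mathbf 0, \mSigma_1)}{\mathcal N(\mathbf 0, \mSigma_2)} &= \frac{1}{2} \br{\frac{1}{\epsilon} - 1 + \log \epsilon}, \\
    \kl{\mathcal N(\mathbf 0, \mSigma_2)}{\mathcal N(\mathbf 0, \mSigma_1)} &= \frac{1}{2} \br{\epsilon - 1 - \log \epsilon},
\end{align*}
and both expressions diverge to $+\infty$ as $\epsilon \to 0$, so the divergence is infinite regardless of which of the two Gaussians is the singular one.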

We find that the fact that $\kl{\measureQ_\phi}{\measureP}=\infty$ has severe practical consequences even when the KL divergence is only estimated from finite samples.
It naturally explains the stability issues discussed in Appendix~D.1 of \citet{sun2018functional}
(we compare the authors' solution to this stability issue to our method in \Cref{sec:gfsvi-comparison}).
Surprisingly, similar complications arise even in the setup by \citet{rudner2022fsvi}, which performs VI in function space with the pushforward of a weight-space prior.
While this makes the KL divergence technically finite because prior and variational posterior have the same support, numerical errors lead to mismatching supports and thus to stability issues even there.

In summary, the ELBO for VI in BNNs is not well-defined for most interesting function-space priors.
In \cref{sec:methods}, we propose a solution by using the so-called regularized KL divergence, which we introduce next.

\subsection{Regularized KL divergence}
\label{sec:reg_kl}
%
Our solution to the infinitely negative function-space ELBO builds on a regularized KL divergence, which is defined between Gaussian measures representing the variational posterior and the prior.
We obtain these Gaussian measures from GPs.
We first discuss under which conditions a GP induces a Gaussian measure, and then present the regularized KL divergence.
%
\paragraph{Gaussian measures and Gaussian processes.}
\label{par:gaussian-measures}
%
The regularized KL divergence is defined in terms of Gaussian measures, and thus we need to verify that the GP variational posterior induced by the linearized BNN (Eq.~\ref{eq:linearized-q}) has an associated Gaussian measure.
We consider the Hilbert space $\Ltwo{\mathcal{X}}{\rho}$ of square-integrable functions with respect to a probability measure $\rho$ on a compact set $\mathcal{X} \subset \real^d$, with inner product $\innerProd{f}{g} = \int_\mathcal{X} f(x) g(x) d\rho(x)$. 
The compactness assumption is not restrictive since we can typically bound the region in feature space that contains the data and any points where we might want to evaluate the BNN.
\begin{definition}[Gaussian measure, \citet{kerrigan2023diffusion}, Definition 1]\label{def:gm}
Let $(\Omega, \mathcal{B}, \measureP)$ be a probability space. A measurable function $F: \Omega \to \Ltwo{\mathcal{X}}{\rho}$ is called a Gaussian random element (GRE) if for any $g \in \Ltwo{\mathcal{X}}{\rho}$ the random variable $\innerProd{g}{F}$ has a Gaussian distribution on $\real$.
For every GRE $F$, there exists a unique mean element $m \in \Ltwo{\mathcal{X}}{\rho}$ and a trace-class linear covariance operator $C: \Ltwo{\mathcal{X}}{\rho} \to \Ltwo{\mathcal{X}}{\rho}$ such that $\innerProd{g}{F} \sim \gaussian{\innerProd{g}{m}}{\innerProd{Cg}{g}}$ for all $g \in \Ltwo{\mathcal{X}}{\rho}$.
The pushforward of $\measureP$ along~$F$, denoted $\measureP^F := F_{\#} \measureP$, is a Gaussian measure on $\Ltwo{\mathcal{X}}{\rho}$.
\end{definition}

Gaussian measures generalize Gaussian distributions to infinite-dimensional function spaces, where measures do not have associated densities since there is no infinite-dimensional analogue of the Lebesgue measure.
Following \citet{wild2022gvi}, we denote the Gaussian measure obtained from the GRE $F$ with mean element $m$ and covariance operator $C$ as $\measureP^F = \gaussian{m}{C}$.
%
GPs provide a practical tool to specify Gaussian measures via mean and covariance functions \citep{kerrigan2023diffusion}.
A GP $f \sim \mathcal{GP}(\mu, K)$ has an associated Gaussian measure in $\Ltwo{\mathcal{X}}{\rho}$ if its mean function satisfies $\mu \in \Ltwo{\mathcal{X}}{\rho}$ and its covariance function $K$ is trace-class, i.e., if $\int_{\mathcal{X}} K(x, x) d \rho \br{x} < \infty$ \citep[Theorem~1]{wild2022gvi}.
The GP variational posterior induced by the linearized BNN satisfies both properties, as the network output $f(\,\cdot\,;\vm)$ and its Jacobian $J(\,\cdot\,;\vm)$ are bounded on the compact set $\mathcal{X} \subset \real^d$ for common architectures.
It thus induces a Gaussian measure $\measureQ_\phi^F = \gaussian{m_Q}{C_Q}$ with mean element $m_Q = f(\,\cdot\,; \vm)$ and covariance operator $C_Q g \br{\cdot} = \int_{\mathcal{X}} J(\,\cdot\,; \vm) \mS J(\vx'; \vm)\tp g(\vx') d\rho\br{\vx'}$.
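The trace-class condition for the linearized BNN can be checked directly. As a sketch, assume a scalar-valued network output and a Jacobian that is bounded on~$\mathcal{X}$ (both assumptions made here for illustration), and write $K_Q(\vx,\vx') = J(\vx;\vm)\,\mS\,J(\vx';\vm)\tp$ for the covariance function, a symbol introduced only for this check. Then
\begin{equation*}
    \int_{\mathcal{X}} K_Q(\vx,\vx)\, d\rho\br{\vx} \leq \lambda_{\max}\br{\mS}\, \sup_{\vx\in\mathcal{X}} \| J(\vx;\vm) \|_2^2 < \infty,
\end{equation*}
where $\lambda_{\max}\br{\mS}$ denotes the largest eigenvalue of~$\mS$ and we used that $\rho$ is a probability measure.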
The infinite KL divergence discussed in Section~\ref{sec:fsvi_in_bnns} is easier to prove for the special case of Gaussian measures, and we provide the proof in \cref{sec:infinite_kl}.
%
\begin{definition}[Regularized KL divergence, \citet{quang2022gpkl}, Definition 5]
Let $\nu_1=\gaussian{m_1}{C_1}$ and $\nu_2=\gaussian{m_2}{C_2}$ be two Gaussian measures with $m_1, m_2 \in \Ltwo{\mathcal{X}}{\rho}$ and $C_1, C_2$ bounded, self-adjoint, positive and trace-class linear operators on $\Ltwo{\mathcal{X}}{\rho}$. 
Let $\gamma \in \real_{>0}$ be fixed. 
The regularized KL divergence is defined as follows,
\begin{align}
&\regkl{\nu_1}{\nu_2} := \frac{1}{2} \innerProd{m_1 - m_2}{(C_2 + \gamma \eye)^{-1} \br{m_1 - m_2}} \nonumber\\
&\qquad+ \frac{1}{2} \operatorname{Tr}_X \left[ \br{C_2 + \gamma \eye}^{-1} \br{C_1 + \gamma \eye} - \eye \right] \nonumber\\
&\qquad- \frac{1}{2} \log \operatorname{det}_X \left[ \br{C_2 + \gamma \eye}^{-1} \br{C_1 + \gamma \eye}\right]. \label{eq:regkl-definition}
\end{align}
\end{definition}
Here $\operatorname{Tr}_X$ and $\operatorname{det}_X$ are the extended trace and extended Fredholm determinant \citep{quang2022gpkl}.
For any $\gamma > 0$, the regularized KL divergence is well-defined and finite (following \citet[Proposition 1]{quang2017infinitedimensionallogdeterminantdivergencesii}), even if the Gaussian measures are singular \citep{quang2019regularizedKL}.
%
It converges to the conventional KL divergence (if it is well-defined) for $\gamma \to 0$ (\citealp[Theorem 6]{quang2022gpkl}). 
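The effect of~$\gamma$ can be illustrated in a finite-dimensional analogue of Eq.~\ref{eq:regkl-definition}, in which the operators become matrices and, for the arguments appearing in the definition, we take $\operatorname{Tr}_X$ and $\operatorname{det}_X$ to coincide with the ordinary trace and determinant. As a toy example with hypothetical values, let $m_1 = m_2$, $C_1 = \eye_2$, and the singular $C_2 = \operatorname{diag}(1, 0)$. Then $\br{C_2 + \gamma \eye_2}^{-1}\br{C_1 + \gamma \eye_2} = \operatorname{diag}\br{1, 1 + 1/\gamma}$ and
\begin{equation*}
    \regkl{\nu_1}{\nu_2} = \frac{1}{2} \br{\frac{1}{\gamma} - \log\br{1 + \frac{1}{\gamma}}},
\end{equation*}
which is finite for every $\gamma > 0$ and diverges as $\gamma \to 0$, consistent with the unregularized KL divergence being infinite in this singular case.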
Furthermore, if the Gaussian measures $\nu_1$ and $\nu_2$ are induced by GPs
$\mathcal{GP}(\mu_i, K_i)$ for $i=1,2$, respectively, then $\regkl{\nu_1}{\nu_2}$ is consistently estimated~\citep{quang2022gpkl} by
\begin{align}
    \regklhat{\nu_1}{\nu_2} :=&\, \frac{1}{2} \br{\vm_1 - \vm_2}\tp (\mSigma^{(\gamma)}_2)^{-1} \br{\vm_1 - \vm_2} \nonumber\\
    +&\, \frac{1}{2} \operatorname{Tr}\big[(\mSigma_2^{(\gamma)})^{-1}\mSigma^{(\gamma)}_1 - \eye_M\big] \nonumber\\
    -&\, \frac{1}{2} \log \det \big[(\mSigma_2^{(\gamma)})^{-1}\mSigma^{(\gamma)}_1\big] \label{eq:reg-kl-estimator}
\end{align}
with $\vm_i:=\mu_i(\vx)$ and $\mSigma_i^{(\gamma)}:=K_i(\vx,\vx) + \gamma M \,\eye_M$, where $\mu_i(\vx)$ and $K_i(\vx,\vx)$ are the mean vector and the covariance matrix obtained by evaluating $\mu_i$ and $K_i$, respectively, at measurement points $\vx = \{\vx^{(i)}\}_{i=1}^M \oset[.40ex]{\textup{\tiny i.i.d.}}{\sim} \rho(\vx)$.
The right-hand side of Eq.~\ref{eq:reg-kl-estimator} is the expression for the KL-divergence between Gaussian distributions $\mathcal N({\vm_1},{\mSigma_1^{(\gamma)}})$ and $\mathcal N({\vm_2},{\mSigma_2^{(\gamma)}})$.
\citet{quang2022gpkl} shows that the absolute error of the estimator is bounded by $\mathcal{O}(\sqrt{1/M})$ with high probability with constants depending on $\gamma$ and properties of the GP mean and covariance functions (see \cref{sec:app_reg_kl} for the exact bound).
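For intuition, consider the simplest case of a single measurement point, $M=1$ (a choice made here only for illustration). Writing $k_i := K_i(\vx^{(1)}, \vx^{(1)}) \geq 0$, so that $\mSigma_i^{(\gamma)} = k_i + \gamma$ and $\vm_i = \mu_i(\vx^{(1)})$ are scalars, Eq.~\ref{eq:reg-kl-estimator} reduces to
\begin{equation*}
    \regklhat{\nu_1}{\nu_2} = \frac{1}{2} \left[ \frac{(\vm_1 - \vm_2)^2}{k_2 + \gamma} + \frac{k_1+\gamma}{k_2+\gamma} - 1 - \log\frac{k_1+\gamma}{k_2+\gamma} \right],
\end{equation*}
which remains finite for any $\gamma > 0$ even if $k_1 = 0$ or $k_2 = 0$.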