\section{Experiments}
\label{sec:experiments}
%
In this section, we evaluate our generalized function-space variational inference (GFSVI) method qualitatively on synthetic data and quantitatively on real-world data.
We find that GFSVI accurately captures structural properties specified by the GP prior and that it performs competitively on regression, classification, and out-of-distribution detection tasks.
We also discuss the influence of the BNN's inductive biases.
%
\paragraph{Baselines.}
\label{sec:baselines}
%
We compare GFSVI to two weight-space inference methods, mean-field VI (MFVI) \citep{blundell2015weight} and linearized Laplace \citep{immer2021linlaplace}, and to three function-space inference methods, FVI \citep{sun2018functional}, TFSVI \citep{rudner2022fsvi} and VIP \citep{ma2019variational} (TFSVI performs inference in function space but with the pushforward of a weight-space prior; VIP uses a BNN prior).
All BNNs have the same architecture and fully-factorized Gaussian approximate posterior.
We also include results for a Gaussian process (GP) \citep{williams2006gaussian} when the size of the dataset allows it, for a sparse GP \citep{hensman2013gaussian} on regression tasks, and for a sparse GP with a posterior mean parameterized by a neural network (GWI) \citep{wild2022gvi}.
We consider the GP, sparse~GP and GWI as gold standards as they represent the exact (or near-exact) posterior for models with GP priors.
%
\begin{figure*}[t]
    \centering
    \resizebox{\linewidth}{!}{
    \includegraphics[width=\linewidth]{plots/ocean_current.pdf}
    }
    \caption{Results for the ocean current modeling experiment. We report the norm of the mean velocity vectors and the squared errors. We find that GFSVI, unlike TFSVI, accurately captures ocean current dynamics.}
    \label{fig:ocean_current}
\end{figure*}
% 
\paragraph{Qualitative results on synthetic data.}
\label{sec:uncert-viz}
%
We consider a $1$-dimensional regression task where the values~$y_i$ are sampled around $\sin(2\pi x_i)$ (circles in Figures~\ref{fig:fsvi_prior_varying_smoothness}--\ref{fig:fsvi_vs_fvi_robert} and~\ref{fig:fsvi_prior_elicitation}) and the two moons 2-dimensional binary classification task \citep{scikitlearn} (see Figures~\ref{fig:classification_gfsvi_RBF_vs_baselines} and~\ref{fig:classification_gfsvi_Matern12_vs_baselines}).
For regression, the green lines show functions sampled from the (approximate) posteriors, and the red lines are the inferred mean functions.
For classification, the first and second rows show the inferred mean probability of class~1 (blue dots) and twice its standard deviation with respect to posterior samples, respectively.
More details are given in Appendix~\ref{app:sec_synthetic_data}.
We find that GFSVI captures the beliefs of the RBF and Matérn-1/2 GP priors better than the BNN baselines in the regression setting (see~\cref{fig:fsvi_RBF_vs_baselines,fig:fsvi_matern_vs_baselines}) as well as in classification (see~\cref{fig:classification_gfsvi_RBF_vs_baselines,fig:classification_gfsvi_Matern12_vs_baselines}), and that it shows greater uncertainty outside of the support of the data.
\cref{fig:fsvi_prior_varying_smoothness,fig:fsvi_rbf_varying_lengscale} show that GFSVI notably adapts to varying prior assumptions (varying smoothness and length scale, respectively).
In addition, \cref{fig:fsvi_varying_label_noise,fig:fsvi_vs_fvi_robert} in the Appendix show that GFSVI provides strong regularization when the data generative process is noisy, and that it can be trained with fewer measurement points~$M$ than FVI without significant degradation.
%
\paragraph{Inductive biases.}
\cref{fig:fsvi_prior_elicitation} in the Appendix compares GFSVI to the exact GP-posterior across two different priors and three model architectures (details in Appendix~\ref{app:sec_synthetic_data}).
We find that, with ReLU activations, small models are prone to underfitting for smooth priors (RBF), and to collapsing uncertainty for rough priors (Matérn-1/2).
By contrast, with smooth activations (Tanh), smaller models suffice, and they are compatible with most standard GP priors (the results shown in \cref{fig:fsvi_prior_elicitation} extend to RBF, Matérn, and Rational Quadratic in our experiments).
We also analyzed how the number~$M$ of measurement points affects performance.
\cref{fig:fsvi_varying_n_context_points,fig:kernel_gram_eigendecay} in the Appendix show that capturing the beliefs of rough GP priors and estimating $D_\text{KL}^\gamma$ with these priors require a larger~$M$.
%
\subsection{Quantitative results on real-world data}
\label{sec:results-quantitative}
%
\begin{table*}[h!]
\scshape
\caption{Test expected log-likelihood (higher is better) of evaluated methods on regression datasets. GFSVI performs competitively compared to all BNN baselines and obtains the best mean rank.
}
\label{tab:expected_ll}
\resizebox{\linewidth}{!}{
\renewcommand{\arraystretch}{1.0}
\begin{tabular}{@{}lccccccccc@{}}
\toprule
\multicolumn{1}{@{}l}{Dataset} & \multicolumn{2}{c}{Function-space priors} & \multicolumn{4}{c}{Weight-space priors} & \multicolumn{3}{c}{Gaussian Processes (Gold Standards)} \\ 
\cmidrule(rl){2-3} \cmidrule(rl){4-7} \cmidrule(rl){8-10}
\multicolumn{1}{@{}l}{} & GFSVI (ours) & \multicolumn{1}{c}{FVI} & TFSVI & MFVI & VIP & Laplace & GWI & Sparse GP & GP  \\ 
\midrule
Boston & \textbf{-0.733 $\pm$ 0.144} & \textbf{-0.571 $\pm$ 0.113} & -1.416 $\pm$ 0.046 & -1.308 $\pm$ 0.052 & \textbf{-0.722 $\pm$ 0.196} & \textbf{-0.812 $\pm$ 0.205} & -0.940 $\pm$ 0.145 & -0.884 $\pm$ 0.182 & -1.594 $\pm$ 0.556 \\
Concrete & \textbf{-0.457 $\pm$ 0.041} & \textbf{-0.390 $\pm$ 0.017} & -0.983 $\pm$ 0.012 & -1.353 $\pm$ 0.018 & \textbf{-0.427 $\pm$ 0.050} & -0.715 $\pm$ 0.025 & -0.744 $\pm$ 0.079 & -0.966 $\pm$ 0.025 & -2.099 $\pm$ 0.421 \\
Energy & \textbf{\hphantom{-}1.319 $\pm$ 0.052} & \textbf{\hphantom{-}1.377 $\pm$ 0.042} & \hphantom{-}0.797 $\pm$ 0.098 & -0.926 $\pm$ 0.197 & \textbf{\hphantom{-}1.046 $\pm$ 0.378} & \textbf{\hphantom{-}1.304 $\pm$ 0.043} & \hphantom{-}0.461 $\pm$ 0.093 & -0.206 $\pm$ 0.027 & -0.205 $\pm$ 0.022 \\
Kin8nm & -0.136 $\pm$ 0.013 & -0.141 $\pm$ 0.023 & -0.182 $\pm$ 0.011 & -0.641 $\pm$ 0.225 & \textbf{-0.102 $\pm$ 0.013} & -0.285 $\pm$ 0.014 & -0.708 $\pm$ 0.054 & -0.443 $\pm$ 0.014 & \textit{(infeasible)} \\
Naval & \textbf{\hphantom{-}3.637 $\pm$ 0.132} & \hphantom{-}2.165 $\pm$ 0.194 & \hphantom{-}2.758 $\pm$ 0.044 & \hphantom{-}1.034 $\pm$ 0.160 & \hphantom{-}1.502 $\pm$ 0.061 & \hphantom{-}3.404 $\pm$ 0.084 & -0.301 $\pm$ 0.254 & \hphantom{-}4.951 $\pm$ 0.014 & \textit{(infeasible)} \\
Power & \textbf{\hphantom{-}0.044 $\pm$ 0.011} & \textbf{\hphantom{-}0.031 $\pm$ 0.021} & \hphantom{-}0.007 $\pm$ 0.013 & -0.003 $\pm$ 0.015 & \textbf{\hphantom{-}0.036 $\pm$ 0.018} & -0.002 $\pm$ 0.019 & \hphantom{-}0.043 $\pm$ 0.009 & -0.100 $\pm$ 0.010 & \textit{(infeasible)} \\
Protein & -1.036 $\pm$ 0.005 & -1.045 $\pm$ 0.005 & -1.010 $\pm$ 0.004 & -1.112 $\pm$ 0.007 & \textbf{-0.994 $\pm$ 0.007} & -1.037 $\pm$ 0.006 & -1.050 $\pm$ 0.009 & -1.035 $\pm$ 0.002 & \textit{(infeasible)} \\
Wine & -1.289 $\pm$ 0.040 & \textbf{-1.215 $\pm$ 0.007} & -2.138 $\pm$ 0.221 & -1.248 $\pm$ 0.018 & -1.262 $\pm$ 0.025 & -1.249 $\pm$ 0.025 & -1.232 $\pm$ 0.038 & -1.240 $\pm$ 0.037 & -1.219 $\pm$ 0.035 \\
Yacht & \textbf{\hphantom{-}1.058 $\pm$ 0.080} & \textbf{\hphantom{-}0.545 $\pm$ 0.735} & -1.187 $\pm$ 0.064 & -1.638 $\pm$ 0.030 & -0.062 $\pm$ 1.378 & \hphantom{-}0.680 $\pm$ 0.171 & \hphantom{-}0.441 $\pm$ 0.138 & -0.976 $\pm$ 0.092 & -0.914 $\pm$ 0.045 \\
Wave & \hphantom{-}5.521 $\pm$ 0.036 & \hphantom{-}6.612 $\pm$ 0.008 & \hphantom{-}5.148 $\pm$ 0.117 & \textbf{\hphantom{-}6.883 $\pm$ 0.008} & \hphantom{-}4.043 $\pm$ 0.093 & \hphantom{-}4.658 $\pm$ 0.027 & \hphantom{-}1.566 $\pm$ 0.123 & \hphantom{-}4.909 $\pm$ 0.001 & \textit{(infeasible)} \\
Denmark & \textbf{-0.487 $\pm$ 0.013} & -0.801 $\pm$ 0.005 & -0.513 $\pm$ 0.013 & -0.675 $\pm$ 0.007 & -0.583 $\pm$ 0.021 & -0.600 $\pm$ 0.008 & -0.841 $\pm$ 0.026 & -0.768 $\pm$ 0.001 & \textit{(infeasible)} \\
\midrule
Mean rank & 1.545 & 2.000 & 2.727 & 3.455 & 2.091 & 2.455 & - & - & - \\
\bottomrule
\end{tabular}
}
\end{table*}
%
%
\begin{table*}[h!]
\scshape
\caption{
Test expected log-likelihood, accuracy, expected calibration error (ECE) and OOD detection accuracy on MNIST and FashionMNIST.
}
\label{tab:classification}
\resizebox{\linewidth}{!}{
\renewcommand{\arraystretch}{1.}
\begin{tabular}{@{}llcccccccccc@{}}
\toprule
 & Metric & \multicolumn{4}{c}{Function-space priors} & \multicolumn{5}{c}{Weight-space priors} & GP-based \\
\cmidrule(rl){3-6} \cmidrule(rl){7-11} \cmidrule(rl){12-12} 
&  & GFSVI (rnd) & GFSVI (kmnist) & FVI (rnd) & FVI (kmnist) & TFSVI (rnd) & TFSVI (kmnist) & MFVI &VIP & Laplace & GWI \\
\midrule
\multirow{4}{*}{\parbox[t]{0pt}{\multirow{2}{*}{\rotatebox[origin=c]{90}{MNIST\hspace{-1.3em}}}}}
& Log-like.\ ($\uparrow$) & \textbf{-0.033 $\pm$ 0.000} & -0.041 $\pm$ 0.000 & -0.145 $\pm$ 0.005 & -0.238 $\pm$ 0.006 & -0.047 $\pm$ 0.003 & -0.041 $\pm$ 0.001 & -0.078 $\pm$ 0.001 & \textbf{-0.033 $\pm$ 0.001} & -0.108 $\pm$ 0.002 & -0.090 $\pm$ 0.003 \\
& Acc.\ ($\uparrow$) & \textbf{\hphantom{-}0.992 $\pm$ 0.000} & \hphantom{-}0.991 $\pm$ 0.000 & \hphantom{-}0.976 $\pm$ 0.001 & \hphantom{-}0.943 $\pm$ 0.001 & \hphantom{-}0.989 $\pm$ 0.000 & \hphantom{-}0.989 $\pm$ 0.000 & \hphantom{-}0.990 $\pm$ 0.000 & \hphantom{-}0.989 $\pm$ 0.000 & \hphantom{-}0.984 $\pm$ 0.000 & \hphantom{-}0.971 $\pm$ 0.001 \\
& ECE ($\downarrow$) & \textbf{\hphantom{-}0.002 $\pm$ 0.000} & \hphantom{-}0.006 $\pm$ 0.000 & \hphantom{-}0.064 $\pm$ 0.001 & \hphantom{-}0.073 $\pm$ 0.003 & \hphantom{-}0.007 $\pm$ 0.000 & \hphantom{-}0.006 $\pm$ 0.000 & \hphantom{-}0.021 $\pm$ 0.000 & \textbf{\hphantom{-}0.002 $\pm$ 0.001} & \hphantom{-}0.048 $\pm$ 0.001 & \hphantom{-}0.003 $\pm$ 0.000 \\
& OOD acc.\ ($\uparrow$) & \hphantom{-}0.921 $\pm$ 0.008 & \textbf{\hphantom{-}0.980 $\pm$ 0.004} & \hphantom{-}0.894 $\pm$ 0.010 & \hphantom{-}0.891 $\pm$ 0.006 & \hphantom{-}0.887 $\pm$ 0.011 & \hphantom{-}0.893 $\pm$ 0.005 & \hphantom{-}0.928 $\pm$ 0.002 & \hphantom{-}0.871 $\pm$ 0.012 & \hphantom{-}0.903 $\pm$ 0.007 & \hphantom{-}0.829 $\pm$ 0.007 \\
\midrule
\multirow{4}{*}{\parbox[t]{0pt}{\multirow{2}{*}{\rotatebox[origin=c]{90}{FMNIST\hspace{-1.5em}}}}}
& Log-like.\ ($\uparrow$) & -0.260 $\pm$ 0.003 & -0.294 $\pm$ 0.006 & -0.300 $\pm$ 0.002 & -0.311 $\pm$ 0.005 & -0.261 $\pm$ 0.001 & -0.261 $\pm$ 0.002 & -0.290 $\pm$ 0.002 & \textbf{-0.252 $\pm$ 0.001} & -0.426 $\pm$ 0.009 & -0.260 $\pm$ 0.001 \\
& Acc.\ ($\uparrow$) & \hphantom{-}0.910 $\pm$ 0.001 & \hphantom{-}0.909 $\pm$ 0.001 & \hphantom{-}0.910 $\pm$ 0.002 & \hphantom{-}0.906 $\pm$ 0.002 & \hphantom{-}0.909 $\pm$ 0.001 & \hphantom{-}0.907 $\pm$ 0.001 & \textbf{\hphantom{-}0.913 $\pm$ 0.001} & \hphantom{-}0.911 $\pm$ 0.001 & \hphantom{-}0.886 $\pm$ 0.001 & \hphantom{-}0.906 $\pm$ 0.000 \\
& ECE ($\downarrow$) & \hphantom{-}0.020 $\pm$ 0.003 & \hphantom{-}0.042 $\pm$ 0.002 & \hphantom{-}0.027 $\pm$ 0.005 & \hphantom{-}0.024 $\pm$ 0.002 & \hphantom{-}0.022 $\pm$ 0.002 & \hphantom{-}0.021 $\pm$ 0.002 & \textbf{\hphantom{-}0.010 $\pm$ 0.001} & \hphantom{-}0.024 $\pm$ 0.001 & \hphantom{-}0.060 $\pm$ 0.004 & \hphantom{-}0.016 $\pm$ 0.001 \\
& OOD acc.\ ($\uparrow$) & \hphantom{-}0.853 $\pm$ 0.005 & \textbf{\hphantom{-}0.997 $\pm$ 0.001} & \hphantom{-}0.925 $\pm$ 0.005 & \hphantom{-}0.975 $\pm$ 0.002 & \hphantom{-}0.802 $\pm$ 0.006 & \hphantom{-}0.779 $\pm$ 0.010 & \hphantom{-}0.805 $\pm$ 0.010 & \hphantom{-}0.790 $\pm$ 0.010 & \hphantom{-}0.826 $\pm$ 0.006 & \hphantom{-}0.792 $\pm$ 0.005 \\
\bottomrule
\end{tabular}
}
\end{table*}
We evaluate GFSVI on regression, classification, and out-of-distribution detection.
In all tables, we bold the best score among the BNN-based methods and any score whose error bar (standard error) overlaps with the best score's error bar.
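Written out, and assuming that ``overlap'' refers to the intervals $[m - s,\, m + s]$ formed by a mean~$m$ and its standard error~$s$ (the symbols $m$ and $s$ are introduced here only for illustration), a method with score $(m, s)$ is set in bold alongside the best method with score $(m^\star, s^\star)$ whenever
\[
    |m^\star - m| \;<\; s^\star + s .
\]
For instance, on Kin8nm in \cref{tab:expected_ll}, VIP attains $m^\star = -0.102$ with $s^\star = 0.013$ and GFSVI attains $m = -0.136$ with $s = 0.013$, so that $|m^\star - m| = 0.034 > 0.026 = s^\star + s$, and GFSVI is not bolded on this dataset.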
%
\begin{table}[h!]
\scshape
\centering
\caption{Results for the ocean current modeling task. 
}
\label{tab:ocean_current}
\resizebox{1\linewidth}{!}{
\renewcommand{\arraystretch}{1.}
\begin{tabular}{@{}lcccc@{}}
\toprule
\multicolumn{1}{@{}l}{Metric} & GFSVI (ours) & TFSVI & VIP & GP  \\ 
\midrule
Log-like. & -6.627 $\pm$ 0.753 & -22.651 $\pm$ 2.947 & -11.631 $\pm$ 3.171 & -0.507 $\pm$ 0.000 \\
MSE & \hphantom{-}0.021 $\pm$ 0.002 & \hphantom{-}0.034 $\pm$ 0.003 & \hphantom{-}0.026 $\pm$ 0.001 & \hphantom{-}0.013 $\pm$ 0.000\\
\bottomrule
\end{tabular}
}
\end{table}

\paragraph{Ocean current modeling.}
%
We measure how well GFSVI can incorporate knowledge specified via a GP prior on real-world data by considering the problem of modeling ocean currents in the Gulf of Mexico.
We follow the setup by \citet{shalashilin2024GPocean} and use the GulfDrifters dataset \citep{lilly2021GulfDrifters} to estimate ocean currents from $20$ two-dimensional velocity vectors collected from drifter buoys.
We embed physical properties of fluid motions into the GP prior and into the neural networks by applying the Helmholtz decomposition \citep{berlinghieri2023gaussian,cinquin2024fsplaplace}.
We compare GFSVI to a GP, to TFSVI and to VIP.
More details can be found in \cref{app:sec_ocean_current_details}.
%
We find that incorporating knowledge via an informative GP prior in GFSVI improves performance over weight-space priors in TFSVI and VIP (see \cref{tab:ocean_current} and \cref{fig:ocean_current}).
However, the GP outperforms all three BNN-based methods, which suggests that the physically motivated kernel describes the fluid dynamics well enough that the additional inductive bias introduced by a neural network hurts performance rather than helping it.
In the following, we consider experiments with larger datasets (making exact GP inference computationally infeasible in many cases), and where structural prior knowledge in function space exists but is not derived from laws of nature.
%
\begin{table*}[h!]
\scshape
\caption{Out-of-distribution detection accuracy (higher is better) of evaluated methods on regression datasets. GFSVI (ours) performs competitively on OOD detection and obtains the best mean rank.}
\label{tab:ood_detect}
\resizebox{\linewidth}{!}{
\renewcommand{\arraystretch}{1.}
\begin{tabular}{@{}lcccccccccc@{}}
\toprule
\multicolumn{1}{@{}l}{Dataset} & \multicolumn{2}{c}{Function-space priors} & \multicolumn{4}{c}{Weight-space priors} & \multicolumn{3}{c}{Gaussian Processes (Gold Standards)} \\ 
\cmidrule(rl){2-3} \cmidrule(rl){4-7} \cmidrule(rl){8-10}
\multicolumn{1}{@{}l}{} & GFSVI (ours) & \multicolumn{1}{c}{FVI} & TFSVI & MFVI & VIP & Laplace & GWI & Sparse GP & GP  \\ 
\midrule
Boston & \textbf{0.893 $\pm$ 0.011} & 0.594 $\pm$ 0.024 & 0.705 $\pm$ 0.107 & 0.563 $\pm$ 0.013 & 0.628 $\pm$ 0.010 & 0.557 $\pm$ 0.009 & 0.817 $\pm$ 0.017 & 0.947 $\pm$ 0.011 & 0.952 $\pm$ 0.003 & \\
Concrete & \textbf{0.656 $\pm$ 0.016} & 0.583 $\pm$ 0.022 & 0.511 $\pm$ 0.003 & 0.605 $\pm$ 0.012 & 0.601 $\pm$ 0.024 & 0.578 $\pm$ 0.015 & 0.730 $\pm$ 0.020 & 0.776 $\pm$ 0.006 & 0.933 $\pm$ 0.004 & \\
Energy & \textbf{0.997 $\pm$ 0.002} & 0.696 $\pm$ 0.017 & \textbf{0.997 $\pm$ 0.001} & 0.678 $\pm$ 0.014 & 0.682 $\pm$ 0.037 & 0.782 $\pm$ 0.020 & 0.998 $\pm$ 0.001 & 0.998 $\pm$ 0.001 & 0.998 $\pm$ 0.001 & \\
Kin8nm & 0.588 $\pm$ 0.007 & \textbf{0.604 $\pm$ 0.023} & 0.576 $\pm$ 0.008 & 0.570 $\pm$ 0.009 & 0.563 $\pm$ 0.015 & \textbf{0.606 $\pm$ 0.009} & 0.602 $\pm$ 0.011 & 0.608 $\pm$ 0.014 & \textit{(infeasible)} & \\
Naval & \textbf{1.000 $\pm$ 0.000} & \textbf{1.000 $\pm$ 0.000} & \textbf{1.000 $\pm$ 0.000} & 0.919 $\pm$ 0.017 & 0.621 $\pm$ 0.059 & \textbf{1.000 $\pm$ 0.000} & 1.000 $\pm$ 0.000 & 1.000 $\pm$ 0.000 & \textit{(infeasible)} & \\
Power & \textbf{0.698 $\pm$ 0.006} & 0.663 $\pm$ 0.021 & 0.676 $\pm$ 0.008 & 0.636 $\pm$ 0.019 & 0.514 $\pm$ 0.004 & 0.654 $\pm$ 0.013 & 0.754 $\pm$ 0.004 & 0.717 $\pm$ 0.004 & \textit{(infeasible)} & \\
Protein & \textbf{0.860 $\pm$ 0.011} & 0.810 $\pm$ 0.022 & \textbf{0.841 $\pm$ 0.018} & 0.693 $\pm$ 0.020 & 0.549 $\pm$ 0.020 & 0.629 $\pm$ 0.013 & 0.942 $\pm$ 0.002 & 0.967 $\pm$ 0.001 & \textit{(infeasible)} & \\
Wine & 0.665 $\pm$ 0.013 & 0.517 $\pm$ 0.004 & 0.549 $\pm$ 0.015 & 0.542 $\pm$ 0.009 & \textbf{0.706 $\pm$ 0.028} & 0.531 $\pm$ 0.007 & 0.810 $\pm$ 0.008 & 0.781 $\pm$ 0.014 & 0.787 $\pm$ 0.007 & \\
Yacht & 0.616 $\pm$ 0.030 & 0.604 $\pm$ 0.025 & \textbf{0.659 $\pm$ 0.043} & \textbf{0.642 $\pm$ 0.035} & \textbf{0.688 $\pm$ 0.040} & 0.612 $\pm$ 0.024 & 0.563 $\pm$ 0.014 & 0.762 $\pm$ 0.018 & 0.787 $\pm$ 0.011 & \\
Wave & \textbf{0.975 $\pm$ 0.005} & 0.642 $\pm$ 0.004 & 0.835 $\pm$ 0.034 & 0.658 $\pm$ 0.026 & 0.500 $\pm$ 0.000 & 0.529 $\pm$ 0.005 & 0.903 $\pm$ 0.001 & 0.513 $\pm$ 0.001 & \textit{(infeasible)} & \\
Denmark & 0.521 $\pm$ 0.006 & \textbf{0.612 $\pm$ 0.008} & 0.519 $\pm$ 0.006 & 0.513 $\pm$ 0.003 & 0.500 $\pm$ 0.000 & 0.529 $\pm$ 0.008 & 0.688 $\pm$ 0.003 & 0.626 $\pm$ 0.002 & \textit{(infeasible)} & \\
\midrule
Mean rank & 1.455 & 2.364 & 1.909 & 2.909 & 3.364 & 2.909 & - & - & - \\
\bottomrule
\end{tabular}
}
\end{table*}
%
\paragraph{Regression.}
\label{sec:regression}
%
We assess the predictive performance of GFSVI on datasets from the UCI repository \citep{Dua2019UCI}.
\Cref{tab:expected_ll}, and \Cref{tab:mse} in the appendix, show expected log-likelihood and mean squared error, respectively.
We perform 5-fold cross-validation and report means and standard errors across the test folds.
See \cref{app:sec_regression_details} for more details.
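For reference, with $K = 5$ test folds and per-fold scores $a_1, \dots, a_K$ (notation introduced here for illustration), the reported mean and standard error take the usual form
\[
    \bar{a} = \frac{1}{K} \sum_{k=1}^{K} a_k ,
    \qquad
    \mathrm{SE} = \frac{1}{\sqrt{K}} \sqrt{\frac{1}{K-1} \sum_{k=1}^{K} (a_k - \bar{a})^2} ,
\]
where the use of the unbiased ($K-1$) variance estimator is an assumption of this sketch rather than a detail specified above.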
We find that GFSVI performs competitively compared to baselines and obtains the best mean rank for both metrics, matching the top-performing methods on most datasets.
In particular, we find that using GP priors in the linearized BNN with GFSVI yields improvements over the weight-space priors used in TFSVI, and that GFSVI performs slightly better than FVI despite being simpler.
Further, we find that GFSVI approximates the exact GP-posterior more accurately than FVI (see \cref{tab:var_measure_eval} and \cref{app:var_measure_eval}), and that it converges in slightly more steps than TFSVI (\cref{fig:boston_exp_ll_convergence}).
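The mean rank in \cref{tab:expected_ll} averages the per-dataset rank of a method over the $D = 11$ regression datasets.
Writing $r_{m,d}$ for the rank of method~$m$ on dataset~$d$ (notation introduced here for illustration; we assume that rank~$1$ is best and leave the treatment of ties unspecified),
\[
    \text{mean rank}(m) = \frac{1}{D} \sum_{d=1}^{D} r_{m,d} .
\]
As a consistency check, the reported value of $1.545$ for GFSVI corresponds to a rank sum of $17$, since $17/11 \approx 1.545$, and the value of $3.455$ for MFVI corresponds to $38/11$.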
%
\paragraph{Classification.}
%
We further evaluate the classification performance of our method on the MNIST \citep{lecun2010mnist} and FashionMNIST \citep{xiao2017FMNIST} image datasets.
We fit the models on a random subset of 90\% of the training set, use the remaining 10\% as validation data, and evaluate on the provided test split. 
We repeat with 5 different random seeds and report the mean and standard error of the expected log-likelihood, accuracy, and expected calibration error (ECE) in \cref{tab:classification}. 
For GFSVI, FVI, and TFSVI, we test measurement points drawn from both a uniform random (\textsc{rnd}) distribution~$\rho(\vx)$ and from \textsc{kmnist}. Details are given in \cref{app:sec_classification_details}.
We find that GFSVI performs competitively: on MNIST, it matches the expected log-likelihood of the best baseline (VIP) and slightly exceeds the accuracy of all baselines, and on FashionMNIST, it performs similarly to the best baselines.
GFSVI also yields well-calibrated models with low ECE.
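For reference, a common binned estimator of the ECE, which we state here as an assumption about the form of the metric rather than as a specification of our implementation, partitions the $N$ test predictions into bins $B_1, \dots, B_J$ according to their confidence and computes
\[
    \mathrm{ECE} = \sum_{j=1}^{J} \frac{|B_j|}{N} \, \bigl| \mathrm{acc}(B_j) - \mathrm{conf}(B_j) \bigr| ,
\]
where $\mathrm{acc}(B_j)$ is the fraction of correctly classified points in bin~$B_j$ and $\mathrm{conf}(B_j)$ is their average predicted probability for the predicted class (the symbols $N$, $J$ and $B_j$ are introduced here for illustration).
Under this form, an ECE of $0.002$, as reported for GFSVI (\textsc{rnd}) on MNIST in \cref{tab:classification}, corresponds to a weighted average gap of $0.2$ percentage points between confidence and accuracy.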
%
\paragraph{Out-of-distribution detection.}
\label{sec:ood_detection}
%
We next evaluate our method by testing if its epistemic uncertainty is predictive of out-of-distribution (OOD) data.
We consider two settings: (i) with tabular data and a Gaussian likelihood \citep{malinin2021uncertGBM}, and (ii) with image data and a categorical likelihood \citep{osawa2019practical}.
We report the accuracy of classifying OOD vs.\ in-distribution (ID) data using a (learned) threshold on the predictive uncertainty.
More details are given in \cref{app:sec_ood_details}.
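Concretely, writing $u(\vx)$ for the predictive uncertainty and $\tau$ for the threshold (notation introduced here for illustration), a threshold-based detector labels $\vx$ as OOD if $u(\vx) > \tau$, and its accuracy on $N_{\text{ID}}$ in-distribution and $N_{\text{OOD}}$ out-of-distribution points is
\[
    \mathrm{acc}(\tau) = \frac{1}{N_{\text{ID}} + N_{\text{OOD}}} \Biggl( \sum_{i=1}^{N_{\text{ID}}} \mathbf{1}\bigl[u(\vx_i^{\text{ID}}) \le \tau\bigr] + \sum_{j=1}^{N_{\text{OOD}}} \mathbf{1}\bigl[u(\vx_j^{\text{OOD}}) > \tau\bigr] \Biggr) .
\]
If both sets have equal size (an assumption of this sketch), a detector whose uncertainty carries no information about the origin of an input attains an accuracy of about $0.5$.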
In setting~(i), GFSVI performs competitively and obtains the best mean rank (\cref{tab:ood_detect}).
Likewise in setting~(ii), GFSVI strongly outperforms all baselines when using the \textsc{kmnist} measurement point distribution~$\rho(\vx)$ (\cref{fig:ood_detection_plot}, \cref{tab:classification,tab:influence_rho_ood_detection}).
We find that, with high-dimensional image data, the choice of measurement point distribution strongly influences OOD detection accuracy (see Appendix~\ref{app:influence_rho_image_data} for a discussion).
In both settings, using GP priors with GFSVI rather than weight-space priors with TFSVI is beneficial, and GFSVI also improves over FVI.
GFSVI's uncertainty is also well-calibrated under distribution shift of the input features (see \cref{app:rotated_image_data}).