\subsection{Synthetic Example}\label{subsec:synthetic_example_app}
\begin{figure}[h]
  \centering
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/posterior_approx/DNN-SGD_posterior_approx.png}
    \caption{SGD}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/posterior_approx/MCMC-SGLD_posterior_approx.png}
    \caption{MCMC}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/posterior_approx/SWAG-SGD_posterior_approx.png}
    \caption{SWAG}
  \end{subfigure}\\[1ex]
    \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/posterior_approx/VI-SGD_posterior_approx.png}
    \caption{VI}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/posterior_approx/SWAG-SABMA_posterior_approx.png}
    \caption{FP-BMA (SWAG)}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/posterior_approx/VI-SABMA_posterior_approx.png}
    \caption{FP-BMA (VI)}
  \end{subfigure}\\[1ex]
  \caption{Posterior approximation on a synthetic example. We compare how each method approximates the posterior when flat and sharp modes coexist. Unlike the other methods, the proposed FP-BMA converges to the flat mode, demonstrating its effectiveness in finding more stable solutions.}
  \label{fig:posterior_approx_all}
\end{figure}

Following \citet{li2023entropy}, we construct a loss surface from the mixture distribution $\frac{1}{2}\mathcal{N}([-2, -1]^T, 0.5I) + \frac{1}{2}\mathcal{N}([2, 1]^T, I)$ and set the initial point to $(-0.4, -0.4)$. Unlike the other SGD-based methods, FP-BMA identifies the flat mode regardless of the underlying BNN framework.
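
For concreteness, the target density can be written out as below. Treating the loss as the negative log-density is an assumption made here for illustration; it is one natural reading of the construction rather than a detail stated above.
\[
p(w) = \frac{1}{2}\,\mathcal{N}\left(w;\, [-2, -1]^T,\, 0.5I\right) + \frac{1}{2}\,\mathcal{N}\left(w;\, [2, 1]^T,\, I\right),
\qquad
\ell(w) = -\log p(w).
\]
Under this reading, the component centered at $[-2, -1]^T$ has the smaller covariance and forms the sharp mode, while the component centered at $[2, 1]^T$ forms the flat mode.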



\clearpage
\subsection{Learning From Scratch}\label{subsec:learning_from_scratch_app}
\subsubsection{FP-BMA with diverse BNN frameworks}\label{subsubsec:sabma_with_diverse_bnn_frameworks_1}
In Eq.~\ref{eq:main_loss}, FP-BMA can be applied to various BNN frameworks by choosing an empirical loss function $\ell(\cdot)$ and adjusting the parameter $\beta$. We set $\ell(\cdot)$ to the cross-entropy loss in the context of image classification. Note that FP-BMA is applied only to the normalization layers and the last layer, while all other layers are trained using SGD.

\paragraph{FP-BMA (VI)}
For VI, we follow the loss function of Eq.~\ref{eq:main_loss}.

\paragraph{FP-BMA (MCMC)}
We mainly adopt SGLD for MCMC in this work. For SGLD, we drop the KLD term in Eq.~\ref{eq:main_loss} ($\beta = 0$) and inject noise whose scale is determined by the learning rate and a temperature hyperparameter. In the first step, the adversarial posterior is computed without any noise (Eq.~\ref{eq:Delta_theta_star}). In the second step, both the noise and the adversarial posterior are used in the parameter update.
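
As a sketch of the second step, a tempered SGLD update evaluated at the perturbed parameters takes the generic form below. The step index $t$, the learning rate $\eta_{\text{FP-BMA}}$, the temperature $T$, and the noise variable $\epsilon_t$ are notation introduced here for illustration, and the noise scale $\sqrt{2\eta_{\text{FP-BMA}} T}$ assumes the standard SGLD convention; the scaling used in practice may differ by a constant factor.
\[
\theta_{t+1} = \theta_t - \eta_{\text{FP-BMA}}\, \nabla_\theta \ell(\theta)\Big|_{\theta_t + \Delta\theta_t} + \sqrt{2\,\eta_{\text{FP-BMA}}\, T}\;\epsilon_t,
\qquad
\epsilon_t \sim \mathcal{N}(0, I),
\]
where $\Delta\theta_t$ denotes the noise-free perturbation of Eq.~\ref{eq:Delta_theta_star} computed in the first step.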

\paragraph{FP-BMA (SWAG)}
SWAG updates the first and second moments along the trajectory of SWA and uses these moments to approximate the posterior with a Gaussian distribution. In Eq.~\ref{eq:main_loss}, $\beta$ is fixed to 0, and as the trajectory of SWA is optimized through FP-BMA, posterior approximation can be performed accordingly.
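
For reference, the moments can be sketched as follows, assuming the standard SWAG formulation. The snapshot count $n$, the snapshots $w_1, \ldots, w_n$ collected along the SWA trajectory, the rank $K$, and the deviation matrix $\hat{D}$ are notation introduced here for illustration.
\[
\bar{w} = \frac{1}{n}\sum_{i=1}^{n} w_i,
\qquad
\overline{w^2} = \frac{1}{n}\sum_{i=1}^{n} w_i^2,
\qquad
\Sigma_{\text{diag}} = \mathrm{diag}\left(\overline{w^2} - \bar{w}^2\right),
\]
\[
q(w) = \mathcal{N}\left(\bar{w},\; \frac{1}{2}\,\Sigma_{\text{diag}} + \frac{1}{2(K-1)}\,\hat{D}\hat{D}^T\right),
\]
where squares are taken element-wise and the columns of $\hat{D}$ are the deviations of the last $K$ snapshots from the running mean. With FP-BMA, only the way the snapshots $w_i$ are produced changes; the moment computation itself is unchanged.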




\subsubsection{Hyperparameters for Experiments}\label{subsubsec:hyperparameters_for_experiments_1}
In this section, we provide the details of the experimental setup for Section~\ref{subsec:learning_from_scratch}. Unless stated otherwise, the hyperparameter ranges, excluding the number of epochs, are shared across backbones and methods. For all experiments, the hyperparameters are selected by grid search. The best hyperparameter configuration for each baseline is summarized in Table~\ref{tab:hyperparameter_c10_scratch} and Table~\ref{tab:hyperparameter_c100_scratch}.

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% [TABLE] Hyperparameter
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{tables/hyperparameter_c10_scratch}
\input{tables/hyperparameter_c100_scratch}
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\paragraph{Stochastic Gradient Descent with Momentum (SGD)}
We adopt stochastic gradient descent with momentum as the optimizer for the DNN baseline. The learning rate schedule is fixed to cosine decay, and we train for 300 epochs. The learning rate is tuned over [1e-4, 1e-3, 1e-2].

\paragraph{Sharpness Aware Minimization (SAM)}
We set SGD with momentum as the base optimizer of SAM and use the same cosine decay learning rate scheduler. All hyperparameter ranges are shared with SGD with momentum. The additional hyperparameter $\gamma$, the radius of the perturbation ball, is searched in [1e-2, 5e-2, 1e-1].

\paragraph{Fisher SAM (FSAM)}
We set SGD with momentum as the base optimizer of FSAM and use the same cosine decay learning rate scheduler. All hyperparameter ranges are shared with SGD with momentum. The additional hyperparameter $\eta$, which regulates the Fisher impact, is searched in [1e-2, 1e-1, 1].

\paragraph{SAM as an optimal relaxation of Bayes (bSAM)} We use a cosine learning rate decay scheme. We train for 300 epochs with fixed $\beta_1$ and $\beta_2$. The hyperparameter tuning range included: learning rate in [1e-1, 3e-1, 5e-1, 8e-1, 1], weight decay in [1e-4, 5e-4, 1e-3, 1e-2], damping in [1e-1, 1e-2, 1e-3], and $\gamma$ in [1e-3, 1e-2, 5e-2, 1e-1, 5e-1]. The damping parameter stabilizes the method by adding a constant when updating the variance estimate.

\paragraph{Variational Inference (VI)} We first use MOPED to convert the DNN into a BNN. We set the prior mean and variance to 0 and 1, respectively, and the posterior mean and variance to 0 and 1e-3. We adopt reparameterization as the type of VI. The essential hyperparameter for MOPED is $\delta$, which adjusts how much of the pre-trained weights to incorporate. We search $\delta$ in [1e-3, 5e-3, 1e-2]. Moreover, we add a hyperparameter $\beta$ that balances the loss terms in VI, searched in [1e-2, 1e-1, 1].

\paragraph{MCMC} We consistently use SGLD~\citep{welling2011bayesian} for MCMC in this work, with a cyclic cosine decay learning rate scheduler. The number of cycles is searched in [2, 4], and the number of sampled models in [10, 20, 28]. We search the temperature in [1e-5, 5e-4, 1e-4, 5e-3, 1e-3, 1e-2].

\paragraph{Entropy-MCMC (E-MCMC)} We use a cosine learning rate decay scheme, annealing the learning rate to zero. We run 300 epochs. We search $\eta$ in [1e-4, 5e-3, 1e-3, 5e-2, 1e-2, 1e-1] and a system temperature $T$ in [1e-4, 5e-4, 1e-3, 5e-3, 1e-2]. Note that the $\eta$ handles flatness, and the system temperature adjusts the weight update's step size.

\paragraph{SWAG} We use a cosine learning rate decay scheme for SWAG. All hyperparameter ranges are shared with SGD with momentum. In addition, we search three SWAG-specific hyperparameters that control how DNN snapshots are captured and how statistics are calculated. First, the epoch at which SWA starts is in [161, 201], out of 300 total epochs. Second, the frequency of capturing model snapshots is in [1, 2, 3]. Third, the rank of the low-rank covariance is in [2, 3, 5, 7, 10].

\paragraph{F-SWAG} F-SWAG shares its hyperparameters with SWAG, except for the additional $\gamma$. We search $\gamma$ in [1e-2, 5e-2, 1e-1].


\paragraph{FP-BMA} For FP-BMA (VI), we use $\mathcal{N}(0, 10^{-3})$ as the prior and set $\delta$ to 1e-3 when converting the DNN into a BNN with MOPED. After obtaining the prior distribution, we search two hyperparameters: the learning rate and $\gamma$. The hyperparameter tuning range included: learning rate in [1e-3, 5e-3, 1e-2, 5e-2] and $\gamma$ in [1e-2, 5e-2, 1e-1, 5e-1]. We set the weight decay to 5e-4 for all backbones and train the model for 300 epochs with early stopping. We fix $\beta$ to 1e-8 for all experiments. For FP-BMA (MCMC), we search the learning rate, the temperature, and $\gamma$. The hyperparameter ranges are [1e-3, 5e-3, 1e-2, 5e-2] for the learning rate, [1e-4, 5e-3, 1e-3, 5e-2, 1e-2, 1e-1] for the temperature, and [5e-3, 1e-2, 5e-2, 1e-1, 5e-1] for $\gamma$. For FP-BMA (SWAG), we follow the hyperparameters of SWAG and additionally search $\gamma$ in [1e-2, 5e-2, 1e-1].






\clearpage
\subsection{Bayesian Transfer Learning}\label{subsec:few_shot_image_classification_with_bayesian_transfer_learning_app}
\subsubsection{FP-BMA with diverse BNN frameworks}\label{subsubsec:sabma_with_diverse_bnn_frameworks_2}
Diverse BNN frameworks can be adopted for Bayesian transfer learning. Specifically, there are several options for converting a pre-trained DNN into a BNN. In this work, we mainly adopt MOPED and SWAG for this conversion.

In addition, FP-BMA can be applied to various BNN frameworks by choosing an empirical loss function $\ell(\cdot)$ and adjusting the parameter $\beta$ in Eq.~\ref{eq:bayesian_transfer_learning}. We set $\ell(\cdot)$ to the cross-entropy loss in the context of image classification.

\paragraph{FP-BMA (VI)}
First, we convert the pre-trained DNN into a BNN with MOPED. We use the converted BNN both as the prior, $q_\theta^{\text{pr}}(w|\mathcal{D}^{\text{pr}})$ in Eq.~\ref{eq:bayesian_transfer_learning}, and as the initial point of the model. We train only the parameters of the normalization layers and the last layer, and freeze all others. These parameters are trained with the loss function of Eq.~\ref{eq:bayesian_transfer_learning}.


\paragraph{FP-BMA (MCMC)}
For SGLD, it is unnecessary to convert the pre-trained DNN into a BNN. Instead, we directly use the pre-trained DNN as the initialization. We drop the KLD term in Eq.~\ref{eq:bayesian_transfer_learning} ($\beta = 0$) and inject noise whose scale is determined by the learning rate and a temperature hyperparameter. In the first step, the adversarial posterior is computed without any noise (Eq.~\ref{eq:Delta_theta_star}). In the second step, both the noise and the adversarial posterior are used in the parameter update.


\paragraph{FP-BMA (SWAG)}
SWAG is another option for converting a pre-trained DNN into a BNN. Specifically, we train for a few epochs on the source or downstream dataset to obtain a BNN from the pre-trained DNN. We then set this BNN as the prior, $q_\theta^{\text{pr}}(w|\mathcal{D}^{\text{pr}})$ in Eq.~\ref{eq:bayesian_transfer_learning}. We also use the converted BNN as the initialization and train it on the downstream dataset, optimizing the loss function in Eq.~\ref{eq:bayesian_transfer_learning}.




\subsubsection{Hyperparameters for Experiments}\label{subsubsec:hyperparameters_for_experiments}
In this section, we provide the details of the experimental setup for Section~\ref{subsec:few-shot_image_classification}. Unless stated otherwise, the hyperparameter ranges, excluding the number of epochs, are shared across backbones and methods.

First, we provide remarks for each baseline method, followed by the tables of hyperparameter configurations for each downstream dataset and baseline. For all experiments, the hyperparameters are selected by grid search. The best hyperparameter configuration for each baseline is summarized in Table~\ref{tab:hyperparameter_c10} and Table~\ref{tab:hyperparameter_c100}. We ran all experiments on GeForce RTX 3090 and NVIDIA RTX A6000 GPUs, with 24,576 MB and 49,140 MB of GPU memory, respectively.

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% [TABLE] Hyperparameter
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{tables/hyperparameter_c10}
\input{tables/hyperparameter_c100}
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\paragraph{Stochastic Gradient Descent with Momentum (SGD)}
We adopt stochastic gradient descent with momentum as the optimizer for the DNN baseline. The learning rate schedule is fixed to cosine decay with a warmup length of 10. We tested [100, 150] epochs and found 100 epochs to be the best option. In all experiments, we set the momentum to 0.9. The hyperparameter tuning range included learning rate in [1e-4, 1e-3, 1e-2] and weight decay in [1e-4, 5e-4, 1e-3, 1e-2].

\paragraph{Sharpness Aware Minimization (SAM)}
We set SGD with momentum as the base optimizer of SAM and use the same cosine decay learning rate scheduler. All hyperparameter ranges are shared with SGD with momentum. The additional hyperparameter $\gamma$, the radius of the perturbation ball, is searched in [1e-2, 5e-2, 1e-1].

\paragraph{Fisher SAM (FSAM)}
We set SGD with momentum as the base optimizer of FSAM and use the same cosine decay learning rate scheduler. All hyperparameter ranges are shared with SGD with momentum. The additional hyperparameter $\eta$, which regulates the Fisher impact, is searched in [1e-2, 1e-1, 1].

\paragraph{SAM as an optimal relaxation of Bayes (bSAM)} We use a cosine learning rate decay scheme, annealing the learning rate to zero. We fine-tune pre-trained models for 150 epochs with fixed $\beta_1$ and $\beta_2$. The hyperparameter tuning range included: learning rate in [1e-3, 1e-2, 5e-2, 1e-1, 0.25, 0.5, 1], weight decay in [1e-3, 1e-2, 1e-1], damping in [1e-3, 1e-2, 1e-1], noise scaling parameter in [1e-4, 1e-3, 1e-2, 1e-1], and $\gamma$ in [1e-3, 1e-2, 5e-2, 1e-1]. The damping parameter stabilizes the method by adding a constant when updating the variance estimate. Since bSAM relies on the number of samples to scale the prior, we introduce an additional noise scaling parameter to bridge the gap between the two experimental settings: bSAM assumes training from scratch, whereas our setting is few-shot fine-tuning of a pre-trained model. We multiply the variance of the Gaussian noise by the noise scaling parameter to impose a strong prior that reflects the pre-trained model.

\paragraph{Model Priors with Empirical Bayes using DNN (MOPED)} MOPED serves as a baseline for Bayesian transfer learning. It takes a pre-trained DNN and transforms it into a mean-field variational inference (MFVI) model. We set the prior mean and variance to 0 and 1, respectively, and the posterior mean and variance to 0 and 1e-3. We adopt reparameterization as the type of VI. The essential hyperparameter for MOPED is $\delta$, which adjusts how much of the pre-trained weights to incorporate. We search $\delta$ in [5e-2, 1e-1, 2e-1]. Moreover, we add a hyperparameter $\beta$ that balances the loss terms in VI, searched in [1e-2, 1e-1, 1].

\paragraph{MCMC} We consistently use SGLD~\citep{welling2011bayesian} for MCMC in this work, with a cyclic cosine decay learning rate scheduler. The number of cycles is searched in [2, 4], and the number of sampled models in [10, 20, 28]. We search the temperature in [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1].

\paragraph{Pre-train Your Loss (PTL)} Both backbones, ResNet18 and ViT-B/16, are fine-tuned with a classification head for the target task, using a prior distribution learned with SWAG on the ImageNet 1k dataset using SGD. First, to generate the prior distribution on the source task, ImageNet 1k, we tune the number of pre-training epochs in [2, 3, 5, 15, 30]. The learning rate is 0.1, and the rank of the low-rank covariance approximation is 5. Second, in the downstream task, the fine-tuning optimizer is SGLD with a cosine learning rate schedule, sampling 30 models over 5 cycles. The hyperparameter tuning range included: learning rate in [1e-4, 1e-3, 1e-2, 5e-2, 6e-2, 1e-1, 5e-1], weight decay in [1e-4, 1e-3, 1e-2, 1e-1], and prior scale in [1e+4, 1e+5, 1e+6]. Prior scaling in the downstream task reflects the mismatch between the pre-training and downstream tasks and adds coverage to parameter settings that might be consistent with the downstream task. Training was conducted over 150 epochs; the tuning range of the fine-tuning epochs is [100, 150, 200, 300, 1000].


\paragraph{Entropy-MCMC (E-MCMC)} We use a cosine learning rate decay scheme, annealing the learning rate to zero. For ResNet18, we set the range of the hyperparameter sweep around the best hyperparameters reported for E-MCMC: learning rate in [5e-3, 5e-2, 5e-1], weight decay in [1e-4, 1e-3, 1e-2], $\eta$ in [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 4e-4, 5e-3, 8e-3, 1e-2] and a system temperature $T$ in [1e-5, 1e-4, 1e-3]. For ViT-B/16, we performed a more extensive exploration of the hyperparameter space, since its mechanism differs from that of the CNN family and its best hyperparameters may not lie near those of ResNet18: learning rate in [1e-3, 5e-3, 1e-2, 5e-2, 5e-1], weight decay in [1e-5, 1e-4, 5e-4, 1e-3, 1e-2, 5e-2], $\eta$ in [5e-7, 1e-6, 5e-6, 5e-5, 1e-4, 4e-4, 5e-4, 1e-3, 8e-3, 1e-2, 1e-1] and a system temperature $T$ in [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 1e-3, 1e-2, 1e-1]. We fine-tuned pre-trained models for 150 epochs. Note that $\eta$ controls flatness, and the system temperature adjusts the step size of the weight update.


\paragraph{SWAG} We use a cosine learning rate decay scheme for SWAG. All hyperparameter ranges are shared with SGD with momentum. In addition, we search three SWAG-specific hyperparameters that control how DNN snapshots are captured and how statistics are calculated. First, the epoch at which SWA starts is in [51, 76, 101], with the total number of epochs in [100, 150]. Second, the frequency of capturing model snapshots is in [1, 2, 3]. Third, the rank of the low-rank covariance is in [2, 3, 5, 7, 10].



\paragraph{F-SWAG} F-SWAG shares its hyperparameters with SWAG, except for the additional $\gamma$. We search $\gamma$ in [1e-2, 5e-2, 1e-1].



\paragraph{FP-BMA} For FP-BMA (SWAG), we train SWAG on the source task, ImageNet 1k, to obtain the prior distribution, following the pre-training protocol of PTL. When MOPED is used to obtain the prior distribution, no additional training step is needed. For FP-BMA (VI), we set $\delta$ to 0.05 for MOPED and convert the DNN into a BNN. For FP-BMA (MCMC), we use the pre-trained weights as the initialization and run the experiments directly. After obtaining the prior distribution, we search three hyperparameters: the learning rate, $\gamma$, and $\alpha$. The hyperparameter tuning range included: learning rate in [1e-3, 5e-3, 1e-2, 5e-2], $\gamma$ in [5e-3, 8e-3, 1e-2, 5e-2, 1e-1, 5e-1, 7e-1], and $\alpha$ in [1e-6, 1e-5, 1e-4, 1e-3]. We set the weight decay to 5e-4 for all backbones and train the model for 150 epochs with early stopping. We fix $\beta$ to 1e-8 for all experiments.




\subsection{Algorithm of FP-BMA}\label{subsec:algorith_of_fpbma}
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Algorithm of FP-BMA
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{algorithm}
\caption{FP-BMA with Bayesian Transfer Learning}\label{alg:FP-BMA}
\begin{algorithmic}
\Require Variational parameter $\theta$, Neighborhood size $\gamma$, Epochs $E$, and Learning rate $\eta_{\text{FP-BMA}}$ 
\State 1) Load pre-trained DNN
\State 2) Make pre-trained DNN model into BNN $q_\theta^{\text{pr}}(w|\mathcal{D}^{\text{pr}})$ and set as prior
\For{$t = 1, 2, \ldots, E$}
    \State 3-1) $w \sim q_\theta(w|\mathcal{D}^{\text{ft}})$\Comment{Sample weight from posterior}
    \State 3-2) Forward pass and calculate the loss $\ell(\theta)$ with the sampled $w$
    \State 3-3) Backward pass and compute $\nabla_\theta \log q_\theta (w|\mathcal{D})$
    \State 3-4) Compute $F_\theta(\theta)^{-1} = \frac{\nabla_\theta \log q_\theta (w|\mathcal{D}) \nabla_\theta \log q_\theta (w|\mathcal{D})^T}{\| \nabla_\theta \log q_\theta(w|\mathcal{D})\|^4}$
    \State 3-5) Compute the perturbation $\Delta\theta_{\text{FP-BMA}} = \gamma \frac{F_\theta(\theta)^{-1} \nabla_\theta \ell(\theta)}{\sqrt{\nabla_\theta \ell(\theta)^T F_\theta(\theta)^{-1} \nabla_\theta \ell(\theta)}}$
    \State 3-6) Compute the gradient approximation for FP-BMA $\nabla_\theta \ell^\gamma_{\text{FP-BMA}} (\theta) = \left.\frac{\partial \ell(\theta)}{\partial \theta} \right|_{\theta + \Delta\theta_{\text{FP-BMA}}}$
    \State 3-7) Update $\theta \leftarrow \theta - \eta_{\text{FP-BMA}}\nabla_\theta \ell^\gamma_{\text{FP-BMA}}(\theta)$
\EndFor
\end{algorithmic}
\end{algorithm}
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The training procedure of FP-BMA with Bayesian transfer learning is given in Algorithm~\ref{alg:FP-BMA}. First, we load a model pre-trained on the source task. Note that the pre-trained model does not have to be a BNN; a pre-trained DNN, which is typically easier to find than a pre-trained BNN, can be used instead. Second, we convert the loaded DNN into a BNN on the source or downstream task. Any BNN framework can be adopted for this conversion, and this step can be skipped if a pre-trained BNN was loaded in the first step. Third, we train the subnetwork of the converted BNN with the proposed flat-seeking optimizer, which allows the model to converge to flat minima efficiently.
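
Steps 3-4) and 3-5) of Algorithm~\ref{alg:FP-BMA} can be combined so that $F_\theta(\theta)^{-1}$ is never formed explicitly. As a short worked derivation, write $g = \nabla_\theta \log q_\theta(w|\mathcal{D})$ and $v = \nabla_\theta \ell(\theta)$ (shorthand introduced here only for readability) and assume $g^T v \neq 0$. Then
\[
F_\theta(\theta)^{-1} v = \frac{g\,(g^T v)}{\|g\|^4},
\qquad
v^T F_\theta(\theta)^{-1} v = \frac{(g^T v)^2}{\|g\|^4},
\]
and therefore
\[
\Delta\theta_{\text{FP-BMA}}
= \gamma\, \frac{g\,(g^T v)/\|g\|^4}{|g^T v|/\|g\|^2}
= \gamma\, \mathrm{sign}(g^T v)\, \frac{g}{\|g\|^2}.
\]
Under this rank-one form, the perturbation requires only one inner product and one rescaling of $g$, so its cost is linear in the number of trainable parameters.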





\subsection{Efficiency of FP-BMA}\label{subsec:efficieny_of_fpbma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% [TABLE] Efficiency
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{wraptable}{r}{0.4\textwidth}
\centering
\vspace{-4.5em}
\begin{tabular}{lccc}
\toprule
\textbf{Method} & \textbf{Time} & \textbf{Wall} & \textbf{Mem.} \\
                & \textbf{Comp.} & \textbf{Clock} & \textbf{Comp.} \\
\midrule
SGD & $O(p)$ & 2.78s & $O(p)$ \\
SAM & $O(2p)$ & 4.58s & $O(p)$ \\
FSAM & $O(2p)$ & 4.65s & $O(2p)$ \\
bSAM & $O(2p)$ & 4.62s & $O(3p)$ \\ \midrule
MF VI & $O(2p)$ & 4.09s & $O(2p)$ \\
FF VI & $O(p^2)$ & -- & $O(p^2)$ \\ \midrule
MCMC & $O(p)$ & 2.95s & $O(Mp)$ \\
E-MCMC & $O(2p)$ & 5.13s & $O(Mp)$ \\ \midrule
SWAG & $O(p)$ & 7.89s & $O(Kp)$ \\
F-SWAG & $O(2p)$ & 11.48s & $O(Kp)$ \\
\textbf{FP-BMA} & $O(2p)$ & 6.21s & $O(Kp_1)$ \\
\bottomrule
\end{tabular}
\caption{Time and memory complexity for all methods.}
\label{tab:efficiency}
\vspace{-9em}
\end{wraptable}


Table~\ref{tab:efficiency} summarizes the theoretical time complexity, per-epoch wall-clock time, and memory complexity of each method under a unified experimental setting. The evaluation was conducted on ResNet-18 with CIFAR-10 10-shot classification, with AMP (automatic mixed precision) enabled for a fair efficiency comparison.

\textbf{Notation:}
\begin{itemize}
    \item $p$: total number of model parameters
    \item $p_1$: number of trainable parameters used in FP-BMA subnetwork ($p_1 \ll p$)
    \item $M$: number of MCMC samples
    \item $K$: rank for low-rank approximations (e.g., in SWAG or FP-BMA)
\end{itemize}

To ensure practical efficiency, FP-BMA is implemented with a subnetwork strategy and inverse vector product approximation (as shown in Algorithm~\ref{alg:FP-BMA}). These design choices allow us to limit both runtime and memory overhead, which we found to be comparable to standard baselines.







% \clearpage
\subsection{Fine-Grained Image Classification}\label{subsec:fine-grained_image_classification_app}
In addition to classification accuracy, FP-BMA outperforms the baseline in terms of NLL, indicating that FP-BMA effectively quantifies uncertainty.
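
For reference, NLL under Bayesian model averaging is commonly estimated as below. The test-set size $N$, the number of posterior samples $S$, and the samples $w_1, \ldots, w_S$ are notation introduced here, and this Monte Carlo form is the usual convention rather than a definition taken from the main text.
\[
\mathrm{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \log \left( \frac{1}{S} \sum_{s=1}^{S} p(y_i \mid x_i, w_s) \right),
\qquad
w_s \sim q_\theta(w|\mathcal{D}^{\text{ft}}).
\]
A lower value means that the averaged predictive distribution assigns higher probability to the correct labels.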
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% [TABLE] Fine-grained
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{tables/fine_grained_nll.tex}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%



% % \newpage
\subsection{Performance under Distribution shift}\label{subsec:performance_under_distribution_shift_app}
We adopt the corrupted datasets CIFAR10-C and CIFAR100-C to test robustness under distribution shift. These datasets apply corruptions to the CIFAR10/100 test sets so that the test distribution shifts further away from the training distribution. They contain 19 corruption types, ranging from changes in brightness or contrast to added Gaussian noise. The severity level indicates the strength of the corruption and is expressed as an integer from 1 to 5, where a higher number means a stronger corruption. As shown in Figure~\ref{fig:severity_nll}, our method remains relatively robust under distribution shift, even as the severity increases.

\begin{figure}[h]
    \centering
    \includegraphics[width=1\textwidth]{figure/severity_nll_plot.png}
    \caption{NLL performance of ResNet 18 and ViT-B/16 on corrupted CIFAR10 and CIFAR100, respectively \citep{hendrycks2019benchmarking}.}
    \label{fig:severity_nll}
\end{figure}


We also provide the detailed results of three repeated experiments with corrupted sets.

\input{tables/corrupted_set.tex}


\clearpage
\subsection{Comparison with Diverse Baselines and Inference Methods}\label{subsec:diverse_baselines}
To further validate the broad applicability and effectiveness of FP-BMA, we compare it with a variety of inference algorithms and baselines, including MCMC-based, multi-modal, and advanced VI-based methods. All results are reported for CIFAR-10 (10-shot) with ResNet-18.

\begin{table}[h]
    \centering
    \begin{tabular}{lccc}
        \toprule
        Method & Acc (\%) $\uparrow$ & ECE $\downarrow$ & NLL $\downarrow$ \\
        \midrule
        SGHMC & 55.41$_{\pm 0.88}$ & 0.112$_{\pm 0.009}$ & 1.371$_{\pm 0.025}$ \\
        \textbf{SGHMC + FP-BMA (Ours)} & \textbf{56.41}$_{\pm 1.75}$ & \textbf{0.055}$_{\pm 0.008}$ & \textbf{1.276}$_{\pm 0.021}$ \\
        \midrule
        MoLA & 65.77 & 0.045 & 1.058 \\
        \textbf{MoLA + FP-BMA (Ours)} & \textbf{66.77} & 0.063 & \textbf{0.998} \\
        \midrule
        IVON & 56.23$_{\pm 1.01}$ & 0.023$_{\pm 0.004}$ & 1.262$_{\pm 0.037}$ \\
        \textbf{FP-BMA (VI)} & \textbf{64.98}$_{\pm 1.37}$ & \textbf{0.016}$_{\pm 0.007}$ & \textbf{0.997}$_{\pm 0.046}$ \\
        \bottomrule
    \end{tabular}
    \caption{Comparison of FP-BMA with various inference baselines. All results are based on CIFAR-10 (10-shot) and ResNet-18.}
\end{table}

The table above shows that \textbf{FP-BMA improves accuracy and NLL across a range of inference backbones and posterior structures, and improves ECE in two of the three settings:}
\begin{itemize}
    \item When applied on top of \textbf{SGHMC}~\citep{chen2014stochastic} (a standard MCMC method), FP-BMA yields clear improvements in accuracy, ECE, and NLL. This shows that our approach is compatible with and beneficial to MCMC-based inference, extending its utility beyond VI-based methods.
    \item In a multi-modal posterior setting (\textbf{MoLA}~\citep{eschenhagen2021mixtures}), FP-BMA remains effective, improving accuracy and NLL, although ECE increases from 0.045 to 0.063. The gains are also less pronounced than in unimodal cases, suggesting that further extension of FP-BMA for multi-modal posteriors could be fruitful.
    \item Compared to \textbf{IVON}~\citep{shen2024variational} (which leverages efficient second-order optimization but does not explicitly encourage flatness), FP-BMA achieves significantly better results on all metrics. This highlights the effectiveness of explicitly promoting posterior flatness in Bayesian model averaging.
\end{itemize}

Overall, these results support the broad applicability and complementary nature of FP-BMA, demonstrating its value as a general-purpose enhancement for Bayesian inference, regardless of the underlying approximation strategy.
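
For reference, ECE is commonly computed with the binned estimator below. The number of bins $B$, the bins $B_b$ of test predictions grouped by confidence, and the test-set size $N$ are notation introduced here; this is the usual convention and the binning used for the table is not specified in this section.
\[
\mathrm{ECE} = \sum_{b=1}^{B} \frac{|B_b|}{N}\, \left| \mathrm{acc}(B_b) - \mathrm{conf}(B_b) \right|,
\]
where $\mathrm{acc}(B_b)$ is the accuracy and $\mathrm{conf}(B_b)$ is the mean predicted confidence within bin $B_b$. A lower value indicates that predicted confidence tracks empirical accuracy more closely.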






\clearpage
\subsection{Loss Surface Of Sampled Model}\label{subsec:loss_surface_replic}
\begin{figure}[h]
  \centering
  \begin{subfigure}{0.8\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/loss_surface/loss_surface_Blues_seed1.png}
    \caption{seed 1}
  \end{subfigure}\\%
    \begin{subfigure}{0.8\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/loss_surface/loss_surface_Blues_seed2.png}
    \caption{seed 2}
  \end{subfigure}\\%
  \begin{subfigure}{0.8\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/loss_surface/loss_surface_Blues_seed3.png}
    \caption{seed 3}
  \end{subfigure}\\%
  \begin{subfigure}{0.8\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/loss_surface/loss_surface_Blues_seed4.png}
    \caption{seed 4}
  \end{subfigure}\\[1ex]
  \caption{Four instances of sampled weights, including (b) as presented in Figure \ref{fig:loss_surface}. Across all plots, it is consistently observed that FP-BMA converges to a flatter loss surface compared to PTL.
}
\label{fig:loss_surfaces_seeds}
\end{figure}

As in Figure~\ref{fig:loss_surface}, we sample four sets of model parameters from the posterior trained on CIFAR10 with RN18. The results show a consistent trend toward flatness for FP-BMA. In each row of Figure~\ref{fig:loss_surfaces_seeds}, the leftmost panel is a 3D surface plot of the loss, which shows that the FP-BMA model has a flatter loss surface than the PTL model. The second panel compresses the information along a diagonal plane into a 1D scatter plot. This view reveals areas obscured in the 3D plot and highlights that FP-BMA maintains a considerably flatter and lower-loss landscape. The third and fourth panels show the loss surface as 2D contour plots, in which the lowest-loss region is visibly larger for FP-BMA than for PTL.

