\subsection{Synthetic Example}\label{subsec:synthetic_example}
We examine whether the proposed FP-BMA can capture a flat posterior mode when sharp and flat minima coexist. To this end, we consider a synthetic dataset generated from a true posterior that has one flat mode and one sharp mode, as depicted in Figure~\ref{fig:posterior_approx}. This controlled setting allows us to directly observe the optimizer's preference for flat versus sharp regions in the loss landscape, isolating the effect of posterior flatness from other confounding factors. We then estimate the posterior using the proposed FP-BMA loss with SWAG. For comparison, we consider the following baselines for posterior estimation: SGD, MCMC, SWAG, and VI. Figure~\ref{fig:posterior_approx} shows that MCMC, SWAG, and VI place the posterior at the sharp mode. In contrast, the proposed FP-BMA captures the flat mode, demonstrating its effectiveness in identifying solutions with better generalization potential. We provide additional results in Appendix~\ref{subsec:synthetic_example_app}.



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% [TABLE] Few-shot Image Classification
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{tables/few-shot_rn18_vitb16.tex}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% [TABLE] Fine-grained
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{tables/few-shot_fgvc.tex}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\subsection{Learning from Scratch}\label{subsec:learning_from_scratch}
We verify the effectiveness of FP-BMA in improving the performance of BNNs trained from scratch. Specifically, we use Bayesian ResNet18 and a modified ViT-B/16$^{\dagger}$~\citep{dosovitskiy2020image, liu2021efficient, zhu2023understanding} on CIFAR10 and CIFAR100. We adopt the modified ViT-B/16$^{\dagger}$ to address the underfitting issue of ViTs on small datasets. Due to the computational cost of large-scale models, we place variational distributions only on the parameters of the normalization layers and the last layer. We then train these variational parameters using approximate Bayesian inference (VI, MCMC, and SWAG) with the gradient $\nabla_\theta \ell^\gamma_{\text{FP-BMA}}(\theta)$ in Eq.~\ref{eq:update_sabma}, while updating the remaining parameters using the gradient $\nabla_\theta \ell(\theta)$. This setup allows us to assess the benefits of FP-BMA in both convolutional and transformer-based architectures under realistic training constraints.
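
Schematically, the split update can be written as follows. The notation here is an illustrative assumption and not the exact form of Eq.~\ref{eq:update_sabma}: $\theta_{\text{B}}$ denotes the parameters that carry variational distributions (normalization and last layers), $\theta_{\text{D}}$ denotes the remaining deterministic parameters, and $\eta$ is a generic step size.
\[
\theta_{\text{B}} \leftarrow \theta_{\text{B}} - \eta \, \nabla_{\theta_{\text{B}}} \ell^\gamma_{\text{FP-BMA}}(\theta),
\qquad
\theta_{\text{D}} \leftarrow \theta_{\text{D}} - \eta \, \nabla_{\theta_{\text{D}}} \ell(\theta).
\]
The plain gradient step is only a placeholder for the update rule of the chosen inference method (VI, MCMC, or SWAG).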

For comparison, we consider SGD as well as SAM~\citep{foret2020sharpness} and FSAM~\citep{kim2022fisher}, which seek flat minima in DNNs. For the training of BNNs, we consider SWAG, VI, F-SWAG~\citep{nguyen2023flat}, bSAM~\citep{mollenhoff2022sam}, and E-MCMC~\citep{li2023entropy}, which utilizes SGLD. For a fair comparison, we use the same BNN architecture as for FP-BMA. All baseline methods are tuned with respect to their key hyperparameters, and the detailed hyperparameter configurations for each baseline are provided in Appendix~\ref{subsubsec:hyperparameters_for_experiments_1}.

Table~\ref{tab:scratch_r18_vitb16} reports the generalization performance in terms of accuracy (ACC), ECE, and NLL. FP-BMA consistently improves performance when integrated with VI, MCMC, and SWAG. FP-BMA also outperforms the other baselines, namely SGD, SAM, FSAM, and bSAM. Additional experimental details are provided in Appendix~\ref{subsec:learning_from_scratch_app}.
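
For reference, the two uncertainty metrics follow their standard definitions; the symbols below are introduced only for this sketch. With $N$ test pairs $(x_i, y_i)$ and $S$ posterior samples $\theta_s$, the BMA predictive is approximated by $\hat{p}(y \mid x) = \frac{1}{S}\sum_{s=1}^{S} p(y \mid x, \theta_s)$, and
\[
\text{NLL} = -\frac{1}{N}\sum_{i=1}^{N} \log \hat{p}(y_i \mid x_i),
\qquad
\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N}\,\bigl|\text{acc}(B_m) - \text{conf}(B_m)\bigr|,
\]
where the test predictions are grouped into $M$ confidence bins $B_m$, and $\text{acc}(B_m)$ and $\text{conf}(B_m)$ are the accuracy and mean confidence within bin $B_m$. The values of $S$ and $M$ are left unspecified here. Lower is better for both metrics.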




\subsection{Bayesian Transfer Learning}\label{subsec:few-shot_image_classification}


\paragraph{Fine-tuning on CIFAR10 and CIFAR100}
We validate the effectiveness of FP-BMA on a transfer learning task. We first adopt RN18 and ViT-B/16 pre-trained on ImageNet (IN) 1K~\citep{russakovsky2015imagenet} as backbones. The pre-trained models are fine-tuned on 10-shot CIFAR10 and CIFAR100, i.e., using 10 data instances per class.

For comparison, we consider the following Bayesian transfer learning methods: MOPED~\citep{krishnan2020specifying} and Pre-Train Your Loss (PTL)~\citep{shwartz2022pre}. We describe additional configurations in Appendix~\ref{subsec:few_shot_image_classification_with_bayesian_transfer_learning_app}.

Table~\ref{tab:r18_vitb16} shows that FP-BMA combined with diverse BNN frameworks consistently outperforms existing baselines in terms of both accuracy and uncertainty quantification. Unlike when learning from scratch, FP-BMA (VI) outperforms FP-BMA (SWAG) in few-shot image classification tasks. This can be attributed to the nature of few-shot tasks, where VI, which learns only a diagonal covariance, is less prone to underfitting given the limited amount of data.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% [TABLE] IN variants
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{tables/in_variants.tex}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Acc of C10/100C on RN18/ViT-B/16
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure*}[ht]
\centering
\includegraphics[width=\textwidth]{figure/severity_plot.png}
\caption{Accuracy under distribution shift. We evaluate the accuracy of RN18 and ViT-B/16 models trained on 10-shot CIFAR10 and CIFAR100 across all severity levels of CIFAR10C and CIFAR100C. FP-BMA consistently outperforms all baseline methods across all levels of corruption.}
\label{fig:distribution_shift}
\end{figure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\paragraph{Fine-tuning on fine-grained image classification tasks}
Furthermore, we confirm the effectiveness of FP-BMA on general fine-grained image classification tasks, including EuroSAT~\citep{helber2019eurosat}, Flowers102~\citep{nilsback2008automated}, Pets~\citep{parkhi2012cats}, and UCF101~\citep{soomro2012ucf101}. All experiments use a 16-shot setting across all datasets. From this point forward, we perform all experiments using FP-BMA with SWAG only.

Table~\ref{tab:fine-grained} shows that FP-BMA achieves the best accuracy. Table~\ref{tab:fine-grained_nll} (Appendix~\ref{subsec:fine-grained_image_classification_app}) shows that FP-BMA also achieves the best NLL. This suggests that seeking a flat posterior during fine-tuning is effective in improving the performance of Bayesian transfer learning.


\paragraph{Fine-tuning with CLIP}
We also show the effectiveness of FP-BMA on pre-trained vision-language models. We fine-tune only the last layer of the CLIP visual encoder on the IN 1K 16-shot dataset. Then, we evaluate the trained model on IN and its variants (IN-V2~\citep{recht2019imagenet}, IN-R~\citep{hendrycks2021many}, IN-A~\citep{hendrycks2021natural}, and IN-S~\citep{wang2019learning}), following the protocols outlined in~\citet{radford2021learning, zhu2023enhancing}.

Table~\ref{tab:IN_variants} shows that FP-BMA outperforms the baselines on IN. FP-BMA also achieves superior or comparable accuracy on the out-of-distribution variants, indicating its robustness.




\subsection{Robustness on Distribution Shift}\label{subsec:robustness_on_distribution_shift}
To demonstrate the robustness of FP-BMA, we evaluate the models trained on 10-shot CIFAR10 and CIFAR100 on the corrupted datasets CIFAR10C and CIFAR100C~\citep{hendrycks2019benchmarking}. These benchmarks simulate a wide variety of real-world corruptions, including noise, blur, weather, and digital effects, thereby providing a comprehensive testbed for evaluating model reliability under distribution shift.

Figure~\ref{fig:distribution_shift} presents the accuracy on the corrupted datasets CIFAR10C and CIFAR100C~\citep{hendrycks2019benchmarking}, showing that FP-BMA outperforms the baselines across all corruption severity levels. FP-BMA also consistently outperforms all baselines in NLL, as shown in Figure~\ref{fig:severity_nll}. Detailed results are provided in Appendix~\ref{subsec:performance_under_distribution_shift_app}.

Together, the results on IN variants in Table~\ref{tab:IN_variants} and on the corrupted datasets in Figure~\ref{fig:distribution_shift} indicate that the Flat Posterior-aware Bayesian Transfer Learning scheme with FP-BMA improves the robustness of trained BNNs under distribution shift.






\subsection{Flatness Analysis}\label{subsec:flatness_analysis}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Figure: Flatness Analysis
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure*}[ht]
\captionsetup{skip=0pt}
\centering
\includegraphics[width=\textwidth]{figure/loss_surface/loss_surface_Blues_seed2.png} 
\caption{Comparison of the loss surfaces of FP-BMA (grey) and PTL (light blue) models. FP-BMA places the posterior on a lower and flatter loss surface than PTL.}
\label{fig:loss_surface}
\end{figure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Table: Hessian
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{tables/hessian.tex}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

We analyze whether FP-BMA encourages the posterior of BNNs to lie in a flatter loss basin. Using ResNet18 trained on 10-shot CIFAR10, we compare the loss surfaces obtained from weight samples of the approximate posteriors of FP-BMA and PTL, and we compare the Hessian eigenvalues of the resulting models.

Figure~\ref{fig:loss_surface} presents different views of the loss surface computed from sampled weights of FP-BMA and PTL. This result confirms that the posterior of FP-BMA is placed on a flatter loss basin with lower loss. Additional results and the protocol to visualize the loss basin are provided in Appendix~\ref{subsec:loss_surface_replic}.

Table~\ref{tab:hessian} compares the Hessian eigenvalues $\lambda_i$ of the model (Eq.~\ref{eq:bma_hessian}), where $\lambda_1$ and $\lambda_5$ denote the largest and the fifth-largest eigenvalues, respectively. FP-BMA achieves the lowest values among all methods, implying that its posterior is formed on the flattest local surface. This further supports our empirical observation that FP-BMA enhances generalization by encouraging a flatter posterior distribution.
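
To make the link between eigenvalues and flatness explicit, consider a generic sketch in which $H$ denotes a symmetric Hessian of the loss at $\theta$ with eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots$; this notation is assumed for illustration, and the exact Hessian used for Table~\ref{tab:hessian} is the one in Eq.~\ref{eq:bma_hessian}. Near a stationary point, where $\nabla_\theta \ell(\theta) \approx 0$, a second-order expansion gives
\[
\ell(\theta + \delta) - \ell(\theta) \approx \tfrac{1}{2}\,\delta^\top H \delta \leq \tfrac{1}{2}\,\lambda_1 \|\delta\|_2^2 ,
\]
so a smaller $\lambda_1$ bounds how quickly the loss can rise under a perturbation $\delta$ of fixed norm.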