\subsection{Details About the Hessian Eigenvalue of the Loss with BMA}\label{subsec:details_about_eigenvalue_of_bnn_hessian}
\begin{figure}[h]
  \centering
  \begin{subfigure}{0.6\textwidth}
    \centering
    \includegraphics[height=5cm]{figure/flatness_description.png}
    \captionsetup{justification=centering}
    \caption{Flatness of BMA}
    \label{fig:flatness_description}
  \end{subfigure}%
  \begin{subfigure}{0.4\textwidth}
    \centering
    \includegraphics[height=4.7cm]{figure/eigenvalue_description.png}
    \captionsetup{justification=centering}
    \caption{Hessian Eigenvalue of loss}
    \label{fig:eigenvalue_description}
  \end{subfigure}
  \caption{Illustration of the flatness of BMA and the Hessian eigenvalues of the loss. (a) shows how flatness is measured in BNNs: we measure the flatness of each sampled set of model weights and then ensemble these measurements. (b) shows how the Hessian eigenvalues of the loss correspond to flatness: directions of steep curvature (sharp minima) have larger eigenvalues, while directions of gentle curvature (flat minima) have smaller eigenvalues. Based on this, we measure flatness using the maximal eigenvalue of the Hessian at the minimum.}
  \label{fig:description}
\end{figure}

To measure the flatness of BNNs and compare it with that of DNNs, we introduce a new metric specifically designed for this study. Unlike DNNs, where model parameters are typically treated as point estimates, BNNs represent model parameters as random variables, which calls for a dedicated approach to measuring flatness. As shown in Figure~\ref{fig:eigenvalue_description}, the maximal eigenvalue of the Hessian of the loss function is commonly used to evaluate flatness quantitatively in DNNs~\citep{keskar2016large, foret2020sharpness, jastrzebski2020break}. To assess flatness in BNNs, we follow the BMA protocol. BMA samples model weights from the approximate posterior, computes the outputs of the individual sampled models, and ensembles these outputs, as shown in Figure~\ref{fig:flatness_description}. Analogously, we measure the flatness of each sampled set of model weights and then ensemble these measurements to obtain a single metric.
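
The link between eigenvalues and flatness, and the ensembling step, can be sketched as follows. The notation here is introduced only for illustration and assumes uniform averaging over samples; the definition actually used in the experiments is the one in Eq.~\ref{eq:bma_hessian}. Let $\mathcal{L}$ denote the loss and $H = \nabla_w^2 \mathcal{L}(w^\ast)$ its Hessian at a minimum $w^\ast$, where the gradient vanishes. For a unit-norm eigenvector $v_i$ of $H$ with eigenvalue $\lambda_i$, a second-order Taylor expansion gives
\[
  \mathcal{L}(w^\ast + \epsilon v_i) \approx \mathcal{L}(w^\ast) + \frac{\epsilon^2}{2}\, v_i^\top H v_i = \mathcal{L}(w^\ast) + \frac{\epsilon^2}{2}\, \lambda_i ,
\]
so the loss rises fastest along the direction of the largest eigenvalue $\lambda_1$, and a smaller $\lambda_1$ indicates a flatter minimum. Given weights $w_1, \dots, w_M$ sampled from the approximate posterior $q(w)$, BMA averages the per-sample predictive distributions,
\[
  p(y \mid x, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} p(y \mid x, w_m), \qquad w_m \sim q(w),
\]
and, under the uniform-averaging assumption above, the corresponding flatness measure would be
\[
  \bar{\lambda}_1 = \frac{1}{M} \sum_{m=1}^{M} \lambda_1\!\left(\nabla_w^2 \mathcal{L}(w_m)\right).
\]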


\subsection{Need For Flatness In BMA}\label{subsec:need_for_flatness_in_bma_app}
\paragraph{Experimental Details}
To measure the flatness of BNNs, $M$ in Eq.~\ref{eq:bma_hessian} is set to 30 for the experiments in Section~\ref{subsec:need_for_flatness_in_bma}. We primarily use ResNet18 (RN18) as the backbone. To assess generalization on CIFAR10 and CIFAR100, we report Error ($100 - \text{Accuracy}$), Expected Calibration Error (ECE)~\citep{guo2017calibration}, and Negative Log-Likelihood (NLL). To minimize confounding effects on the flatness measurements, we do not adjust BN or data augmentation. For BNN frameworks, we consider VI, MCMC, and SWAG. We also consider three learning rate schedulers: Constant, Cosine Decay (Cos Decay), and SWAG learning rate (SWAG lr).
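
For reference, the two probabilistic metrics can be written in their standard form~\citep{guo2017calibration}. The symbols below are introduced only for this sketch, and the number of confidence bins $B$ is not specified here, so it is left as a free parameter. With $n$ test samples partitioned into confidence bins $S_1, \dots, S_B$,
\[
  \mathrm{ECE} = \sum_{b=1}^{B} \frac{|S_b|}{n} \,\bigl| \mathrm{acc}(S_b) - \mathrm{conf}(S_b) \bigr| ,
  \qquad
  \mathrm{NLL} = -\frac{1}{n} \sum_{i=1}^{n} \log p(y_i \mid x_i, \mathcal{D}),
\]
where $\mathrm{acc}(S_b)$ and $\mathrm{conf}(S_b)$ are the average accuracy and the average predicted confidence of the samples in bin $S_b$, and $p(y_i \mid x_i, \mathcal{D})$ is the predictive probability assigned to the true label $y_i$. Lower values are better for both metrics.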


\subsubsection{Correlation Between Flatness And Generalization}\label{subsubsec:correlation_between_flatness_and_generalization}

We examine the correlation between flatness and the generalization performance of sampled models across all considered learning rate schedulers. The first and second rows of Figure~\ref{fig:additional_corr_plot} show scatter plots for models sampled from ResNet18 trained on CIFAR10 and CIFAR100, respectively. The columns of Figure~\ref{fig:additional_corr_plot} correspond to the Constant, Cosine Decay, and SWAG lr schedulers, respectively. All models are trained with SWAG and SGD with momentum, and we use the maximal eigenvalue $\lambda_1$ as the flatness measure. The correlation between flatness and each generalization metric is reported in the legend. Regardless of the scheduler and dataset, all three generalization metrics (error, ECE, and NLL) correlate strongly with flatness.

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%% Figure: additional correlation
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[th]
  \centering
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/corr/cifar10_swag-sgd-constant_corr.png}
    \caption{Constant}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/corr/cifar10_swag-sgd-cos_decay_corr.png}
    \caption{Cosine Decay}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/corr/cifar10_swag-sgd-swag_lr_corr.png}
    \caption{SWAG lr}
  \end{subfigure}\\[1ex]
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/corr/cifar100_swag-sgd-constant_corr.png}
    \caption{Constant}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/corr/cifar100_swag-sgd-cos_decay_corr.png}
    \caption{Cosine Decay}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/corr/cifar100_swag-sgd-swag_lr_corr.png}
    \caption{SWAG lr}
  \end{subfigure}
  \caption{Correlation between the maximal eigenvalue and the performance of 30 models sampled from SWAG, across all considered schedulers. Classification error, ECE, and NLL are all clearly correlated with flatness. We conjecture that flatness is crucial for the generalization performance of BNNs.}
  \label{fig:additional_corr_plot}
\end{figure}
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\subsection{Insufficient Flatness of BMA}\label{subsec:insufficient_flatness_of_bma_app}
% \subsubsection{Flatness and generalization according to the training methods}
Figure~\ref{fig:sgd_to_fpbma_cifar10} and Figure~\ref{fig:sgd_to_fpbma_cifar100} show results consistent with Figure~\ref{fig:sgd_to_fpbma} across various learning rate schedulers and metrics. Specifically, (1) BNNs struggle to ensure flatness compared to DNNs when using SGD, and (2) the proposed FP-BMA enables BNN frameworks to achieve flat minima, thereby enhancing performance.

\begin{figure}[ht]
\captionsetup{skip=0pt}
\centering
\includegraphics[width=0.75\textwidth]{figure/sgd_to_sabma/sgd_to_fpbma_cifar10.png} 
\caption{Comparison of Error, NLL, and ECE with various schedulers on CIFAR10 in relation to the maximum eigenvalue $\lambda_1$.}
\label{fig:sgd_to_fpbma_cifar10}
\end{figure}

\begin{figure}[ht]
\captionsetup{skip=0pt}
\centering
\includegraphics[width=0.75\textwidth]{figure/sgd_to_sabma/sgd_to_fpbma_cifar100.png} 
\caption{Comparison of Error, NLL, and ECE with various schedulers on CIFAR100 in relation to the maximum eigenvalue $\lambda_1$.}
\label{fig:sgd_to_fpbma_cifar100}
\end{figure}





\clearpage
\subsection{Performance Changes Based On The Number of Models In BMA}\label{subsec:performance_changes_based_on_the_number_of_models_in_bma}
We also inspect the influence of flatness on BMA performance across all considered schedulers. As before, we train ResNet18 on CIFAR10 and CIFAR100. Figure~\ref{fig:c10_flat_bma_plot} and Figure~\ref{fig:c100_flat_bma_plot} show the results on CIFAR10 and CIFAR100, respectively. The rows correspond to the Constant, Cosine Decay, and SWAG lr schedulers, and the columns show classification error, ECE, and NLL.

Consistent with Figure~\ref{fig:bma_num_plot_nll}, we observe two main findings: (1) BNNs trained with the proposed FP-BMA outperform those trained with SGD, suggesting that flatness influences posterior quality and contributes to better BMA performance. (2) With FP-BMA training, the predictive distribution converges with fewer BMA samples, so an effective approximation can be achieved with a smaller number of samples.

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%% Figure: BMA plot on CIFAR10
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[h]
  \centering
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/bma_num_plot/c10/c10_constant_error.png}
    \caption{Constant - Error}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/bma_num_plot/c10/c10_constant_ece.png}
    \caption{Constant - ECE}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/bma_num_plot/c10/c10_constant_nll.png}
    \caption{Constant - NLL}
  \end{subfigure}\\[1ex]
    \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/bma_num_plot/c10/c10_cosdecay_error.png}
    \caption{Cos Decay - Error}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/bma_num_plot/c10/c10_cosdecay_ece.png}
    \caption{Cos Decay - ECE}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/bma_num_plot/c10/c10_cosdecay_nll.png}
    \caption{Cos Decay - NLL}
  \end{subfigure}\\[1ex]
    \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/bma_num_plot/c10/c10_swaglr_error.png}
    \caption{SWAG lr - Error}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/bma_num_plot/c10/c10_swaglr_ece.png}
    \caption{SWAG lr - ECE}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/bma_num_plot/c10/c10_swaglr_nll.png}
    \caption{SWAG lr - NLL}
  \end{subfigure}\\[1ex]
  \caption{Performance variation with respect to sampling in BMA on CIFAR10 when flatness is considered. Rows correspond to the Constant, Cos Decay, and SWAG lr schedulers; columns show classification error, ECE, and NLL. The results indicate that flatness should be taken into account for efficient BMA.}
  \label{fig:c10_flat_bma_plot}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %%% Figure: BMA plot on CIFAR100
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[h]
  \centering
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/bma_num_plot/c100/c100_constant_error.png}
    \caption{Constant - Error}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/bma_num_plot/c100/c100_constant_ece.png}
    \caption{Constant - ECE}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/bma_num_plot/c100/c100_constant_nll.png}
    \caption{Constant - NLL}
  \end{subfigure}\\[1ex]
    \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/bma_num_plot/c100/c100_cosdecay_error.png}
    \caption{Cos Decay - Error}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/bma_num_plot/c100/c100_cosdecay_ece.png}
    \caption{Cos Decay - ECE}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/bma_num_plot/c100/c100_cosdecay_nll.png}
    \caption{Cos Decay - NLL}
  \end{subfigure}\\[1ex]
    \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/bma_num_plot/c100/c100_swaglr_error.png}
    \caption{SWAG lr - Error}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/bma_num_plot/c100/c100_swaglr_ece.png}
    \caption{SWAG lr - ECE}
  \end{subfigure}%
  \begin{subfigure}{0.33\textwidth}
    \centering
    \includegraphics[width=\linewidth]{figure/bma_num_plot/c100/c100_swaglr_nll.png}
    \caption{SWAG lr - NLL}
  \end{subfigure}\\[1ex]
  \caption{Performance variation with respect to sampling in BMA on CIFAR100 when flatness is considered. Rows correspond to the Constant, Cos Decay, and SWAG lr schedulers; columns show classification error, ECE, and NLL. The results indicate that flatness should be taken into account for efficient BMA.}
  \label{fig:c100_flat_bma_plot}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
