In this section, we examine the flatness of BNN posteriors obtained from widely used approximate Bayesian inference methods and argue, on both empirical and theoretical grounds, that flatness should be taken into account for BNNs.

\paragraph{Experimental Setup}
We empirically inspect the flatness of BNNs and observe that the generalization ability of the BMA prediction improves as weight samples are drawn from a flatter posterior. To this end, we train ResNet18~\citep{he2016deep} without Batch Normalization~\citep{ioffe2015batch} on CIFAR10~\citep{krizhevsky2009learning} using three Bayesian inference methods (VI, SWAG, and MCMC) to obtain the approximate posterior $q_\theta(w|\mathcal{D})$. We then compare generalization ability, measured by classification error, negative log-likelihood (NLL), and expected calibration error (ECE), with the flatness of the approximate posterior.


\paragraph{Flatness criterion for BNNs}
To evaluate the flatness of the posterior, we use Hessian eigenvalues averaged over posterior samples, unlike in DNNs, where flatness is assessed using the eigenvalues of a single model's Hessian. This difference stems from the fact that the loss of BNNs is formulated as the marginal likelihood over the posterior, incorporating multiple parameter samples $\{w_m\}_{m=1}^{M}$ drawn as $w_m \sim q_\theta(w|\mathcal{D})$. We use the averaged $i$-th largest eigenvalue of the Hessian:
\begin{equation}
\label{eq:bma_hessian}
\lambda_i \approx \frac{1}{M}\sum_{m=1}^M \lambda_i (H_{f_m}), \quad H_{f_m} = \nabla^2 \ell\big(f_{w_m}(x), y\big),
\end{equation}

where $\ell\big(f_{w_m}(x), y\big) \coloneqq -\log{p(y|f_{w_m}(x))}$ denotes the negative log-likelihood under the $m$-th parameter sample $w_m \sim q_\theta(w|\mathcal{D})$. $H_{f_m}$ denotes the Hessian of the loss $\ell \big(f_{w_m}(x), y\big)$, and $\lambda_i(H_{f_m})$ denotes its $i$-th largest eigenvalue. Notably, smaller leading eigenvalues of the Hessian indicate that the model parameters lie in a flatter region of the loss surface. Therefore, the maximal eigenvalue $\lambda_{1}$ or the eigenvalue ratio $\lambda_{1}/\lambda_5$ is often used to assess the flatness of model parameters~\citep{keskar2016large, foret2020sharpness, jastrzebski2020break}.
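
As a small illustration of Eq.~\ref{eq:bma_hessian}, consider $M=3$ posterior samples whose per-sample eigenvalues are the hypothetical values $\lambda_1(H_{f_m}) \in \{4.0, 2.5, 1.0\}$ and $\lambda_5(H_{f_m}) \in \{0.8, 0.5, 0.2\}$. These numbers are chosen only to show the arithmetic and are not measurements. The averaged eigenvalues would then be
\begin{equation*}
\lambda_1 \approx \frac{1}{3}\big(4.0 + 2.5 + 1.0\big) = 2.5, \qquad
\lambda_5 \approx \frac{1}{3}\big(0.8 + 0.5 + 0.2\big) = 0.5,
\end{equation*}
so that, if the ratio is formed from the averaged eigenvalues, $\lambda_1/\lambda_5 = 5$. A posterior whose samples have smaller leading eigenvalues yields a smaller averaged $\lambda_1$ and is regarded as flatter under this criterion.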


\subsection{Need for Flatness in BMA}\label{subsec:need_for_flatness_in_bma}
\paragraph{Takeaway 1: The flatness of models sampled from the posterior is correlated with generalization ability.}
Figure~\ref{fig:correlation_plot} compares normalized generalization ability, measured by Error, ECE, and NLL, against the flatness of BMA models sampled from a posterior trained with SWAG. The results reveal a strong positive correlation between flatness and generalization ability, suggesting that \emph{the flatness of models sampled from the posterior is correlated with generalization ability, as in DNNs.} We confirm that this property also holds across different learning rate schedulers and on the CIFAR-100 dataset, as shown in Figure~\ref{fig:additional_corr_plot} (Appendix~\ref{subsubsec:correlation_between_flatness_and_generalization}).


\paragraph{Takeaway 2: It is essential to approximate a flat posterior for BMA.}
We also establish a generalization error bound for BMA that explicitly involves the flatness of the posterior. First, we show that the flatness of BMA is determined by that of the individual BMA samples, highlighting the necessity of a flat posterior for effective BMA performance.
% Specifically, we demonstrate the flatness bound of simple weight averaging and connect it to that of BMA based on Hessian eigenvalues in Lemma~\ref{lemma:bma_eign_bound} (Detailed proof in Appendix~\ref{subsec:proof_of_theorem_1}).

\begin{lemma}
\label{lemma:bma_eign_bound}
Let $\ell(\cdot)$ be a twice-differentiable loss, $f_m(\cdot)$ the prediction of the model parameterized by $w_m$, and $f_{\text{BMA}}(\cdot)$ the BMA prediction.
Given $M$ model samples $\{w_m\}_{m=1}^M$, the maximal eigenvalue of the averaged Hessian of the loss, $\lambda_{\max}(H_{f_{\text{BMA}}})$, is bounded as follows:
\begin{align}
\label{eq:bma_eign_bound}
&\max \left( \Bigg{\{} \frac{1}{M} \bigg{(} \lambda_{\max}(H_{f_m}) + \sum_{\substack{n=1 \\ n \neq m}}^{M} \lambda_{\min}(H_{f_n}) \bigg{)} \Bigg{\}}_{m=1}^M \right) \\
& \qquad \le \lambda_{\max}(H_{f_{\text{BMA}}}) \le \frac{\sum_{m=1}^M \lambda_{\max}(H_{f_m})}{M}.
\end{align}
\end{lemma}
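
To see where the two sides of Eq.~\ref{eq:bma_eign_bound} come from, the following sketch may help. It assumes, as the lemma statement suggests, that $H_{f_{\text{BMA}}} = \frac{1}{M}\sum_{m=1}^M H_{f_m}$ with each $H_{f_m}$ symmetric. It relies only on Weyl's inequalities for symmetric matrices $A$ and $B$, namely $\lambda_{\max}(A) + \lambda_{\min}(B) \le \lambda_{\max}(A+B) \le \lambda_{\max}(A) + \lambda_{\max}(B)$ and $\lambda_{\min}(A+B) \ge \lambda_{\min}(A) + \lambda_{\min}(B)$. It is an illustration under these assumptions and is not meant to replace the full proof. For every $m \in \{1, \dots, M\}$,
\begin{align*}
\lambda_{\max}\Big(\tfrac{1}{M}\sum_{n=1}^{M} H_{f_n}\Big)
&\le \frac{1}{M}\sum_{n=1}^{M} \lambda_{\max}(H_{f_n}), \\
\lambda_{\max}\Big(\tfrac{1}{M}\sum_{n=1}^{M} H_{f_n}\Big)
&\ge \frac{1}{M}\bigg( \lambda_{\max}(H_{f_m}) + \lambda_{\min}\Big(\sum_{n \neq m} H_{f_n}\Big) \bigg)
\ge \frac{1}{M}\bigg( \lambda_{\max}(H_{f_m}) + \sum_{n \neq m} \lambda_{\min}(H_{f_n}) \bigg),
\end{align*}
and taking the maximum over $m$ on the right-hand side of the second line gives the lower bound. As a toy check with hypothetical Hessians $H_{f_1} = \operatorname{diag}(3, 1)$ and $H_{f_2} = \operatorname{diag}(1, 2)$, the averaged Hessian is $\operatorname{diag}(2, 1.5)$ with $\lambda_{\max} = 2$. The lower bound evaluates to $\max\{(3+1)/2,\ (2+1)/2\} = 2$ and the upper bound to $(3+2)/2 = 2.5$, so $2 \le 2 \le 2.5$ holds.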

Lemma~\ref{lemma:bma_eign_bound} implies that as each $\lambda_{\max}(H_{f_m})$ decreases in Eq.~\ref{eq:bma_eign_bound}, where it appears in both the lower and upper bounds, both bounds shrink and $\lambda_{\max}(H_{f_{\text{BMA}}})$ is pushed down accordingly. A smaller $\lambda_{\max}(H_{f_{\text{BMA}}})$ indicates that the BMA prediction operates within flatter minima. Given Lemma~\ref{lemma:bma_eign_bound}, the following theorem shows that the generalization error of the BMA predictor is directly controlled by the flatness of the posterior, as measured by the maximal Hessian eigenvalue~\citep{luo2024explicit}.

\begin{theorem}[Informal]\label{theorem:bma_generalization}
Let $f_{\text{BMA}}$ be the BMA predictor obtained by averaging over posterior samples. Then, with high probability,
\begin{align*}
    \ell_{\mathcal{D}}(f_{\text{BMA}})
    \leq \ell_{\mathcal{S}}(f_{\text{BMA}})
    + \frac{p \sigma^2}{2} \lambda_{\max}(H_{f_{\text{BMA}}})
    + O\left(\sigma^3 p^3 \right)
\end{align*}
where $\ell_{\mathcal{D}}$ and $\ell_{\mathcal{S}}$ denote the population and empirical loss, respectively, $p$ is the number of model parameters, $n$ is the sample size, $\sigma$ is a smoothing parameter, and $H_{f_{\text{BMA}}}$ is the Hessian of the loss evaluated at $f_{\text{BMA}}$.
\end{theorem}

This result formally supports our main message: \emph{BMA with a flat posterior---that is, a posterior whose samples $f_m$ exhibit smaller $\lambda_{\max}(H_{f_m})$---leads to a tighter generalization error bound.} Thus, posterior flatness is not only empirically correlated with generalization, but also a theoretically well-justified objective for BMA. A detailed proof of Theorem~\ref{theorem:bma_generalization} is provided in Appendix~\ref{subsec:proof_of_theorem_1}.
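
One standard way in which a term of the form $\frac{p\sigma^2}{2}\lambda_{\max}(H)$ can arise is a second-order expansion of the loss under Gaussian weight perturbations. The sketch below assumes a perturbation $\epsilon \sim \mathcal{N}(0, \sigma^2 I_p)$ around a weight vector $w$ with symmetric loss Hessian $H$. The symbols $\epsilon$, $I_p$, and this smoothing model are illustrative assumptions, and the proof in the appendix may proceed differently. Using $\mathrm{E}[\epsilon] = 0$, $\mathrm{E}[\epsilon^\top H \epsilon] = \sigma^2 \operatorname{tr}(H)$, and $\operatorname{tr}(H) \le p\,\lambda_{\max}(H)$,
\begin{align*}
\mathrm{E}_{\epsilon}\big[\ell(w+\epsilon)\big]
&= \ell(w) + \frac{1}{2}\,\mathrm{E}_{\epsilon}\big[\epsilon^\top H \epsilon\big] + \text{(higher-order terms in } \sigma\text{)} \\
&= \ell(w) + \frac{\sigma^2}{2}\operatorname{tr}(H) + \text{(higher-order terms in } \sigma\text{)} \\
&\le \ell(w) + \frac{p\sigma^2}{2}\lambda_{\max}(H) + \text{(higher-order terms in } \sigma\text{)}.
\end{align*}
Moreover, substituting the upper bound of Lemma~\ref{lemma:bma_eign_bound} into Theorem~\ref{theorem:bma_generalization} makes the dependence on the individual samples explicit:
\begin{equation*}
\ell_{\mathcal{D}}(f_{\text{BMA}})
\le \ell_{\mathcal{S}}(f_{\text{BMA}})
+ \frac{p\sigma^2}{2M}\sum_{m=1}^{M}\lambda_{\max}(H_{f_m})
+ O\left(\sigma^3 p^3\right),
\end{equation*}
so reducing $\lambda_{\max}(H_{f_m})$ for the sampled models tightens the bound.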




\subsection{Insufficient Flatness of BMA} \label{subsec:insufficient_flatness_of_bma}
\paragraph{Takeaway 3: Most approximate Bayesian inference methods struggle to produce a flat posterior.}
We investigate whether existing approximate Bayesian inference methods can produce a flat posterior for BNNs. Figure~\ref{fig:sgd_to_fpbma} illustrates how NLL and posterior flatness vary depending on whether flatness of the loss surface is taken into account during optimization. We observe that \emph{the approximate posteriors of BNNs are no flatter than DNNs trained with the SAM optimizer}.


On the other hand, the proposed FP-BMA, described in Section~\ref{subsec:flat_posterior_aware_optimizer}, allows BNNs to seek flat minima and thus achieves better performance. We also confirm consistent results across various learning rate schedulers and generalization performance metrics, as described in Appendix~\ref{subsec:insufficient_flatness_of_bma_app}.


\paragraph{Takeaway 4: Increasing the number of BMA samples does not lead to better performance without a flat posterior.}
Figure~\ref{fig:bma_num_plot_nll} compares the NLL of BMA predictions for two posteriors, one obtained with flatness taken into account and the other without. The results show that \emph{simply increasing the number of weight samples in BMA does not outperform BMA with a flat posterior}, highlighting the importance of posterior flatness for better generalization. In contrast, the proposed FP-BMA, described in Section~\ref{subsec:flat_posterior_aware_optimizer}, enhances posterior quality by capturing flatness and requires fewer samples to improve BMA. Additional results in Appendix~\ref{subsec:performance_changes_based_on_the_number_of_models_in_bma} confirm this trend across different learning rate schedulers and metrics.