\subsection{Bayesian Neural Networks}\label{subsec:bayesian_neural_network}

\paragraph{Training}
Let $w \in \mathbb{R}^p$ denote the parameters of a BNN and $\mathcal{D} = \{(x, y)\}$ the dataset of inputs $x$ and outputs $y$. In principle, training a BNN amounts to estimating the posterior distribution $p(w|\mathcal{D})$ via Bayes' rule:
\begin{equation}
\label{eq:bayes_rule}
p(w|\mathcal{D}) = \frac{p(\mathcal{D}|w)p(w)}{\int_w p(\mathcal{D}|w)p(w) dw},
\end{equation}
where $p(\mathcal{D}|w)$ and $p(w)$ denote the likelihood of data $\mathcal{D}$ and the prior distribution over $w$, respectively. 

However, the posterior $p(w|\mathcal{D})$ of a BNN is intractable in general. Hence, many approximate inference methods, including Markov Chain Monte Carlo (MCMC)~\citep{welling2011bayesian, chen2014stochastic}, Variational Inference (VI)~\citep{graves2011practical, blundell2015weight}, and other variants~\citep{ritter2018scalable, gal2016dropout, maddox2019simple}, have been employed to obtain an approximate posterior $q_\theta(w|\mathcal{D})$ with distribution parameters $\theta \in \mathbb{R}^q$ such that $q_\theta(w|\mathcal{D}) \approx p(w|\mathcal{D})$.


\paragraph{Prediction}
Given the approximate posterior $q_\theta (w | \mathcal{D})$, a BNN makes predictions on unseen data $(x^*, y^*)$ via \textit{Bayesian Model Averaging (BMA)}, which integrates predictions over the posterior distribution of the model parameters:

\begin{align}
p(y^*|x^*, \mathcal{D}) 
&\approx \int_w p(y^*|f_w(x^*))q_\theta(w|\mathcal{D}) dw \label{eq:mc_predictive} \\
&\approx \frac{1}{M} \sum_{m=1}^M p(y^*|f_{w_m}(x^*)), \ \  w_m \sim q_\theta(w|\mathcal{D}), \nonumber
\end{align}
where $f_w(\cdot)$ denotes the network output under parameters $w$ and $M$ denotes the number of sampled models; the first approximation replaces the true posterior with $q_\theta(w|\mathcal{D})$, and the second approximation in Eq.~\ref{eq:mc_predictive} uses Monte Carlo integration. Averaging over a diverse set of models sampled from the approximate posterior, which is the core idea of BMA, is known to improve generalization~\citep{wilson2020bayesian}.
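As a small numerical illustration with hypothetical values (not taken from any experiment), suppose $M=3$ sampled models assign probabilities $0.9$, $0.7$, and $0.5$ to the true label $y^*$. The Monte Carlo estimate is then
\begin{equation*}
p(y^*|x^*, \mathcal{D}) \approx \frac{1}{3}\left(0.9 + 0.7 + 0.5\right) = 0.7.
\end{equation*}
Note that BMA averages predictive probabilities rather than parameters. For independent samples $w_m$, the standard error of this estimate decays at the usual Monte Carlo rate $O(M^{-1/2})$.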


\begin{figure*}[t]
    \centering
    \begin{subfigure}[t]{0.32\textwidth}
        \centering
        \includegraphics[width=1.0\linewidth]{figure/corr/cifar10_swag-sgd-constant_corr.png}
        \captionsetup{justification=centering}
        \caption{Correlation between flatness and generalization within sampled models}
        \label{fig:correlation_plot}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.32\textwidth}
        \centering
        \includegraphics[width=1.0\linewidth]{figure/sgd_to_sabma/sgd_to_fpbma.png}
        \captionsetup{justification=centering}
        \caption{Flatness and generalization according to the training methods}
        \label{fig:sgd_to_fpbma}
    \end{subfigure}
    \hfill
    \begin{subfigure}[t]{0.32\textwidth}
        \centering
        \includegraphics[width=1.0\linewidth]{figure/bma_num_plot/c10/c10_cosdecay_nll.png}
        \captionsetup{justification=centering}
        \caption{NLL with the number of BMA models}
        \label{fig:bma_num_plot_nll}
    \end{subfigure}
    \caption{(a) Among models sampled from the posterior, flatness, measured via the maximal Hessian eigenvalue ($\lambda_1$), is highly correlated with generalization ability (classification error, ECE, and NLL). (b) Existing inference methods for BNNs (SWAG, VI, and MCMC) trained with SGD struggle to capture flatness compared to DNNs. In contrast, the proposed Bayesian flat posterior-aware optimizer FP-BMA allows BNNs to seek flat minima, improving performance. (c) Flat posteriors are necessary: increasing the number of BMA samples does not lead to better performance if the posterior is not flat. The proposed FP-BMA enhances posterior quality by capturing flatness and requires fewer samples for improved BMA.}
    \label{fig:combined_fig}
\end{figure*}



\subsection{Flatness and Optimization}\label{subsec:flatness_and_optimization}
Because the flatness of the loss surface is known to be connected to generalization~\citep{hochreiter1994simplifying, hochreiter1997flat,neyshabur2017exploring}, new training methods have been proposed to find flat local optima. Sharpness-Aware Minimization (SAM)~\citep{foret2020sharpness} is a widely adopted technique that seeks flat minima by making the model robust to small perturbations of its parameters. SAM performs a form of adversarial training in parameter space by minimizing the worst-case loss within an $L_p$ neighborhood of radius $\gamma$ around the weights:
\begin{equation}
\label{eq:sam_loss}
% \ell^\gamma_{\text{SAM}}(w) = \min_w \max_{\|\Delta w\|_p \leq \gamma} \ell(w+\Delta w).
\ell^\gamma_{\text{SAM}}(w) = \min_w \max_{\|\Delta w\|_p \leq \gamma} \ell(f_{w+\Delta w}(x), y),
\nonumber
\end{equation}
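where $\ell(\cdot)$ is the empirical loss function (e.g., cross-entropy for classification tasks) and $\ell(w)$ abbreviates $\ell(f_w(x), y)$. In practice, $p$ is set to $p=2$, which under a first-order approximation of the inner maximization yields $\Delta w = \gamma \nabla_w \ell(w) / \|\nabla_w \ell(w) \|_2$.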
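This closed form can be sketched as follows, using the standard linearization argument. With $g = \nabla_w \ell(w)$,
\begin{align*}
\ell(w+\Delta w) &\approx \ell(w) + g^\top \Delta w, \\
\Delta w^* &= \arg\max_{\|\Delta w\|_2 \leq \gamma} g^\top \Delta w = \gamma \frac{g}{\|g\|_2},
\end{align*}
where the maximizer follows from the Cauchy--Schwarz inequality. A SAM step then descends along the gradient evaluated at the perturbed point, $w \leftarrow w - \eta \nabla_w \ell(w)\big|_{w+\Delta w^*}$, where the learning rate $\eta$ is notation introduced here for illustration.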

However, SAM's isotropic $L_2$ ball may not accurately reflect the anisotropic geometry of the loss landscape. To address this, Fisher SAM (FSAM)~\citep{kim2022fisher} replaces the Euclidean ball with a non-Euclidean neighborhood defined by the Fisher information matrix (FIM):
\begin{equation}
\label{eq:fsam_loss}
\ell^\gamma_{\text{FSAM}}(w) = \min_w \max_{\|F_y(w) \Delta w \|_p \leq \gamma^2} \ell(f_{w+\Delta w}(x), y),
\nonumber
\end{equation}
where $F_y(w)$ denotes the FIM, approximated in practice by the diagonal mini-batch estimate $F_y(w) \approx \frac{1}{|B|} \sum_{(x, y) \in B} \left(\nabla_w \log p(y|x, w)\right)^2$, with the square taken element-wise and $|B|$ the batch size. SAM and FSAM are both derived for deterministic $w$, and $F_y (w)$ is defined over the predictive distribution $p(y|f_w(x))$ rather than over the parameter space.
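For intuition, the inner maximization admits a closed form analogous to that of SAM. The following is only a sketch under two assumptions not stated above: the neighborhood is read as the ellipsoid $\Delta w^\top F_y(w) \Delta w \leq \gamma^2$, and $F_y(w)$ is positive definite. Writing $g = \nabla_w \ell(w)$ and linearizing the loss around $w$,
\begin{equation*}
\Delta w^* = \arg\max_{\Delta w^\top F_y(w) \Delta w \leq \gamma^2} g^\top \Delta w = \gamma \frac{F_y(w)^{-1} g}{\sqrt{g^\top F_y(w)^{-1} g}},
\end{equation*}
which follows from the stationarity condition $g = 2\lambda F_y(w) \Delta w$ with the constraint active. Setting $F_y(w) = I$ recovers the SAM perturbation $\gamma g / \|g\|_2$, and for a diagonal $F_y(w)$ the inverse reduces to an element-wise division.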