Our theoretical analysis and empirical findings suggest that a flat posterior in BNNs is crucial for generalization but is not achieved by existing approximate Bayesian inference methods. To address this, we propose an optimizer that encourages a flat posterior (Section \ref{subsec:flat_posterior_aware_optimizer}) and introduce Bayesian transfer learning combined with diverse BNN frameworks (Section \ref{subsec:bayesian_transfer_learning}).



\subsection{Flat Posterior-aware Optimizer}
\label{subsec:flat_posterior_aware_optimizer}
To account for the probabilistic nature of BNNs, we propose a
new objective function based on VI:
\begin{equation}
\label{eq:main_loss}
\begin{aligned}
    \ell_{\text{FP-BMA}}^\gamma(\theta) = &\min_\theta \max_{d|\theta+\Delta\theta, \theta| \leq \gamma^2} \ell(\theta+\Delta\theta) \\
    &+ \beta \textrm{D}_{\textrm{KL}} [q_\theta (w|\mathcal{D}) || p (w)] 
\end{aligned}
\end{equation}

\begin{equation}
\label{eq:divergence}
    \textrm{s.t.} \ \  d|\theta+\Delta\theta, \theta| = \textrm{D}_{\textrm{KL}} \big[ q_{\theta+\Delta\theta}(w |\mathcal{D}) \ || \ q_{\theta}(w | \mathcal{D}) \big],
\end{equation}
where $\theta$ and $\Delta\theta$ denote the variational parameters and their perturbation, respectively. $\ell(\theta + \Delta\theta)$ denotes the empirical loss under the perturbed posterior $q_{\theta + \Delta\theta}(w|\mathcal{D})$, and $\beta$ is a hyperparameter that controls the influence of the prior.

Given the new objective $\ell_{\text{FP-BMA}}^\gamma(\theta)$ in Eq.~\ref{eq:main_loss},
the variational parameter $\theta$ is updated using the approximate gradient
\begin{equation}
\label{eq:update_sabma}
\nabla_\theta \ell^\gamma_{\text{FP-BMA}} (\theta) \approx \nabla_\theta \ell(\theta + \Delta\theta_{\text{FP-BMA}}),
\end{equation}
where the parameter perturbation $\Delta\theta_{\text{FP-BMA}}$ is first computed as:
\begin{equation}
\label{eq:Delta_theta_star}
        \Delta\theta_{\text{FP-BMA}} = \gamma\frac{F_\theta(\theta)^{-1} \nabla_\theta \ell(\theta)}{\sqrt{\nabla_\theta \ell(\theta)^T F_\theta(\theta)^{-1} \nabla_\theta \ell(\theta)}},
\end{equation}
using the FIM $F_\theta({\theta}){=}\mathbb{E}_{w,\mathcal{D}} [\nabla_{\theta}\log q_\theta(w|\mathcal{D}) \nabla_{\theta} \log q_\theta(w|\mathcal{D})^{T}]$. The gradient $\nabla_{\theta} \ell(\theta)$ is then evaluated at $\theta + \Delta\theta_{\text{FP-BMA}}$. We refer to our objective as FP-BMA and provide a detailed derivation in Appendix~\ref{subsec:derivation_odf_bayesian_flat_seeking_optimizer}.
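
As an illustration of Eq.~\ref{eq:Delta_theta_star}, consider a mean-field Gaussian posterior $q_\theta(w|\mathcal{D}) = \prod_i \mathcal{N}(w_i; \mu_i, \sigma_i^2)$ parameterized directly by $\theta = (\mu, \sigma)$. This parameterization, and the shorthand $g_{\mu_i}$, $g_{\sigma_i}$ for the components of $\nabla_\theta \ell(\theta)$, are assumptions made here for concreteness. Taking the expectation over $w \sim q_\theta$, the FIM is diagonal:
\begin{equation*}
    F_\theta(\theta) = \mathrm{diag}\big(\sigma_1^{-2}, \dots, \sigma_n^{-2},\; 2\sigma_1^{-2}, \dots, 2\sigma_n^{-2}\big),
\end{equation*}
so that the perturbation can be computed elementwise:
\begin{equation*}
    \Delta\mu_i = \frac{\gamma}{Z}\, \sigma_i^2\, g_{\mu_i}, \qquad
    \Delta\sigma_i = \frac{\gamma}{2Z}\, \sigma_i^2\, g_{\sigma_i}, \qquad
    Z = \sqrt{\sum_{j} \sigma_j^2 \Big( g_{\mu_j}^2 + \tfrac{1}{2}\, g_{\sigma_j}^2 \Big)}.
\end{equation*}
Under this assumption no matrix inversion is required, and weights with larger posterior variance receive proportionally larger perturbations. Other parameterizations of the scale (for example, $\log \sigma$) would change the $\sigma$-block of the FIM accordingly.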
%, with the corresponding Algorithm~\ref{alg:FP-BMA} (Appendix~\ref{subsec:algorith_of_fpbma}).

The proposed FP-BMA optimizer offers three key advantages, listed below and detailed in the following paragraphs:
\begin{itemize}
    \item \textbf{Implicit Flatness Control}
    \item \textbf{KL-based Bayesian Perturbation Ball}
    \item \textbf{Generalized Version of Geometric Optimizers}
\end{itemize}

\paragraph{Implicit Flatness Control}
Eq.~\ref{eq:main_loss} implicitly controls sharpness by penalizing solutions that are sensitive to parameter perturbations. This can be formalized via a second-order Taylor expansion of the inner maximization:
\begin{align}
\label{eq:second_order_inner_max}
&\max_{d|\theta+\Delta\theta, \theta| \leq \gamma^2} \ell(\theta+\Delta\theta) \\
& \quad \approx \max_{d|\theta+\Delta\theta, \theta| \leq 1} \left( \ell(\theta) + \gamma^2 \Delta\theta^\top \nabla_\theta \ell(\theta) + \frac{\gamma^4 \lambda_1(H_{f_\theta})}{2} \right) \nonumber
\end{align}
where $\lambda_1(H_{f_\theta})$ is the largest eigenvalue of the Hessian $H_{f_\theta}$. Thus, minimizing Eq.~\ref{eq:main_loss} favors solutions with lower sharpness, so that the variational posterior concentrates in flatter regions of the loss landscape.


\paragraph{KL-based Bayesian Perturbation Ball}
Unlike deterministic optimizers such as SAM and FSAM, our method constrains the perturbation in distributional space via the KL divergence, as shown in Eq.~\ref{eq:divergence}. This approach leverages local curvature information without expensive inner-loop optimization, making the method scalable and practical for large models. In addition, for Gaussian variational posteriors, the KL-based constraint naturally captures changes in both mean and variance, providing a richer and more Bayesian-consistent notion of flatness.
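
To make the last point concrete, consider a single Gaussian factor $q_\theta = \mathcal{N}(\mu, \sigma^2)$ with $\theta = (\mu, \sigma)$ and a perturbation $\Delta\theta = (\Delta\mu, \Delta\sigma)$ such that $\sigma + \Delta\sigma > 0$ (a simplified setting assumed here for illustration). The divergence in Eq.~\ref{eq:divergence} is then available in closed form:
\begin{equation*}
\begin{aligned}
    \textrm{D}_{\textrm{KL}} \big[ q_{\theta+\Delta\theta} \ || \ q_{\theta} \big]
    &= \log\frac{\sigma}{\sigma + \Delta\sigma} + \frac{(\sigma + \Delta\sigma)^2 + \Delta\mu^2}{2\sigma^2} - \frac{1}{2} \\
    &\approx \frac{\Delta\mu^2}{2\sigma^2} + \frac{\Delta\sigma^2}{\sigma^2},
\end{aligned}
\end{equation*}
where the approximation keeps terms up to second order in $\Delta\theta$. Both a shift in the mean and a change in the scale are measured relative to the current posterior width $\sigma$, whereas a Euclidean ball would weight $\Delta\mu$ and $\Delta\sigma$ identically regardless of $\sigma$.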




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Figure] Synthetic Example
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{figure/posterior_approx/posterior_approx.png} % height=7cm
\caption{Posterior approximation on a synthetic example. We compare how different optimizers approximate the posterior when flat and sharp modes coexist. Unlike the other methods, the proposed FP-BMA converges to the flat mode.}
\label{fig:posterior_approx}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% [TABLE] Scratch
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{tables/scratch_rn18_vitb16.tex}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\paragraph{Generalized Version of Geometric Optimizers}
FP-BMA is a generalized version of SAM and FSAM and approximates NG under deterministic parameters, as shown in Theorem~\ref{theorem:generalized_sabma}. Proof of Theorem~\ref{theorem:generalized_sabma} is provided in Appendix~\ref{subsec:proof_of_theorem2}.


\begin{theorem}
\label{theorem:generalized_sabma}
(Informal) Suppose the model parameter $w$ is deterministic and the loss function $\ell(\cdot)$ is twice continuously differentiable. Let $\gamma^\prime = \gamma/\sqrt{\nabla_\theta \ell(\theta)^T F_\theta (\theta)^{-1} \nabla_\theta \ell(\theta)}$, then
\begin{enumerate}[label=\roman*)]
    \item FP-BMA degenerates to SAM if FIM is an identity matrix.
    \item FP-BMA degenerates to FSAM by using the diagonal terms of FIM.
    \item FP-BMA approximates the update rule of NG with learning rate $\eta_{\text{FP-BMA}} = \frac{\eta_{\text{NG}}}{(1 + \gamma^\prime)} F_\theta (\theta)^{-1}$, where $\eta_{\text{NG}}$ denotes the learning rate of NG.
\end{enumerate}
\end{theorem}
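
Items i) and ii) can be read off Eq.~\ref{eq:Delta_theta_star} by direct substitution. The following is only a sketch, using the shorthand $g = \nabla_\theta \ell(\theta)$ and $\hat{F} = \mathrm{diag}(F_\theta(\theta))$ introduced here for brevity; the formal proof is the one referenced above. Setting $F_\theta(\theta) = I$ gives
\begin{equation*}
    \Delta\theta_{\text{FP-BMA}} = \gamma \frac{g}{\sqrt{g^T g}} = \gamma \frac{g}{\| g \|_2},
\end{equation*}
which is the SAM perturbation with radius $\gamma$. Keeping only the diagonal of the FIM gives
\begin{equation*}
    \Delta\theta_{\text{FP-BMA}} = \gamma \frac{\hat{F}^{-1} g}{\sqrt{g^T \hat{F}^{-1} g}},
\end{equation*}
an elementwise rescaling of the gradient by the inverse Fisher diagonal, which matches the form of the FSAM perturbation.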

This unifying perspective implies that FP-BMA applies to both deterministic and Bayesian settings, providing a principled way to leverage geometric properties of the loss landscape in probabilistic models. As a result, FP-BMA inherits the empirical benefits of sharpness-aware and natural gradient optimizers, such as improved generalization and robustness, while extending their applicability to Bayesian neural networks in a theoretically grounded manner.



\subsection{Flat Posterior-aware Bayesian Transfer Learning}\label{subsec:bayesian_transfer_learning}
Additionally, we extend the proposed objective to seek a flat posterior in Bayesian transfer learning. Given the approximate posterior $q^{\text{pr}}_{\theta}(w|\mathcal{D}^{\text{pr}})$ obtained on the source (pre-training) dataset $\mathcal{D}^{\text{pr}}$, we define the objective as:
\begin{equation}
\label{eq:bayesian_transfer_learning}
\begin{aligned}
    \ell^\gamma_{\text{FP-BMA}}(\theta) &= \min_\theta \max_{d|\theta+\Delta\theta, \theta| \leq \gamma^2} \ell(\theta + \Delta\theta)\\
    &+ \beta \textrm{D}_{\textrm{KL}}[q_\theta (w|\mathcal{D}^{\text{ft}}) || q^{\text{pr}}_{\theta}(w|\mathcal{D}^{\text{pr}})]
\end{aligned}
\end{equation}
\begin{equation}
\label{eq:divergence_transfer}
    \textrm{s.t.} \ \  d|\theta+\Delta\theta, \theta| = \textrm{D}_{\textrm{KL}} \big[ q_{\theta+\Delta\theta}(w |\mathcal{D}^{\text{ft}}) \ || \ q_{\theta}(w | \mathcal{D}^{\text{ft}}) \big], \nonumber
\end{equation}
where $\mathcal{D}^{\text{ft}}$ is the downstream dataset. Intuitively, this objective replaces the prior distribution $p(w)$ in Eq.~\ref{eq:main_loss} with the approximate posterior $q^{\text{pr}}_{\theta}(w|\mathcal{D}^{\text{pr}})$ obtained on the source dataset. Notably, \emph{the proposed objective $\ell^\gamma_{\text{FP-BMA}}(\theta)$ can be effective in general transfer learning settings where model misspecification~\citep{muller2013risk, wilson2020bayesian} exists, that is, where the prior is not suitable for the BNN being fine-tuned on the downstream task, and flat parameters have been shown to improve the model's robustness~\citep{kim2022sufficient,zhang2023flatness}.}
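
For intuition, suppose that both the variational posterior and the source posterior are mean-field Gaussians, $q_\theta = \prod_i \mathcal{N}(\mu_i, \sigma_i^2)$ and $q^{\text{pr}}_\theta = \prod_i \mathcal{N}(\mu^{\text{pr}}_i, (\sigma^{\text{pr}}_i)^2)$. This Gaussian form and the symbols $\mu^{\text{pr}}_i$, $\sigma^{\text{pr}}_i$ are assumptions introduced here for illustration. The regularizer in Eq.~\ref{eq:bayesian_transfer_learning} then has the closed form
\begin{equation*}
    \textrm{D}_{\textrm{KL}}\big[q_\theta \ || \ q^{\text{pr}}_{\theta}\big]
    = \sum_i \left( \log\frac{\sigma^{\text{pr}}_i}{\sigma_i} + \frac{\sigma_i^2 + (\mu_i - \mu^{\text{pr}}_i)^2}{2 (\sigma^{\text{pr}}_i)^2} - \frac{1}{2} \right).
\end{equation*}
In this setting the term pulls each mean $\mu_i$ toward its pre-trained value $\mu^{\text{pr}}_i$ with a strength inversely proportional to the source variance, so parameters about which the source posterior is uncertain remain freer to adapt to $\mathcal{D}^{\text{ft}}$.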
% Indeed, we demonstrate that the proposed method achieves superior generalization under distributional shifts in Section~\ref{subsec:robustness_on_distribution_shift}.


For computational efficiency, we adopt a sub-network BNN strategy that restricts training to the normalization and last-layer parameters, as explored in prior work~\citep{izmailov2020subspace, daxberger2021bayesian, sharma2023bayesian}. During fine-tuning, we reinitialize the last layer from a Gaussian distribution, $\mathcal{N}(0, \alpha I)$, where $\alpha$ is a hyperparameter that controls the variance. This approach enables scalable and stable training by leveraging pre-trained DNNs. The complete FP-BMA procedure is given in Algorithm~\ref{alg:FP-BMA} (Appendix~\ref{subsec:algorith_of_fpbma}).