\section{Introduction}
Modern deep neural networks trained with SGD and its variants have achieved surprising successes: overparametrized networks often contain more parameters than there are training examples, and yet generalize well to the test set. This contrasts with the traditional wisdom in statistical learning, which suggests that such high-capacity models will overfit the training data and fail on unseen data \citep{zhang2017understanding}. Intense recent efforts have been devoted to explaining this peculiar phenomenon by investigating the properties of SGD \citep{arpit2017closer,bartlett2017spectrally,neyshabur2017exploring,arora2019fine}, but the current understanding is still far from complete. For example, neural tangent kernel (NTK)-based generalization bounds for SGD normally require the width of the network to be sufficiently large (or even to go to infinity) \citep{arora2019fine}, and stability-based bounds for SGD have a poor dependence on an intractable Lipschitz constant \citep{hardt2016train,bassily2020stability}.
% \textcolor{red}{[NEED A highlight/summary of what is known and why it is not good enough. Or point the reader to the related works and discuss there.]} \looseness=-1
% In particular, the implicit regularization effect of SGD is often invoked in the literature. It then becomes well-known that such implicit regularization of SGD is crucial for both optimization and generalization \citep{barrett2020implicit,smith2020origin}. 
% \vspace{-3mm}
\begin{figure*}[!ht]
    \centering
    \begin{subfigure}[b]{0.245\textwidth}
\includegraphics[scale=0.28]{figs/acc-plot-svhn-vgg-1.png}    
\caption{VGG on (small) SVHN}            \label{fig:vgg-svhn-acc}
    \end{subfigure}
\begin{subfigure}[b]{0.245\textwidth}
\includegraphics[scale=0.28]{figs/acc-plot-cifar10-vgg-1.png}
\caption{VGG on CIFAR10}
    \label{fig:vgg-cifa10-acc}
\end{subfigure}
 \begin{subfigure}[b]{0.245\textwidth}
\includegraphics[scale=0.28]{figs/acc-plot-cifar10-resnetwobn-1.png}
\caption{ResNet on CIFAR10}
\label{fig:resnet-cifa10-acc}
    \end{subfigure}
\begin{subfigure}[b]{0.245\textwidth}
\includegraphics[scale=0.28]{figs/acc-plot-cifar100-resnet-1.png}
\caption{ResNet on CIFAR100}
\label{fig:resnet-cifa100-acc}
\end{subfigure}
\caption{Performance of VGG-11 and ResNet-18 trained with SGD and SDE.
% Standard data augmentation techniques are only used in (d).
}\label{fig:Acc-Dynamics}
% \vspace{-4mm}
\end{figure*}
% \vspace{-0.1in}

% \textcolor{red}{Gap between SGD and SGLD, SDE is necessary}
Recently, information-theoretic generalization bounds have been developed to analyze the expected generalization error of a learning algorithm. The main advantage of such bounds is that they are not only distribution-dependent, but also algorithm-dependent, making them an ideal tool for studying the generalization behaviour of models trained with a specific algorithm, such as SGD.
%which seems necessary to obtain the non-vacuous bound in practice. 
The concept of mutual information (MI)-based bounds can be traced back to \cite[Section~1.3.1]{catoni2007pac}, where it is discussed how the prior distribution that minimizes the expected KL complexity term in a PAC-Bayesian (PAC-Bayes) bound can transform the KL term into the MI term. This idea has recently been popularized by \citep{russo2016controlling,russo2019much,xu2017information}, and further strengthened by additional techniques \citep{asadi2018chaining,negrea2019information,bu2019tightening,steinke2020reasoning,haghifam2020sharpened,wang2021analyzing}.
% Mutual information-based bounds are first proposed
% they are recently popularized by \citep{russo2016controlling,russo2019much,xu2017information}. They are then strengthened by additional techniques \citep{asadi2018chaining,negrea2019information,bu2019tightening,steinke2020reasoning,haghifam2020sharpened,wang2021analyzing}. 
In particular, \citet{negrea2019information} derive MI-based bounds by developing a PAC-Bayes-like bounding technique, which upper-bounds the generalization error in terms of the KL divergence between the posterior distribution of the learned model parameters given by a learning algorithm and any data-dependent prior distribution. Notably, applying these information-theoretic techniques usually requires the learning algorithm to be an iterative noisy algorithm, such as stochastic gradient Langevin dynamics (SGLD) \citep{raginsky2017non,pensia2018generalization}, so that the MI bounds do not become infinite; hence they cannot be directly applied to SGD. While it is feasible to adopt similar techniques from the PAC-Bayes bounds in \cite{lotfi2022pacbayes} to analyze SGD, in order to study the generalization of SGD using methods akin to those for SGLD, \citet{neu2021information} and \citet{wang2022generalization} develop generalization bounds for SGD by constructing an auxiliary iterative noisy process. However, identifying an optimal auxiliary process is difficult, and arbitrary choices may not provide meaningful insights into the generalization of SGD; see Appendix~\ref{sec:IT-SGD} for further discussion.
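
For concreteness, the prototypical MI bound in this line of work \citep{xu2017information} can be sketched as follows. The statement assumes that the loss $\ell(w,Z)$ is $\sigma$-sub-Gaussian under the data distribution for every fixed $w$; the symbols $W$ (output weights), $S$ (training set of $n$ i.i.d.\ examples), $L_\mu$ (population risk) and $L_S$ (empirical risk) are illustrative notation introduced here and may differ from the formal definitions used elsewhere in the paper:
\[
\left|\mathbb{E}\left[L_\mu(W)-L_S(W)\right]\right| \le \sqrt{\frac{2\sigma^2}{n}\, I(W;S)} .
\]
If $W$ is continuous-valued and the algorithm injects no continuous noise, $I(W;S)$ can be infinite, in which case the bound is vacuous; Gaussian noise injected at every step, as in SGLD, typically keeps the corresponding information terms finite.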

% so additional complexity must be dealt with in that analysis.


% construct an auxiliary iterative noisy process to bound the SGD process and develop generalization bounds for SGD via . additional complexity must be delt with in that analysis.


%which improve upon the previous results by letting the MI term be evaluated on a subset of the training sample.  These MI-based bounds are typically useful to noisy iterative algorithms.  For example,  \cite{pensia2018generalization} first apply the information-theoretic bound given by \cite{xu2017information} to analyze the generalization property of stochastic gradient Langevin dynamics (SGLD). Since the noise used in SGLD is usually an isotropic Gaussian, by utilizing the closed form of KL divergence between two Gaussian distributions, the information-theoretic generalization bound for SGLD is shown to have a tractable form. Their result is then improved by stronger bounds in the literature \citep{bu2019tightening,negrea2019information,haghifam2020sharpened,wang2022generalization}. MI bounds, however, can not be directly used to analyze the generalization property of vanilla SGD since the MI term in the bound may go to infinity in this case and one could not obtain a tractable form by using following the approach of \cite{pensia2018generalization}. Recently,  \cite{neu2021information} and \cite{wang2022generalization} have studied the generalization of models trained with SGD and obtained new MI bounds, using a technique via constructing an auxiliary perturbed weight process; additional complexity must be delt with in that analysis. Thus there appears significant room for improved understanding of the generalization of SGD.\looseness=-1



Recent research has suggested that the SGD dynamics can be well approximated by using stochastic differential equations (SDEs), where the gradient signal in SGD is regarded as the full-batch gradient perturbed with an additive Gaussian noise. 
%There have been some recent attempts formulating SGD dynamics using stochastic differential equations (SDE), in which the key component is the modelling of the gradient noise. 
Specifically, \cite{mandt2017stochastic} and \cite{jastrzkebski2017three} model this gradient noise as drawn from a Gaussian distribution with a fixed covariance matrix,
% a locally fixed Gaussian, 
thereby viewing SGD as performing variational inference. \cite{zhu2019anisotropic,wu2020noisy,xie2020diffusion}, and \cite{xie2021positive} further model the gradient noise as dependent on the current weight parameter and the training data.
% Modelling SGD in this way provide explanations as to when SGD finds flat minima \citep{zhu2019anisotropic,xie2020diffusion} and sharp minima \citep{ziyin2022sgd}, and inspire some new training techniques \citep{wu2020noisy,xie2021positive}. 
Moreover, \citet{li2017stochastic,li2019stochastic} and \citet{wu2020noisy}
prove that when the learning rate is sufficiently small, the SDE trajectories are theoretically close to those of SGD (cf. Lemma~\ref{lem:sde-weak}). More recently, \cite{li2021validity} have demonstrated that the SDE approximation
well characterizes the optimization and generalization behavior of SGD without requiring small learning rates.
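
To make the Gaussian modelling above concrete, a minimal sketch of the resulting discrete update is given below. The notation is an assumption made for illustration and may differ from the formal definitions used elsewhere in the paper: $W_t$ denotes the weights at step $t$, $\eta$ the learning rate, $\nabla L_S$ the full-batch gradient on the training set $S$, $C(w)$ the covariance of the mini-batch gradient noise at $w$, and $N_t\sim\mathcal{N}(0,\mathrm{I}_d)$ a standard Gaussian vector drawn independently at each step. Writing the mini-batch gradient as $\nabla L_S(w)+\xi$ with $\mathbb{E}[\xi]=0$ and $\mathrm{Cov}(\xi)=C(w)$, and replacing $\xi$ by a Gaussian with the same first two moments, gives
\[
W_t \;=\; W_{t-1} \;-\; \eta\,\nabla L_S(W_{t-1}) \;-\; \eta\, C(W_{t-1})^{1/2} N_t .
\]
Taking $C$ constant corresponds to the fixed-covariance model, while letting $C$ depend on $W_{t-1}$ and $S$ corresponds to the state-dependent model.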

In this work, we also empirically verify the consistency between the dynamics of SGD and its associated discrete SDE (cf. Eq.~(\ref{eq:sgd-update-gaussian})). As illustrated in Figure~\ref{fig:Acc-Dynamics}, the strong agreement in their performance suggests that, despite the potential presence of non-Gaussian components in the SGD gradient noise, analyzing its SDE through a Gaussian approximation suffices for exploring SGD's generalization behavior.
% \looseness=-1
% The results establishing SDE as a good approximation for the SGD dynamics motivate us to study the generalization behavior of SGD under such approximations. 
Furthermore, under the SDE formalism of SGD, SGD becomes an iterative noisy algorithm, to which the aforementioned information-theoretic bounding techniques can be directly applied. In particular, we summarize our contributions below.
\begin{itemize}[leftmargin=*]
    \item We obtain a generalization bound (cf. Theorem \ref{thm:isotropic-prior-bound}) in the form of a summation over training steps of a quantity that involves both the sensitivity of the full-batch gradient to the variation of the training set and the covariance of the gradient noise (which makes the SGD gradient deviate from the full-batch gradient). We also give a tighter bound in Theorem \ref{thm:anisotropic-prior-bound},
    where 
    % the population gradient covariance and also the covariance of the gradient noise are involved, and 
    the generalization performance of SGD depends on the alignment of the population gradient covariance and the batch  gradient covariance.
    % these two matrices.
    These bounds highlight the significance of (the trace of) the gradient noise covariance in the generalization ability of SGD.  
    % Additionally, these results also provide justifications for the effectiveness of regularization by controlling the gradient norm, a technique suggested by previous works. 

    \item In addition to the time-dependent trajectory-based bounds, we also provide time-independent (or asymptotic) bounds under some mild assumptions. Specifically, based on previous information-theoretic bounds, we obtain generalization bounds in terms of the KL divergence between the steady-state weight distribution of SGD and either a distribution-dependent prior distribution (by Lemma~\ref{lem:xu's-bound}) or a data-dependent prior distribution (by Lemma~\ref{lem:data-dependent-prior}). The former gives us a bound based on the alignment between the weight covariance matrix for each individual local minimum and the weight covariance matrix for the average of local minima (cf. Theorem~\ref{thm:opt-state-inde-bound}). Under mild assumptions, we can estimate the steady-state weight distribution of the SDE (cf. Lemma~\ref{lem:stationary-real}), leading to a variant of Theorem~\ref{thm:opt-state-inde-bound} (cf. Corollary~\ref{cor:pacbayes-anisotropic-prior}) and a norm-based bound (cf. Corollary~\ref{cor:pacbayes-isotropic-prior}). Additionally, we obtain a stability-like bound via Lemma~\ref{lem:data-dependent-prior} (cf. Theorem~\ref{thm:pacbayes-data-dependent-prior}), which notably omits the Lipschitz constant that appears in other stability-based bounds. Since stability-based bounds often achieve fast decay rates, e.g., $\mathcal{O}(1/n)$, Theorem~\ref{thm:pacbayes-data-dependent-prior} offers a theoretical advantage over other information-theoretic bounds, as it can attain the same rate of decay as stability-based bounds.
    Compared to the first family of bounds (i.e., trajectory-based bounds), the second family of bounds directly upper-bounds the generalization error via the terminal
state, which avoids summing over training steps; these bounds can be tighter when the steady-state estimates are accurate. On the other hand, not relying on the steady-state estimates and the approximating
assumptions on which they are based is arguably an advantage of the first family.

    \item  
    % In addition to comparing the generalization performance of SGD and SDE, we 
    We empirically analyze key components within the derived bounds for both algorithms. Our empirical findings reveal that these components for SGD and SDE align remarkably well, further validating the effectiveness of our bounds for assessing the generalization of SGD. Moreover, we provide numerical validation of the presented bounds and demonstrate that our trajectory-based bound is tighter than the result in \cite{wang2022generalization}. Additionally, compared with norm-based bounds, we show that the terminal-state-based bound that integrates the geometric properties of local minima can better characterize generalization.
    % Trivially choosing the prior distribution to be a Gaussian independent of training data (which reduces to MI bound of \citet{xu2017information}), we obtain a bound
% (Theorem~\ref{thm:pacbayes-isotropic-prior}) in terms of the distance of the weight output from SGD  to the prior estimate (e.g. initialization of weights). It is more interesting  choosing the prior as the steady-state weight distribution obtained by SGD on the same training set but with one example held out. In this case, the bound we obtain (Theorem \ref{thm:pacbayes-data-dependent-prior}) can be elegantly expressed using the influence function \citep{koh2017understanding}, which suggests that the generalization of the SGD is related to the stability of SGD. Comparing to the first family of bounds (i.e., those based on the MI between the training set and training trajectories), the second family of bounds directly bound the generalization error via the terminal state, which avoids summing over training steps; these bounds are in general tighter when the steady-state estimates are accurate. On the other hand, not relying on the steady-state estimates and the approximating assumptions they base upon is arguably an advantage of the first family.
\end{itemize}
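
As a rough illustration of why trajectory-based bounds take the form of a sum over training steps, one standard route is sketched here. It assumes that the initialization $W_0$ is independent of the training set $S$ and writes $W_{0:t}$ for the weight trajectory up to step $t$; this notation is illustrative and the formal development may proceed differently. The data-processing inequality followed by the chain rule of mutual information gives
\[
I(W_T;S) \;\le\; I(W_{0:T};S) \;=\; \sum_{t=1}^{T} I\left(W_t;S \,\middle|\, W_{0:t-1}\right),
\]
so that each summand can be controlled step by step, for instance through the Gaussian noise injected at step $t$. The number of summands grows with $T$, which is what the terminal-state bounds avoid.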

% Abundant insights are given along our development. For example, although our primary focus is on SGD rather GD, we interpret that the bound presented in \cite{xu2017information} can tend towards infinity for GD  from an {\em edge of stability} \citep{cohen2021gradient} perspective. 

% Proofs and additional results are given in Appendix.
% this provides two opportunities. 

% First, it is possible to apply the results of \citet{xu2017information} to bound the generalization error via the MI between training set and the weight output from SGD, which can be relaxed to the MI between training set and the weight trajectory of SGD. This latter quantity can be further decomposed into the sum of step-wise conditional mutual information terms via data-processing inequality and the chain rule. Through a variational relaxation of KL divergence (Lemma \ref{lem:cmi-golden formula}), 
% we obtain a generalization bound (Theorem \ref{thm:isotropic-prior-bound}) in the form of a summation over training steps of a quantity that involves both the sensitivity of the full-batch gradient to the variation of the training set and the covariance of the gradient noise (which makes the SGD gradient deviate from the full-batch gradient). Other choices of variational relaxation of KL divergence are also exploited to give tighter bounds (e.g., Theorem \ref{thm:anisotropic-prior-bound}), where the population gradient covariance and also the covariance of the gradient noise are involved, the former only depending on the data distribution and the used model and the latter depending on the SGD dynamics. These bounds justify the significance of (the trace of) the gradient noise covariance in the generalization ability of SGD. They also suggest that the choice of model and data distribution are also factor impacting the generalization of SGD.  Additionally, these results also provide justifications for the effectiveness of regularization by controlling the gradient norm, a technique suggested by previous works. 

% Second, under mild assumptions, it is possible to obtain an estimate of the steady-state weight distribution of SDE. Using this estimate, we apply the PAC-Bayes-like information-theoretic bounds developed in \citet{negrea2019information} to obtain a generalization upper bound in terms of the KL divergence between the steady-state weight distribution of SGD with respect to a data-dependent prior distribution. Trivially choosing the prior distribution to be a Gaussian independent of training data (which reduces to MI bound of \citet{xu2017information}), we obtain a bound
% (Theorem~\ref{thm:pacbayes-isotropic-prior}) in terms of the distance of the weight output from SGD  to the prior estimate (e.g. initialization of weights). It is more interesting  choosing the prior as the steady-state weight distribution obtained by SGD on the same training set but with one example held out. In this case, the bound we obtain (Theorem \ref{thm:pacbayes-data-dependent-prior}) can be elegantly expressed using the influence function \citep{koh2017understanding}, which suggests that the generalization of the SGD is related to the stability of SGD. Comparing to the first family of bounds (i.e., those based on the MI between the training set and training trajectories), the second family of bounds directly bound the generalization error via the terminal state, which avoids summing over training steps; these bounds are in general tighter when the steady-state estimates are accurate. On the other hand, not relying on the steady-state estimates and the approximating assumptions they base upon is arguably an advantage of the first family.

% Abundant insights are given along our development, and the presented bounds are also validated numerically. Proofs and additional results are given in Appendix.
% \looseness=-1 


%Their work indeed motives us to derive generalization bounds based on the SDE approximation of SGD.
% In this paper, we model the SGD dynamics by the SDE approximation. 
%Specifically, since the gradient noise resulting from random sampling of the mini-batches is Gaussian under this approximation, SGD can be regarded as a noisy iterative process, it becomes convenient to apply  information-theoretic bounds \citep{xu2017information} to analyse generalization.
% Bear in mind that SGD is approximated by SDE, 
%Hence we use the original MI generalization bound \citep{xu2017information} to study the generalization of SGD.
%Concretely, the mutual information between the final solution of SGD and the training sample will be upper bounded by the conditional MI of the full training trajectories and the training sample. 
%Notice that in this framework, MI or conditional MI can always be converted to an expected KL divergence between a posterior and a prior (see Lemma \ref{lem:cmi-golden formula}). We then consider both the isotropic Gaussian prior and the non-isotropic Gaussian prior to derive Theorem \ref{thm:isotropic-prior-bound} and Theorem~\ref{thm:anisotropic-prior-bound}, respectively. These bounds justify the significance of trace of the gradient noise covariance in the generalization ability of SGD.  Additionally, our results also provide justifications to the effectiveness of improving generalization by controlling the gradient norm, a regularization technique suggested by previous works \citep{barrett2020implicit,smith2020origin,jastrzebski2021catastrophic,wang2022generalization,geiping2021stochastic}.

% Futher, by applying the data-dependent prior technique \citep{negrea2019information,wang2021optimizing}, we can derive another bound in Theorem ?. To be precise, we use a subset of the training sample to process a parallel SGD prior dynamic, where the updating rule of weights is still approximated by SDE ??. Since the prior process and the real SGD process have the similar anisotropic gradient noise covariance, the obtained bound could be stronger than the result derived by using an isotropic Gaussian prior.

%Although characterizing generalization by the training trajectory is insightful, the number of summation terms in both Theorem \ref{thm:isotropic-prior-bound} and Theorem \ref{thm:anisotropic-prior-bound} will grow if the training iteration increases. Then when the number of iterations is large, the bound might become immensely loose if we use the full training trajectories in the MI bound.This is, however, in stark contrast with the empirical observations, in which the generalization gap between testing loss and training loss will stay at a stable value even when training is not stopped. 
% In essence, this undesirable issue is due to the application of the data processing inequality (DPI). 


 %To overcome this problem, we connect the analysis to a PAC-Bayes view, and consider two choices of the hypothesis posterior covariance. based on a classical result in \cite{mandt2017stochastic} (see Lemma \ref{lem:posterior-covariance}), and derive generalization bounds directly for the final output of SGD (see Theorem~\ref{thm:pacbayes-isotropic-prior}, Theorem~\ref{thm:pacbayes-data-dependent-prior} and Corollary~\ref{cor:IF-pacbayes-data-prior}). These new bounds either depend on the squared $L_2$ distance between the final output and other reference weight vector or depend on the trace of the inverse Hessian. Loosely speaking, the new bounds only focus on the final output of the algorithm without explicitly relying on the full training trajectories.
%, so some redundant information is waived.  
%Another choice of posterior covariance is the inverse Fisher information matrix, giving rise to a bound (Theorem~\ref{thm:IF-pacbayes-FIM}) very similar to Takeuchi Information Criterion \cite{takeuchi1976distribution}. 
% Furthermore, it's still possible to unroll the new bound to the summation of full training trajectories, which connect these two points of view (see Corollary \ref{cor:pacbayes-gradient}). 

% The generalization bounds presented in this paper are also validated numerically.
% \vspace{-3mm}

% }


