Effect of Random Learning Rate: Theoretical Analysis of SGD Dynamics in Non-Convex Optimization via Stationary Distribution

TMLR Paper 4002 Authors

17 Jan 2025 (modified: 15 Feb 2025) · Under review for TMLR · CC BY 4.0
Abstract: We consider a variant of stochastic gradient descent (SGD) with a random learning rate and reveal its convergence properties. SGD is a widely used stochastic optimization algorithm in machine learning, especially deep learning. Numerous studies have revealed the convergence properties of SGD and its simplified variants. Among these, analyses of convergence based on the stationary distribution of the updated parameters provide generalizable results. However, to obtain a stationary distribution, the update direction of the parameters must not degenerate, which limits the applicable variants of SGD. In this study, we consider a novel SGD variant, Poisson SGD, whose parameter update directions degenerate and which instead utilizes a random learning rate. We demonstrate that the distribution of parameters updated by Poisson SGD converges to a stationary distribution under weak assumptions on the loss function. Based on this, we further show that Poisson SGD finds global minima in non-convex optimization problems, and we also evaluate the generalization error of this method. As a proof technique, we approximate the distribution generated by Poisson SGD with that of the bouncy particle sampler (BPS) and derive its stationary distribution, using recent theoretical advances in piecewise-deterministic Markov processes (PDMPs).
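For intuition, a minimal sketch of what an SGD update with a re-drawn random learning rate might look like; the exponential step-size law, the function name, and the `rate` parameter are illustrative assumptions on our part, not the paper's Algorithm 1:

```python
import numpy as np

def random_lr_sgd(grad_fn, theta0, n_steps, rate=1.0, seed=0):
    """Sketch of SGD whose learning rate is drawn afresh at every update.
    The exponential law with mean 1/rate is an assumption made here for
    illustration; the paper's Poisson SGD (Algorithm 1) specifies its own
    distribution and update rule."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        eta = rng.exponential(1.0 / rate)      # random learning rate
        theta = theta - eta * grad_fn(theta)   # stochastic gradient step
    return theta

# Usage: minimize a simple quadratic, f(t) = t^2, so grad f(t) = 2t.
print(random_lr_sgd(lambda t: 2 * t, np.array([1.0]), n_steps=100, rate=10.0))
```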
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=iRvwtiAaDy
Changes Since Last Submission: In the previous submission, several issues, such as those listed below, were pointed out, and we have addressed each of them individually.

> **C1**. Experimental results are not sufficiently explained, and the authors' rebuttal does not support their claim. The baselines against which the authors benchmarked are simplified, and the experimental setting for the provided baseline methods is unclearly stated.

We made two major updates. First, we provided more detailed descriptions of the experimental setup in Section A.7, including the data generation processes, specifics of the models used, hyperparameters of the algorithms, and the settings of the baseline methods. Second, in Section 7.1, we plotted the distribution of parameters generated by the proposed method, Poisson SGD (Figure 2), as well as the distance between this distribution and the theoretical stationary distribution (Figure 3). Through these updates, we directly validated Theorem 1 by demonstrating experimentally that the parameter distribution generated by the algorithm approaches the theoretical stationary distribution we derived.

> **C2**. Several issues around assumptions, e.g., boundedness of the feasible set and the absence of a projection step.

To address this issue, we modified the parameter space $\Theta$ to be the torus $(\mathbb{R}/a\mathbb{Z})^d$ with $a>0$. Additionally, we adapted the design of Poisson SGD (Algorithm 1), the convergence theorem (Theorem 1) and its proof, and the experimental results (Section 7) to this torus setting. Specifically, we incorporated the mod operation into the algorithm and established the convergence theorems and experiments under this modification (see the sketch below). As a result, we can handle a bounded parameter space without introducing a projection step, thereby achieving theoretical consistency.

> **C3**. Reviewers found a significant gap between the theoretical bounds and the experimental results.

In summary, we took this feedback seriously and reinforced the theoretical setting by changing the parameter space and updating the algorithm and theory. Furthermore, by adding experiments that demonstrate convergence to the stationary distribution, we bridged the gap between theory and experiments.
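To make the C2 modification concrete, here is a minimal sketch of a wrapped update on the torus; the function name and signature are ours for illustration, and the paper's Algorithm 1 remains the authoritative version:

```python
import numpy as np

def torus_update(theta, grad, eta, a):
    """Gradient step followed by the mod operation described in C2: the
    iterate is wrapped back onto the torus (R/aZ)^d, i.e. into [0, a)^d,
    which keeps the parameter space bounded without a projection step."""
    return np.mod(theta - eta * grad, a)

# Example: a step that would leave [0, a)^d wraps around instead.
theta = np.array([0.1, 0.9])
print(torus_update(theta, grad=np.array([1.0, -1.0]), eta=0.3, a=1.0))
# -> [0.8 0.2]
```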
Assigned Action Editor: ~Murat_A_Erdogdu1
Submission Number: 4002
