Keywords: gradient clipping, optimization dynamics, non-convex optimization, Kurdyka-Łojasiewicz condition, scaling laws, edge of stability, adaptive optimizers, δ-GClip, protein diffusion
Abstract: This work provides an extended empirical and theoretical analysis of the recently proposed $\delta$-GClip, a variant of gradient clipping with a formal convergence guarantee. Our experiments analyze activation patterns, gradient dynamics, and the dependence of $\delta$ on architectural scale across supervised benchmarks, diffusion models, and a lightweight protein-generation task. In particular, we show that combining Adam with a brief $\delta$-clipping warm-up improves stability and early-phase optimization in diffusion model training. Using the Kurdyka–Łojasiewicz framework, we further extend the convergence guarantees of $\delta$-GClip beyond the squared-loss setting to more general smooth non-convex objectives.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 82