Keywords: Randomization, Low-Rank Gradients, Efficiency, Efficient LLM Training
Abstract: Low-rank gradient optimization for large language models is currently divided into two categories: structured methods that rigorously identify subspaces, and randomized approaches employed primarily for computational efficiency. We question the intuition behind why random projections are effective, tracing this phenomenon to the geometry of the gradient space. Finding that the subspace optimization landscape is nearly flat, while a significant portion of gradient information lies outside the core subspace, we introduce GrassWalk and GrassJump, algorithms that navigate the Grassmannian manifold via random walks and jumps. By coupling this randomized exploration with a subspace-aware optimizer and recovering the lost gradient signals, we achieve state-of-the-art results. Our findings reframe randomization not merely as a computational shortcut, but as a geometrically principled approach to high-dimensional optimization.
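To make the terms in the abstract concrete, the following is a minimal sketch of what a random "jump" and a random "walk" between subspaces could look like, together with the split of a gradient into its in-subspace part and the residual that lies outside it. It is an illustration under stated assumptions, not the GrassWalk or GrassJump update rules: the function names, the Gaussian perturbation, the `step` parameter, and the Gram-Schmidt retraction are all hypothetical choices made for this example.

```python
import math
import random


def orthonormalize(vectors):
    """Modified Gram-Schmidt: return an orthonormal basis spanning `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            dot = sum(x * y for x, y in zip(w, b))
            w = [x - dot * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        basis.append([x / norm for x in w])
    return basis


def random_jump(m, r, rng):
    """Assumed 'jump': draw a fresh random r-dimensional subspace of R^m."""
    return orthonormalize([[rng.gauss(0, 1) for _ in range(m)] for _ in range(r)])


def random_walk_step(basis, step, rng):
    """Assumed 'walk': perturb the current basis slightly, then re-orthonormalize.

    This perturb-and-retract move is a stand-in for a step on the Grassmannian;
    the paper's actual update may differ.
    """
    noisy = [[x + step * rng.gauss(0, 1) for x in b] for b in basis]
    return orthonormalize(noisy)


def project(basis, grad):
    """Split an m x n gradient (list of rows) into a low-rank part and a residual.

    Returns (low, resid): `low` is the r x n projected gradient, and `resid` is
    the m x n component of `grad` that lies outside the subspace.
    """
    m, n, r = len(grad), len(grad[0]), len(basis)
    low = [[sum(basis[k][i] * grad[i][j] for i in range(m)) for j in range(n)]
           for k in range(r)]
    resid = [[grad[i][j] - sum(basis[k][i] * low[k][j] for k in range(r))
              for j in range(n)]
             for i in range(m)]
    return low, resid


if __name__ == "__main__":
    rng = random.Random(0)
    Q = random_jump(6, 2, rng)           # fresh subspace
    Q = random_walk_step(Q, 0.1, rng)    # small move to a nearby subspace
    G = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(6)]
    low, resid = project(Q, G)
    print(len(low), len(low[0]), len(resid), len(resid[0]))
```

In this sketch the residual returned by `project` corresponds to the gradient information outside the chosen subspace; how that signal is recovered and fed to the optimizer is specific to the paper and is not reproduced here.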
Submission Number: 113