Taming LLMs with Gradient Grouping

ACL ARR 2025 February Submission 688 Authors

10 Feb 2025 (modified: 09 May 2025) · CC BY 4.0
Abstract: Training large language models (LLMs) poses unique challenges due to their vast number of parameters and complex architectures. While adaptive optimizers like Adam help mitigate parameter-wise gradient variations, they often struggle to estimate optimal learning rates across the complex parameter space of LLMs, resulting in training instability, inefficient convergence, and poor compatibility with parameter-efficient fine-tuning (PEFT) techniques such as LoRA. Inspired by the inherent low-rank structure of LLMs, we introduce \textbf{Scaling with Gradient Grouping (SGG)}, an optimizer wrapper that exploits this structure to adapt learning rates through gradient clustering and cluster-specific scaling. SGG groups gradients into clusters every $k$ iterations and computes cluster-specific statistics to calibrate step sizes, reducing learning rate estimation from high-dimensional intricacies to low-dimensional clusters. As a modular wrapper, SGG integrates effortlessly with mainstream optimizers while preserving compatibility with LoRA. Experiments on C4, GLUE, and Alpaca demonstrate SGG's superiority in achieving faster convergence, lower losses, and reduced training oscillation compared to existing methods. Its robustness across varying batch sizes and schedulers makes SGG a promising tool in the ongoing quest towards efficient and effective LLM optimization.
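The abstract's core mechanism (cluster gradients periodically, then share one calibrated step size per cluster) can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the clustering here is a simple 1-D k-means over per-parameter gradient-norm statistics, and the scaling rule (shrink the learning rate for clusters with large average gradient norm) is a hypothetical stand-in, since the paper's exact statistics and scaling formula are not given in the abstract.

```python
import numpy as np

def cluster_gradients(grad_norms, n_clusters=2, n_iters=10, seed=0):
    """Simple 1-D k-means over per-parameter gradient-norm statistics.

    grad_norms: 1-D array, one summary statistic per parameter (group).
    Returns cluster labels and cluster centers.
    """
    rng = np.random.default_rng(seed)
    centers = rng.choice(grad_norms, size=n_clusters, replace=False)
    for _ in range(n_iters):
        # Assign each parameter to its nearest cluster center.
        labels = np.argmin(np.abs(grad_norms[:, None] - centers[None, :]), axis=1)
        # Update each non-empty cluster center to the mean of its members.
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = grad_norms[labels == c].mean()
    return labels, centers

def cluster_scaled_lrs(base_lr, grad_norms, labels, centers):
    """Assign one calibrated step size per cluster.

    Hypothetical scaling rule for illustration: clusters whose average
    gradient norm exceeds the global mean get a proportionally smaller
    learning rate, and vice versa.
    """
    global_mean = grad_norms.mean()
    scales = global_mean / (centers + 1e-8)
    return base_lr * scales[labels]
```

In an optimizer-wrapper setting, the clustering step would run only every $k$ iterations, and the resulting per-cluster scales would multiply the base optimizer's update between re-clusterings.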
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Optimization, Low-rank, LLMs, MLLMs
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Approaches for low compute settings-efficiency
Languages Studied: English, Chinese
Submission Number: 688