Keywords: Language Modeling, Optimization, Neural Network Architectures, LLMs, MLLMs
TL;DR: This paper proposes an optimizer wrapper (SAC) for modern DNNs that constrains the adaptive learning rate through hierarchical estimation of optimizer states and equalization scaling across architectural levels.
Abstract: The design of optimizers for modern Large Language Models (LLMs) is governed by a critical trade-off between performance, memory footprint, and computational throughput. High-accuracy methods, such as those exploiting gradient preconditioning, are often memory-intensive and may introduce significant computational overhead, while efficient ones like GaLore may not reach the same performance level. In this work, we present Scaling with Architectural Constraints (SAC), an optimizer wrapper that navigates these competing demands for the first time. SAC enhances existing adaptive optimizers by modulating per-parameter learning rates with lightweight, hierarchical constraints derived from the model architecture. On the C4 pre-training benchmark, SAC+AdamW achieves state-of-the-art perplexity across model sizes from 60M to 3B parameters, converging faster without incurring the high costs of complex preconditioning. It also enhances training stability, showing robustness across varied learning rates and batch sizes. Qualitatively, empirical analysis shows that SAC fosters a more coordinated optimization process, leading to improved gradient dynamics. Its versatility is further validated by strong results across downstream tasks and domains, including long-sequence modeling, parameter-efficient fine-tuning, image classification with diverse models such as ViTs and CNNs, and evaluations on multimodal benchmarks.
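To make the "optimizer wrapper" idea concrete, below is a minimal, hypothetical sketch of a wrapper that groups parameters by module and modulates each group's learning rate with a module-level gradient statistic before delegating to AdamW. The grouping rule, the equalization factor, and the `ScaledWrapper` class are illustrative assumptions only and are not the paper's SAC algorithm.

```python
# Hypothetical sketch: an optimizer wrapper that modulates per-group learning
# rates at an architectural (module) level. NOT the paper's SAC method.
import torch
from torch import nn


class ScaledWrapper:
    def __init__(self, model: nn.Module, base_lr: float = 1e-3):
        # One parameter group per top-level module, so learning rates can be
        # adjusted per architectural unit rather than globally.
        groups = []
        for module in model.children():
            params = [p for p in module.parameters() if p.requires_grad]
            if params:
                groups.append({"params": params, "lr": base_lr})
        self.base_lr = base_lr
        self.inner = torch.optim.AdamW(groups)

    def zero_grad(self):
        self.inner.zero_grad()

    @torch.no_grad()
    def step(self):
        # Compute each module's gradient RMS and use its ratio to the mean RMS
        # as a crude equalization factor (an assumed form, for illustration).
        rms = []
        for group in self.inner.param_groups:
            sq_sum, count = 0.0, 0
            for p in group["params"]:
                if p.grad is not None:
                    sq_sum += p.grad.pow(2).sum().item()
                    count += p.grad.numel()
            rms.append((sq_sum / max(count, 1)) ** 0.5)
        mean_rms = sum(rms) / len(rms) + 1e-12
        for group, r in zip(self.inner.param_groups, rms):
            # Modules with unusually large gradients get a smaller step and
            # vice versa; clamping keeps the modulation lightweight.
            scale = min(max(mean_rms / (r + 1e-12), 0.5), 2.0)
            group["lr"] = self.base_lr * scale
        self.inner.step()


# Usage sketch on a toy model.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
opt = ScaledWrapper(model, base_lr=1e-3)
x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
opt.zero_grad()
```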
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4105