Keywords: Language Modeling, Optimization, Neural Network Architectures, LLMs, MLLMs
TL;DR: This paper proposes an optimizer wrapper (SAC) for modern DNNs that constrains the adaptive learning rate through hierarchical estimation of optimizer states and equalization scaling across architectural levels.
Abstract: The design of optimizers for modern Large Language Models (LLMs) is governed by a critical trade-off between performance, memory footprint, and computational throughput. High-accuracy methods, such as those exploiting gradient preconditioning, are often memory-intensive and may introduce significant computational overhead, while efficient ones like GaLore may not reach the same performance level. In this work, we present Scaling with Architectural Constraints (SAC), an optimizer wrapper that navigates these competing demands for the first time. SAC enhances existing adaptive optimizers by modulating per-parameter learning rates with lightweight, hierarchical constraints derived from the model architecture. On the C4 pre-training benchmark, SAC+AdamW achieves state-of-the-art perplexity across model sizes from 60M to 3B parameters, converging faster without incurring the high costs of complex preconditioning. It also enhances training stability, showing robustness across varied learning rates and batch sizes. Qualitatively, empirical analysis shows that SAC fosters a more coordinated optimization process, leading to improved gradient dynamics. Its versatility is further validated by strong results across downstream tasks and domains, including long-sequence modeling, parameter-efficient fine-tuning, image classification with diverse models such as ViTs and CNNs, and evaluations on multimodal benchmarks.
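To make the "optimizer wrapper" idea concrete, below is a minimal, hypothetical sketch of a wrapper that groups parameters by module and modulates each group's learning rate with a module-level gradient statistic before delegating to AdamW. The grouping rule, the equalization factor, and the `ScaledWrapper` class are illustrative assumptions only and are not the paper's SAC algorithm.

```python
# Hypothetical sketch: an optimizer wrapper that modulates per-group learning
# rates at an architectural (module) level. NOT the paper's SAC method.
import torch
from torch import nn


class ScaledWrapper:
    def __init__(self, model: nn.Module, base_lr: float = 1e-3):
        # One parameter group per top-level module, so learning rates can be
        # adjusted per architectural unit rather than globally.
        groups = []
        for module in model.children():
            params = [p for p in module.parameters() if p.requires_grad]
            if params:
                groups.append({"params": params, "lr": base_lr})
        self.base_lr = base_lr
        self.inner = torch.optim.AdamW(groups)

    def zero_grad(self):
        self.inner.zero_grad()

    @torch.no_grad()
    def step(self):
        # Compute each module's gradient RMS and use its ratio to the mean RMS
        # as a crude equalization factor (an assumed form, for illustration).
        rms = []
        for group in self.inner.param_groups:
            sq_sum, count = 0.0, 0
            for p in group["params"]:
                if p.grad is not None:
                    sq_sum += p.grad.pow(2).sum().item()
                    count += p.grad.numel()
            rms.append((sq_sum / max(count, 1)) ** 0.5)
        mean_rms = sum(rms) / len(rms) + 1e-12
        for group, r in zip(self.inner.param_groups, rms):
            # Modules with unusually large gradients get a smaller step and
            # vice versa; clamping keeps the modulation lightweight.
            scale = min(max(mean_rms / (r + 1e-12), 0.5), 2.0)
            group["lr"] = self.base_lr * scale
        self.inner.step()


# Usage sketch on a toy model.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
opt = ScaledWrapper(model, base_lr=1e-3)
x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
opt.zero_grad()
```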
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4105