Keywords: Large Language Models; Efficient Training; Low-Rank; LoRA
Abstract: Modern optimizers such as Adam and Muon are central to training large language models, but their reliance on first- and second-moment estimates (momenta) introduces significant memory overhead that constrains scalability and computational efficiency.
In this work, we reframe the exponential moving average (EMA) underlying these moment estimates as the training of a linear regressor via online gradient flow.
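As a concrete illustration of this equivalence (a standard identity, stated here in its discrete-step form with step size $1-\beta$ chosen for exposition), one EMA update coincides with one step of online gradient descent on a quadratic loss that tracks the incoming gradient $g_t$:
$$\ell_t(m) = \tfrac{1}{2}\,\lVert m - g_t\rVert^2, \qquad m_t = m_{t-1} - (1-\beta)\,\nabla \ell_t(m_{t-1}) = \beta\, m_{t-1} + (1-\beta)\, g_t.$$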
Building on this equivalence, we introduce LoRA-Pre, a novel low-rank optimizer designed for efficient pre-training.
Specifically, LoRA-Pre reduces the optimizer's memory footprint by restricting the full momentum matrix to a compact low-rank subspace within the online linear learner, preserving optimization performance at a much lower memory cost.
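To make this concrete, the following minimal PyTorch sketch keeps the momentum state in a fixed rank-r subspace through a projection matrix P; the function, the projection-based update rule, and all hyperparameters are illustrative assumptions, not the exact LoRA-Pre algorithm.

import torch

@torch.no_grad()
def low_rank_momentum_step(W, grad, M_r, P, lr=1e-3, beta=0.9):
    # Illustrative low-rank momentum update (an assumption, not the LoRA-Pre rule).
    # W:    weight matrix, shape (m, n)
    # grad: gradient of W, shape (m, n)
    # M_r:  low-rank momentum state, shape (r, n), stored instead of (m, n)
    # P:    projection with orthonormal columns, shape (m, r)
    g_r = P.T @ grad                          # project the gradient into the rank-r subspace
    M_r.mul_(beta).add_(g_r, alpha=1 - beta)  # EMA maintained entirely in the subspace
    W.add_(P @ M_r, alpha=-lr)                # map back to full space for the weight update
    return W, M_r

# Example: a 1024 x 4096 layer with rank 32 stores a 32 x 4096 momentum
# instead of the full 1024 x 4096 matrix.
m, n, r = 1024, 4096, 32
W, grad = torch.randn(m, n), torch.randn(m, n)
P, _ = torch.linalg.qr(torch.randn(m, r))     # orthonormal basis for the subspace (illustrative)
M_r = torch.zeros(r, n)
W, M_r = low_rank_momentum_step(W, grad, M_r, P)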
We empirically validate LoRA-Pre's efficacy by pre-training models from the Llama architecture family, scaling from 60M to 1B parameters.
Across all model sizes, LoRA-Pre achieves the best performance among the compared baselines.
Notably, LoRA-Pre demonstrates remarkable rank efficiency, achieving comparable or superior results using only 1/8 the rank of baseline methods.
Beyond pre-training, we evaluate LoRA-Pre's effectiveness in fine-tuning scenarios.
With the same rank, LoRA-Pre consistently outperforms all efficient fine-tuning baselines.
Specifically, compared to standard LoRA, LoRA-Pre achieves substantial improvements of 3.14 points on Llama-3.1-8B and 6.17 points on Llama-2-7B, validating our approach's effectiveness across both pre-training and fine-tuning paradigms.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3913