Abstract: Recent low-rank training methods, such as GaLore, have significantly reduced the memory required to optimize large language models (LLMs). However, these methods often suffer from time-consuming low-rank projection estimation. In particular, the singular value decomposition (SVD) in GaLore can consume more than 80\% of the total training time. To address this issue, we propose CrossLore, which uses cross-head low-rank projection to reduce the substantial time spent estimating low-rank projections for multi-head attention. In addition, we employ randomized subspace iteration to achieve fast SVD. To further enhance performance, we propose sparsely coded residuals to reduce the errors that low-rank approximation introduces into the first- and second-order moments of the optimizer and the weight updates. We evaluate CrossLore on arithmetic reasoning and natural language generation datasets. Our experiments demonstrate that CrossLore delivers superior performance while achieving an approximately $4\times$ fine-tuning speedup over vanilla GaLore.
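For readers unfamiliar with the fast-SVD component mentioned in the abstract, the following is a minimal sketch of standard randomized subspace iteration (in the style of Halko et al., 2011) for approximating the top-$r$ singular subspace of a gradient matrix. It is a generic illustration under assumed PyTorch tensors, not CrossLore's actual implementation; the function name, oversampling size, and iteration count are hypothetical choices.

```python
import torch

def randomized_subspace_iteration(G, rank, n_iter=2, oversample=8):
    """Approximate the top-`rank` singular triplets of G (m x n) via
    randomized subspace iteration. Illustrative sketch only; not the
    authors' code."""
    m, n = G.shape
    k = min(rank + oversample, n)
    # Random Gaussian test matrix sketches the range of G.
    Omega = torch.randn(n, k, device=G.device, dtype=G.dtype)
    Y = G @ Omega                        # (m, k) range sketch
    Q, _ = torch.linalg.qr(Y)            # orthonormal basis for the sketch
    for _ in range(n_iter):              # power iterations sharpen the subspace
        Z, _ = torch.linalg.qr(G.T @ Q)
        Q, _ = torch.linalg.qr(G @ Z)
    # A small SVD on the projected matrix recovers approximate singular vectors.
    B = Q.T @ G                          # (k, n), much smaller than G
    U_hat, S, Vh = torch.linalg.svd(B, full_matrices=False)
    U = Q @ U_hat
    return U[:, :rank], S[:rank], Vh[:rank]
```

Because only QR factorizations and an SVD of the small $k \times n$ matrix are needed, this kind of routine avoids the full SVD whose cost the abstract identifies as the dominant bottleneck in vanilla GaLore.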
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: large language models, parameter-efficient fine-tuning, low-rank
Contribution Types: Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 3288