Abstract: Large language model (LLM) training and finetuning are often bottlenecked by limited GPU memory. While existing projection-based optimization methods address this by projecting gradients into a lower-dimensional subspace to reduce optimizer state memory, they typically rely on \textit{dense} projection matrices, which can introduce computational and memory overheads. In this work, we propose \textsc{Grass} (GRAdient Structured Sparsification), a novel approach that leverages \textit{sparse} projections to transform gradients into structured sparse updates. This design not only significantly reduces memory usage for optimizer states but also minimizes gradient memory footprint, computation, and communication costs, leading to substantial throughput improvements. Extensive experiments on pretraining and finetuning tasks demonstrate that \textsc{Grass} achieves comparable performance to full-rank training and existing projection-based methods. Notably, \textsc{Grass} enables half-precision pretraining of a 13B parameter LLaMA model on a single 40GB A100 GPU---a feat infeasible for previous methods---and yields up to a $2\times$ throughput improvement on an 8-GPU system.
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: parameter-efficient-training, NLP in resource-constrained settings
Contribution Types: Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 272
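To make the idea in the abstract concrete, the following is a minimal PyTorch sketch of projecting a gradient with a sparse, row-selection-style projection so that optimizer state lives only in a small subspace, in contrast to multiplying by a dense projection matrix. This is an illustrative sketch under stated assumptions, not the paper's implementation: the function names (`make_sparse_projection`, `project`, `project_back`), the uniform row-selection scheme, and the toy Adam-style step are all assumptions introduced here for clarity.

```python
import torch

def make_sparse_projection(d_out, k, generator=None):
    # Illustrative assumption: a "sparse projection" realized as uniform
    # row selection. Picking k of d_out rows means the projected gradient
    # (and its optimizer state) is k/d_out the size of the original.
    return torch.randperm(d_out, generator=generator)[:k]

def project(grad, idx):
    # Structured-sparse projection: equivalent to S @ grad where S is a
    # k x d_out selection matrix with a single 1 per row, so no dense
    # matrix multiply is needed.
    return grad[idx, :]

def project_back(subspace_update, idx, shape):
    # Scatter the k updated rows back into a full-size, structured-sparse
    # weight update (all other rows stay zero).
    full = torch.zeros(shape, dtype=subspace_update.dtype)
    full[idx, :] = subspace_update
    return full

# Toy usage: one Adam-style step on a single weight matrix.
d_out, d_in, k = 1024, 512, 64
W = torch.randn(d_out, d_in)
grad = torch.randn(d_out, d_in)

idx = make_sparse_projection(d_out, k)
g_proj = project(grad, idx)            # k x d_in instead of d_out x d_in

# Optimizer state (first/second moments) is kept only in the k-row subspace.
m = torch.zeros_like(g_proj)
v = torch.zeros_like(g_proj)
beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8
m = beta1 * m + (1 - beta1) * g_proj
v = beta2 * v + (1 - beta2) * g_proj ** 2
W -= project_back(lr * m / (v.sqrt() + eps), idx, W.shape)
```

The sketch only shows why a sparse (selection-style) projection avoids both the dense projection matmul and full-size optimizer state; how \textsc{Grass} chooses and rescales the selected rows, and where the projection is applied during training, are design choices of the method described in the paper itself.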