GRASS: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients

Published: 21 Jun 2024, Last Modified: 26 Jul 2024, ES-FoMo-II 2024 Poster, CC BY 4.0
Keywords: Training efficiency, Memory efficiency, Throughput, Sparse Projection, Structured Sparsity, Gradient Projection
TL;DR: GRASS (GRAdient Structured Sparsification) is designed to reduce memory and compute requirements for training LLMs by using structured sparse projections to transform gradients.
Abstract: Large language model (LLM) training and finetuning are often severely constrained by limited GPU memory. While parameter-efficient finetuning techniques like LoRA address this by learning low-rank weight updates, they frequently underperform compared to full-rank training, especially during pretraining. We propose GRASS (GRAdient Structured Sparsification), a novel approach that slashes LLM training memory and compute requirements without compromising performance. GRASS leverages sparse projections to transform gradients into structurally sparse gradients, significantly lowering memory usage for both optimizer states and gradient communication. This compression, in turn, unlocks substantial throughput improvements. Extensive experiments on pretraining and finetuning tasks demonstrate that GRASS achieves comparable performance to existing projection-based optimizers and full-rank training. Notably, GRASS enables pretraining a 13B parameter LLaMA model on a single 40GB A100 GPU---a feat infeasible for previous methods---and yields up to a $2\times$ throughput improvement on an 8-GPU system.
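To make the core idea concrete, here is a minimal NumPy sketch of a structured sparse gradient projection in the spirit the abstract describes: the full gradient is projected onto a small set of selected rows, so optimizer states and communicated gradients live in a compressed space. The row-selection rule used here (top-k rows by gradient norm) and the function names are illustrative assumptions, not GRASS's actual algorithm.

```python
import numpy as np

def sparse_project(grad: np.ndarray, k: int):
    """Select k rows of the gradient (here: largest row norms, an
    illustrative choice) and return their indices plus the compressed
    (k x n) gradient that the optimizer would operate on."""
    row_norms = np.linalg.norm(grad, axis=1)
    idx = np.argsort(row_norms)[-k:]       # indices of the k largest-norm rows
    return idx, grad[idx]                  # structured sparse gradient

def project_back(idx: np.ndarray, compressed: np.ndarray, shape):
    """Lift a compressed update back to the full parameter shape,
    producing a structurally sparse full-size update."""
    full = np.zeros(shape)
    full[idx] = compressed
    return full

m, n, k = 8, 4, 2
grad = np.random.randn(m, n)               # full (m x n) gradient
idx, small = sparse_project(grad, k)       # optimizer state would be (k x n)
update = project_back(idx, small, grad.shape)
```

Because the optimizer only ever sees the `(k, n)` compressed gradient, its moment buffers shrink by a factor of roughly `m / k`, which is the source of the memory savings the abstract claims.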
Submission Number: 65