SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This paper introduces SparseLoRA, a method that uses contextual sparsity to accelerate LLM fine-tuning, cutting compute by up to 2× and runtime by up to 1.5× while maintaining model accuracy on reasoning, coding, and instruction-following tasks.
Abstract: Fine-tuning LLMs is both compute- and memory-intensive. While parameter-efficient fine-tuning methods, such as QLoRA and DoRA, reduce the number of trainable parameters and lower memory usage, they do not decrease computational cost. In some cases, they may even slow down fine-tuning. In this paper, we introduce SparseLoRA, a method that accelerates LLM fine-tuning through contextual sparsity. We propose a lightweight, training-free SVD sparsity estimator that dynamically selects a sparse subset of weights for loss and gradient computation. We also systematically analyze and address sensitivity across layers, tokens, and training steps. Our experimental results show that SparseLoRA reduces computational cost by up to $2.0\times$ and yields a measured speedup of up to $1.5\times$ while maintaining accuracy across various downstream tasks, including commonsense and arithmetic reasoning, code generation, and instruction following.
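To make the idea of a training-free SVD sparsity estimator concrete, here is a minimal sketch of one way such an estimator could work; it is not the authors' implementation (see the linked repository for that), and the function names (`build_svd_estimator`, `estimate_active_channels`) and the scoring/selection details are illustrative assumptions. Offline, each weight matrix is approximated by a low-rank SVD; at each step, a cheap low-rank product predicts which output channels matter for the current inputs, and only those channels participate in the dense computation.

```python
# Hypothetical sketch of an SVD-based contextual sparsity estimator.
# Assumption: channel importance is scored via a rank-r proxy of x @ W^T.
import torch

def build_svd_estimator(weight: torch.Tensor, rank: int = 8):
    """Precompute rank-r SVD factors of `weight` (out_features x in_features)."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # (out, r)
    B = Vh[:rank, :]                    # (r, in)
    return A, B

def estimate_active_channels(x: torch.Tensor, A: torch.Tensor,
                             B: torch.Tensor, keep_ratio: float = 0.5):
    """Cheaply score output channels for inputs `x` (tokens x in_features)
    and return indices of the top-scoring channels to keep."""
    approx = (x @ B.T) @ A.T            # (tokens, out): low-rank proxy of x @ W^T
    scores = approx.abs().sum(dim=0)    # per-channel importance over the batch
    k = max(1, int(keep_ratio * scores.numel()))
    return torch.topk(scores, k).indices

# Usage: run the dense projection only on the predicted channels.
W = torch.randn(1024, 1024)
x = torch.randn(16, 1024)
A, B = build_svd_estimator(W, rank=8)
idx = estimate_active_channels(x, A, B, keep_ratio=0.5)
y_sparse = x @ W[idx].T                 # compute only the selected output channels
```

Because the rank-r factors are fixed after a one-time SVD, the selection step costs only two small matrix products per layer, which is what keeps the estimator lightweight and training-free.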
Lay Summary: Fine-tuning large language models (LLMs) for new tasks typically requires significant compute and memory. Recent techniques, like QLoRA and DoRA, make fine-tuning more memory-efficient by reducing how many model parameters change during training. However, these methods do not make fine-tuning faster, and can even slow it down. In our work, we introduce SparseLoRA, a new approach that speeds up fine-tuning by carefully choosing only a small, important subset of the model's weights to use at each training step. We use a lightweight, training-free estimator based on singular value decomposition (SVD) to efficiently predict, from the input activations and weight characteristics, which parts of the model can be skipped during training. We also thoroughly analyze how this method behaves across layers, input tokens, and training phases to ensure stability. Experiments show that SparseLoRA cuts computational costs by up to $2\times$ and speeds up fine-tuning by up to $1.5\times$, all without sacrificing model accuracy on tasks like reasoning, coding, and instruction following. Our work offers a scalable and computationally efficient solution for fine-tuning modern LLMs.
Link To Code: https://github.com/z-lab/sparselora
Primary Area: Deep Learning->Large Language Models
Keywords: large language models, efficient fine-tuning, sparsity, peft
Submission Number: 14607