Sparsity Distribution Matters: REACT for Accelerating Large Language Models

ICLR 2026 Conference Submission 18575 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: inference methods; sparse models; pruning
TL;DR: Sparsity Distribution Matters—REACT optimizes activation sparsity for up to 1.33× faster LLM decoding without finetuning.
Abstract: Efficient inference for large language models (LLMs) is critical for real-world deployment, yet it requires substantial computational and memory resources. Activation sparsity alleviates these demands by skipping low-magnitude activations, which reduces both arithmetic operations and memory access. However, existing methods focus primarily on maximizing overall sparsity and overlook how sparsity is distributed across the inference network. Our empirical study of current methods reveals that the sparsity distribution is more critical for acceleration than the overall sparsity ratio. We therefore propose REACT, a training-free sparsification method that optimizes the sparsity distribution within the Multi-Layer Perceptron (MLP) module, improving inference speed without sacrificing model performance. Specifically, we empirically select the best location for sparsification within the MLP and develop an optimized sparsity-aware GPU kernel for inference, which reduces memory-access overhead and improves computational efficiency. Our experiments on LLaMA2-7B and Mistral-7B demonstrate that REACT achieves speedups of 1.26× and 1.33×, respectively, while maintaining nearly the same accuracy as their dense baselines. These results highlight the importance of rethinking sparsity distribution for efficient LLM inference.
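To make the idea of sparsifying activations inside an MLP concrete, the following is a minimal, hypothetical PyTorch sketch of magnitude-based activation sparsification in a LLaMA-style gated MLP. The sparsification point, the threshold tau, and the class name are illustrative assumptions, not REACT's exact recipe; the paper's reported speedups additionally depend on its sparsity-aware GPU kernel, which is not reproduced here.

    # Hypothetical sketch: thresholding the intermediate activation of a
    # LLaMA-style gated MLP (SwiGLU). Assumed names and threshold value.
    import torch
    import torch.nn as nn

    class SparsifiedGatedMLP(nn.Module):
        def __init__(self, hidden_size: int, intermediate_size: int, tau: float = 0.05):
            super().__init__()
            self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
            self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
            self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
            self.act_fn = nn.SiLU()
            self.tau = tau  # assumed magnitude threshold (hyperparameter)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Intermediate activation of the gated MLP.
            h = self.act_fn(self.gate_proj(x)) * self.up_proj(x)
            # Zero out low-magnitude entries; a sparsity-aware kernel would
            # skip the corresponding rows of down_proj instead of multiplying by zero.
            h = torch.where(h.abs() >= self.tau, h, torch.zeros_like(h))
            return self.down_proj(h)

In a dense PyTorch forward pass this thresholding alone does not save time; the acceleration comes from a kernel that avoids loading weight rows associated with zeroed activations, which is the memory-access saving the abstract refers to.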
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18575