LaRoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations: they either demand time-consuming recovery training that hinders real-world adoption, or rely on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces LaRoSA (**La**yerwise **Ro**tated **S**parse **A**ctivation), a novel method for activation sparsification designed to improve LLM efficiency without requiring additional training or magnitude-based pruning. We leverage layerwise orthogonal rotations to transform input activations into rotated forms that are more suitable for sparsification. By employing a Top-K selection approach within the rotated activations, we achieve consistent model-level sparsity and reliable wall-clock time speed-up. LaRoSA is effective across various sizes and types of LLMs, demonstrating minimal performance degradation and robust inference acceleration. Specifically, for LLaMA2-7B at 40\% sparsity, LaRoSA achieves a mere 0.17 perplexity gap with a consistent 1.30× wall-clock time speed-up, and reduces the accuracy gap in zero-shot tasks relative to the dense model to just 0.54\%, surpassing TEAL by 1.77\% and CATS by 17.14\%.
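
The core mechanism described in the abstract, rotating activations with a per-layer orthogonal matrix and applying Top-K selection in the rotated basis, can be sketched in a few lines. The PyTorch snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the function `rotated_topk_sparsify`, the rotation matrix `Q`, and its QR-based construction are hypothetical stand-ins, and a real deployment would fold the rotation into adjacent weight matrices and skip the zeroed computations to obtain the wall-clock speed-up.

```python
import torch

def rotated_topk_sparsify(x: torch.Tensor, Q: torch.Tensor, k: int) -> torch.Tensor:
    """Illustrative sketch (not the authors' code): rotate activations with an
    orthogonal matrix Q, keep the k largest-magnitude entries per token, and
    rotate back. A fixed k yields consistent model-level sparsity.

    x: (..., d) input activations for one layer
    Q: (d, d) orthogonal rotation (Q @ Q.T == I), assumed given per layer
    k: number of coordinates kept per token
    """
    z = x @ Q                                  # rotate into a sparsity-friendly basis
    idx = z.abs().topk(k, dim=-1).indices      # Top-K selection by magnitude
    mask = torch.zeros_like(z).scatter_(-1, idx, 1.0)
    return (z * mask) @ Q.T                    # zero the rest, rotate back

# Toy usage with a random orthogonal Q from a QR decomposition (assumption:
# in the actual method the rotation is chosen layerwise, not at random).
d = 16
Q, _ = torch.linalg.qr(torch.randn(d, d))
x = torch.randn(2, 4, d)                       # (batch, tokens, hidden)
x_sparse = rotated_topk_sparsify(x, Q, k=int(0.6 * d))   # keep 60% -> ~40% sparsity
```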
Lay Summary: Large Language Models (LLMs) are powerful but resource-intensive, which slows down their inference. One way to make them more efficient is to reduce the amount of active computation, a technique called activation sparsity. However, current methods either require additional training, which is time-consuming, or rely on magnitude-based heuristics that lead to inconsistent sparsity and unpredictable speed-ups. Our new approach, LaRoSA, addresses these issues and improves the efficiency of LLMs without extra training or unstable pruning. We rotate the data within each layer of the model, making it easier to identify and keep only the most important parts for processing. This keeps the model fast and efficient while losing little accuracy. LaRoSA has been tested on LLMs of various sizes and types and maintains high performance while significantly speeding up inference. For example, when applied to LLaMA2-7B, LaRoSA delivers a consistent 1.30× wall-clock speed-up while keeping accuracy close to the original model, outperforming other existing methods. This makes it a promising solution for improving the efficiency of LLMs in real-world applications.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Large Language Models
Keywords: Sparse Activation, Large Language Models, Orthogonal Transformation, LLM Inference
Submission Number: 6134