Motivating Next-Gen Accelerators with Flexible $N{:}M$ Activation Sparsity via Benchmarking Lightweight Post-Training Sparsification Approaches

Published: 18 Apr 2026, Last Modified: 22 Apr 2026, ACL 2026 Industry Track Poster, CC BY 4.0
Keywords: LLM, sparsity, efficient models, inference
Abstract: The demand for efficient large language model inference has spurred interest in sparsification, yet current hardware support remains narrowly focused on 2:4 weight sparsity. In this work, we argue that activation sparsity, despite being overlooked in hardware design, offers a promising path for dynamic, input-adaptive compression with significant I/O and memory benefits. We present a comprehensive post-training study of $N{:}M$ activation pruning across four LLMs (Llama2-7B-chat, Llama3.1-8B-Instruct, Qwen2.5-7B-Instruct, Gemma3-4B-Instruct), demonstrating that activation pruning consistently outperforms weight pruning at matched sparsity levels. We evaluate lightweight, plug-and-play error mitigation and selection strategies that require minimal or no calibration data across four sparsity patterns: 2:4, 4:8, 8:16, and 16:32. Among these, 16:32 approaches the performance of unstructured 50\% sparsity and is approximately 2.7$\times$ better than 2:4, while 8:16 offers an optimal balance of accuracy and practicality. Our results provide evidence that next-generation accelerators should consider native support for $N{:}M$ activation sparsity and can serve as a strong baseline for future methods.
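To make the $N{:}M$ activation sparsity patterns discussed in the abstract concrete, below is a minimal sketch of magnitude-based $N{:}M$ pruning applied to an activation tensor, assuming a simple top-$N$-per-group-of-$M$ selection along the hidden dimension. The function name and selection rule are illustrative assumptions, not the paper's specific method or error-mitigation strategies.

```python
import torch

def nm_prune_activations(x: torch.Tensor, n: int = 8, m: int = 16) -> torch.Tensor:
    """Zero all but the n largest-magnitude values in each group of m
    contiguous elements along the last dimension (e.g., an 8:16 pattern)."""
    orig_shape = x.shape
    assert orig_shape[-1] % m == 0, "last dimension must be divisible by m"
    groups = x.reshape(-1, m)                          # (num_groups, m)
    # Indices of the top-n magnitudes within each group of m activations.
    topk = groups.abs().topk(n, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, topk, True)
    return (groups * mask).reshape(orig_shape)

# Example: apply 8:16 activation sparsity to a (batch, hidden) tensor.
acts = torch.randn(2, 64)
sparse_acts = nm_prune_activations(acts, n=8, m=16)
```

Note that, unlike weight pruning, the mask here is recomputed per input, which is what makes the scheme dynamic and input-adaptive.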
Submission Type: Emerging
Copyright Form: pdf
Submission Number: 47