Batch-wise Adaptive Pruning: Periodic Neuron Activation-Aware Weight Pruning for Language Reasoning Model

ACL ARR 2026 January Submission 8258 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large Language Models, Activation Sparsity, Pruning, Efficiency
Abstract: Large Reasoning Models (LRMs) achieve strong performance on complex tasks through extended chain-of-thought generation, but incur substantial computational costs during inference. In production settings, batched inference is essential for high throughput, yet existing adaptive pruning methods face two performance limitations. First, they rely on averaging activations across samples to determine shared pruning masks, which can miss neurons that are critical for individual samples. Second, after averaging, they select neurons to prune with a fixed threshold, which makes the sparsity ratio unstable across batches. In this work, we propose a training-free adaptive pruning method designed specifically for batched inference in LRMs. Our method aggregates activations across samples with max-pooling to better capture sample-specific important neurons. To stabilize the activation sparsity ratio, it applies periodic top-k selection over the aggregated activations instead of threshold-based selection. Furthermore, based on the observation that important neurons tend to be repeatedly activated, we incorporate an activation memory mechanism to capture periodically important neurons. Experiments on diverse reasoning benchmarks demonstrate that our method achieves a 39.7\% improvement over the previous state-of-the-art adaptive pruning method at batch size 4 with 50\% sparsity, along with a $1.32\times$ speedup over dense inference at batch size 1 and $1.14\times$ at batch size 4, comparable to existing pruning methods, demonstrating practical efficiency gains for deployment.
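The core masking step described in the abstract, max-pooling activation magnitudes across a batch and then keeping a fixed top-k fraction of neurons, can be sketched as follows. This is an illustrative reconstruction from the abstract alone, not the authors' implementation; the function name and array shapes are assumptions, and the periodic re-selection and activation memory components are omitted.

```python
import numpy as np

def batch_topk_mask(activations: np.ndarray, sparsity: float) -> np.ndarray:
    """Sketch of batch-wise mask selection (hypothetical helper, not the
    authors' code): max-pool per-neuron activation magnitudes across the
    batch, then keep the top-k neurons so the sparsity ratio stays fixed.

    activations: (batch_size, num_neurons) hidden activations for one layer
    sparsity:    fraction of neurons to prune, e.g. 0.5
    """
    # Max-pooling across samples preserves neurons that are strongly
    # activated by any single sample, unlike mean aggregation.
    agg = np.abs(activations).max(axis=0)

    # Top-k selection fixes the number of kept neurons, avoiding the
    # sparsity-ratio instability of threshold-based selection.
    k = int(agg.size * (1.0 - sparsity))
    keep = np.argsort(agg)[-k:]

    mask = np.zeros(agg.size, dtype=bool)
    mask[keep] = True
    return mask

# Toy batch of 2 samples over 4 neurons: neuron 0 matters only to sample 1,
# neuron 1 only to sample 0; max-pooling retains both at 50% sparsity.
acts = np.array([[0.1, 2.0, 0.2, 0.05],
                 [1.5, 0.1, 0.3, 0.02]])
mask = batch_topk_mask(acts, sparsity=0.5)  # → [True, True, False, False]
```

A mean-based mask over the same toy batch could rank a moderately-but-uniformly active neuron above one that is critical to a single sample, which is the failure mode the abstract attributes to averaging-based methods.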
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Large Language Models, Activation Sparsity, Pruning, Efficiency
Contribution Types: Model analysis & interpretability, Approaches for low-compute settings-efficiency
Languages Studied: English
Submission Number: 8258