CLAWS: Calibration-Aware Activation Sparsity for Instruction-Tuned LLMs

Published: 01 Jun 2026, Last Modified: 11 Jun 2026
Venue: AdaptFM Poster
License: CC BY 4.0
Keywords: Sparse neural networks, model pruning, knowledge distillation, efficient inference, transformer compression, activation sparsity, parameter-efficient learning, large language models
TL;DR: CLAWS makes activation sparsity practical by combining Fisher-calibrated routing with hardware-friendly row-sparse masks, improving Gemma-4-E2B-IT MMLU at 50% FFN density while delivering a real ARM MLP speedup.
Abstract: Modern LLMs contain gated MLPs whose per-token activations are highly skewed, which makes them a natural target for dynamic top-K sparsification. Practical sparsity needs both accurate routing and a mask that the inference kernel can execute efficiently. CATS (Lee et al., 2024) provides a hardware-friendly row-wise mask but routes on local gate information alone, while LaRoSA (Liu et al., 2025) improves routing by rotating the basis using calibration data but induces a column-sparse pattern that is poorly aligned with quantized row-major layouts. We introduce CLAWS, a gradient-saliency calibration method for gated-MLP sparsification that keeps the CATS row-sparse execution pattern and multiplies the runtime gate score by a static per-neuron saliency constant, so the top-K mask still maps directly to quantized kernels. We also explain why this multiplier reorders the native gate score but barely moves LaRoSA's rotated-input score, isolating where calibration helps. On Gemma-4-E2B-IT at 50% FFN density, CLAWS recovers dense-level Avg5 and surpasses matched-recipe CATS with LoRA by 6.8 pp on MMLU (49.6 vs. 42.8), and a custom ARM kernel delivers a $1.24\times$ wall-clock speedup on the full MLP block.
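The routing idea in the abstract (scale the runtime gate score by a static per-neuron constant, then keep the top-K neurons as whole rows) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the tanh-approximated GELU activation, the use of the absolute gate activation as the score, and the example saliency values are all assumptions made for illustration. The abstract does not specify how the saliency constants are computed, so they are passed in as given.

```python
import math

def gelu(z):
    # Placeholder activation (assumption); the abstract does not name one.
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))

def topk_neurons(gate_act, saliency, k):
    """Return sorted indices of the k neurons with the largest
    saliency-weighted gate magnitude |gate_act[j]| * saliency[j]."""
    scores = [abs(g) * s for g, s in zip(gate_act, saliency)]
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:k])

def sparse_gated_mlp(x, w_gate, w_up, w_down_t, saliency, k):
    """Gated MLP for one token that only touches the kept neurons.

    w_gate, w_up: one row per hidden neuron (hidden x d_model).
    w_down_t: transposed down projection, also one row per hidden neuron,
              so every skipped neuron is a skipped row in each matrix.
    saliency: static per-neuron constants (hypothetical values here).
    """
    # The gate projection is computed densely because it drives the routing.
    gate_act = [gelu(sum(w * xi for w, xi in zip(row, x))) for row in w_gate]
    keep = topk_neurons(gate_act, saliency, k)
    out = [0.0] * len(w_down_t[0])
    for j in keep:
        up_j = sum(w * xi for w, xi in zip(w_up[j], x))  # row j of w_up only
        h_j = gate_act[j] * up_j
        for i, w in enumerate(w_down_t[j]):              # row j of w_down_t only
            out[i] += h_j * w
    return out, keep

# With uniform saliency the mask follows the raw gate magnitude;
# a non-uniform saliency vector can reorder which neurons are kept.
gate_act = [0.9, 0.5, 0.1, 0.4]
uniform_keep = topk_neurons(gate_act, [1.0, 1.0, 1.0, 1.0], k=2)
calibrated_keep = topk_neurons(gate_act, [0.1, 1.0, 1.0, 3.0], k=2)
```

In this toy example the uniform mask keeps neurons 0 and 1, while the calibrated mask keeps neurons 1 and 3. Because the multiplier is a per-neuron constant, it changes only which rows are selected, not the row-wise access pattern of the computation that follows.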
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 213