Keywords: Sparse neural networks, model pruning, knowledge distillation, efficient inference, transformer compression, activation sparsity, parameter-efficient learning, large language models
TL;DR: CLAWS makes activation sparsity practical by combining Fisher-calibrated routing with hardware-friendly row-sparse masks, improving Gemma-4-E2B-IT MMLU by 6.8pp over matched-recipe CATS with LoRA at 50% FFN density while delivering a 1.24x wall-clock speedup on the full MLP block with a custom ARM kernel.
Abstract: Modern LLMs contain gated MLPs whose per-token activations are highly skewed, a natural axis for dynamic top-K
sparsification. Practical sparsity needs both accurate routing and a mask the inference kernel can execute efficiently.
CATS (Lee et al., 2024) gives the hardware-friendly row-wise mask but routes on local gate information alone, while
LaRoSA (Liu et al., 2025) improves routing by rotating the basis using calibration data but induces a column-sparse
pattern poorly aligned with quantized row-major layouts. We introduce CLAWS, a gradient-saliency calibration method for
gated-MLP sparsification that keeps the CATS row-sparse execution pattern and multiplies the runtime gate score by a
static per-neuron saliency constant, so the top-K mask still maps directly to quantized kernels. We also explain why
this multiplier reorders the native gate score but barely moves LaRoSA's rotated-input score, isolating where
calibration helps. On Gemma-4-E2B-IT at 50\% FFN density, CLAWS recovers dense-level Avg5 and surpasses matched-recipe
CATS with LoRA by 6.8pp on MMLU (49.6 vs. 42.8), and a custom ARM kernel delivers a $1.24\times$ wall-clock speedup on the full
MLP block.
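
The routing idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the SiLU activation, the use of the absolute gate activation as the runtime score, the rounding rule for K, and all names (`sparse_gated_mlp`, `saliency`, `density`) are assumptions made for illustration. It shows only the structural point that scaling the gate score by a static per-neuron constant changes which neurons enter the top-K set, while the resulting mask still selects whole neurons, that is, whole rows of the up projection and the matching slices of the down projection.

```python
import numpy as np

def silu(z):
    # Stand-in activation (assumption); the actual model may use a different one.
    return z / (1.0 + np.exp(-z))

def sparse_gated_mlp(x, w_gate, w_up, w_down, saliency, density=0.5):
    """Row-sparse gated MLP for a single token.

    x: (d_model,); w_gate, w_up: (d_ff, d_model); w_down: (d_model, d_ff);
    saliency: (d_ff,) static per-neuron constants (hypothetical values here).
    """
    gate = silu(w_gate @ x)                      # dense gate projection, shape (d_ff,)
    score = np.abs(gate) * saliency              # runtime gate score times static constant
    k = max(1, int(round(density * gate.shape[0])))
    keep = np.argpartition(score, -k)[-k:]       # indices of the k highest-scoring neurons
    up = w_up[keep] @ x                          # compute only the kept rows of w_up
    hidden = gate[keep] * up
    # Kept neurons correspond to columns of w_down (rows if it is stored transposed).
    return w_down[:, keep] @ hidden, keep

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
x = rng.standard_normal(d_model)
w_gate = rng.standard_normal((d_ff, d_model))
w_up = rng.standard_normal((d_ff, d_model))
w_down = rng.standard_normal((d_model, d_ff))
saliency = rng.uniform(0.5, 2.0, size=d_ff)      # placeholder for calibrated constants

y_sparse, kept = sparse_gated_mlp(x, w_gate, w_up, w_down, saliency, density=0.5)
print(y_sparse.shape, kept.shape)
```

Because `saliency` is fixed after calibration, the only extra runtime work in this sketch is one elementwise multiply before the top-K selection; how the constants are estimated from gradient information is outside its scope.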
Submission Number: 213