Compute Where It Counts: Adaptive Compute Allocation for Large Language Models via Learned Granular Sparsity

ACL ARR 2025 May Submission 2491 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Inference with Large Language Models (LLMs) requires massive amounts of computation. Sparsity-aware inference pipelines can alleviate this cost by reducing the number of parameters used in each forward pass. We introduce "granular sparsity", a novel method for reducing compute requirements. By decomposing matrix columns, the standard unit of sparsity, into smaller "stripes", we obtain a flexible form of conditional computation that is more expressive than existing sparsity strategies. We further introduce a method for learning and controlling sparsity, inspired by sparse autoencoders. Notably, our method allows the model to assign different levels of sparsity to different inputs and layers. We validate our methods by distilling 2-6× more compute-efficient sparse language models from Llama 3.2 1B. Interestingly, we find evidence that our model allocates more computation to answering questions that humans deem more difficult.
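To make the idea of stripe-level conditional computation concrete, below is a minimal, hypothetical PyTorch sketch. It splits each weight column of a linear layer into contiguous stripes and lets a small learned gate pick, per input, which stripes participate in the forward pass. The class name `StripedLinear`, the top-k gating rule, and the `keep_ratio` parameter are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: stripe-level (sub-column) conditional computation
# for a linear layer. Not the paper's actual method or training recipe.
import torch
import torch.nn as nn


class StripedLinear(nn.Module):
    def __init__(self, d_in, d_out, stripe_size, keep_ratio=0.25):
        super().__init__()
        assert d_out % stripe_size == 0
        self.n_seg = d_out // stripe_size        # stripes per weight column
        self.stripe_size = stripe_size
        self.keep_ratio = keep_ratio             # fraction of stripes kept (assumed fixed here)
        self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in ** 0.5)
        # One gating score per stripe (d_in columns, n_seg stripes each),
        # predicted from the input by a small learned router (assumption).
        self.gate = nn.Linear(d_in, d_in * self.n_seg)

    def forward(self, x):                        # x: (batch, d_in)
        scores = self.gate(x)                    # (batch, d_in * n_seg)
        k = max(1, int(self.keep_ratio * scores.shape[-1]))
        topk = scores.topk(k, dim=-1).indices
        mask = torch.zeros_like(scores).scatter_(-1, topk, 1.0)
        # Reshape to one decision per (stripe, column), then expand each
        # decision over the stripe_size rows it covers.
        mask = mask.view(-1, self.n_seg, x.shape[-1])
        mask = mask.repeat_interleave(self.stripe_size, dim=1)   # (batch, d_out, d_in)
        # Per-example masked matmul: only the kept stripes of each column contribute.
        return torch.einsum('boi,bi->bo', mask * self.weight, x)


x = torch.randn(4, 64)
layer = StripedLinear(d_in=64, d_out=128, stripe_size=16, keep_ratio=0.25)
print(layer(x).shape)  # torch.Size([4, 128])
```

Because each column is split into several independently gated stripes, the granularity of skipped computation is finer than column-level (neuron-level) sparsity; a real implementation would replace the dense per-example mask with a sparse kernel and, as the abstract notes, learn the sparsity level per input and per layer rather than fixing a keep ratio.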
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: pruning, distillation, parameter-efficient training, data-efficient training, LLM efficiency, hardness of samples, sparse models, efficient models, model architectures
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 2491