Compute Where It Counts: Adaptive Compute Allocation for Large Language Models via Learned Granular Sparsity

ACL ARR 2025 May Submission 2491 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Inference with Large Language Models (LLMs) requires massive amounts of computation. Sparsity-aware inference pipelines can alleviate this cost by reducing the number of parameters used in each forward pass. We introduce "granular sparsity", a novel method for reducing compute requirements. By decomposing matrix columns, the standard unit of sparsity, into smaller "stripes", we obtain a flexible form of conditional computation that is more expressive than existing sparsity strategies. We further introduce a method for learning and controlling sparsity, inspired by sparse autoencoders. Notably, our method allows the model to assign different levels of sparsity to different inputs and layers. We validate our methods by distilling 2-6× more compute-efficient sparse language models from Llama 3.2 1B. Interestingly, we find evidence that our model allocates more computation to answering questions that humans deem more difficult.
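To make the idea of stripe-level conditional computation concrete, below is a minimal, hypothetical PyTorch sketch. It splits each weight column of a linear layer into contiguous stripes and lets a small learned gate pick, per input, which stripes participate in the forward pass. The class name `StripedLinear`, the top-k gating rule, and the `keep_ratio` parameter are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: stripe-level (sub-column) conditional computation
# for a linear layer. Not the paper's actual method or training recipe.
import torch
import torch.nn as nn


class StripedLinear(nn.Module):
    def __init__(self, d_in, d_out, stripe_size, keep_ratio=0.25):
        super().__init__()
        assert d_out % stripe_size == 0
        self.n_seg = d_out // stripe_size        # stripes per weight column
        self.stripe_size = stripe_size
        self.keep_ratio = keep_ratio             # fraction of stripes kept (assumed fixed here)
        self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in ** 0.5)
        # One gating score per stripe (d_in columns, n_seg stripes each),
        # predicted from the input by a small learned router (assumption).
        self.gate = nn.Linear(d_in, d_in * self.n_seg)

    def forward(self, x):                        # x: (batch, d_in)
        scores = self.gate(x)                    # (batch, d_in * n_seg)
        k = max(1, int(self.keep_ratio * scores.shape[-1]))
        topk = scores.topk(k, dim=-1).indices
        mask = torch.zeros_like(scores).scatter_(-1, topk, 1.0)
        # Reshape to one decision per (stripe, column), then expand each
        # decision over the stripe_size rows it covers.
        mask = mask.view(-1, self.n_seg, x.shape[-1])
        mask = mask.repeat_interleave(self.stripe_size, dim=1)   # (batch, d_out, d_in)
        # Per-example masked matmul: only the kept stripes of each column contribute.
        return torch.einsum('boi,bi->bo', mask * self.weight, x)


x = torch.randn(4, 64)
layer = StripedLinear(d_in=64, d_out=128, stripe_size=16, keep_ratio=0.25)
print(layer(x).shape)  # torch.Size([4, 128])
```

Because each column is split into several independently gated stripes, the granularity of skipped computation is finer than column-level (neuron-level) sparsity; a real implementation would replace the dense per-example mask with a sparse kernel and, as the abstract notes, learn the sparsity level per input and per layer rather than fixing a keep ratio.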
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: pruning, distillation, parameter-efficient training, data-efficient training, LLM efficiency, hardness of samples, sparse models, efficient models, model architectures
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 2491