Compute Where It Counts: Adaptive Compute Allocation for Large Language Models via Learned Granular Sparsity
Keywords: contextual sparsity, learnable sparsity, granular sparsity, parameter-efficient training, LLM efficiency, hardness of samples, sparse models, efficient models
Abstract: Sparsity-aware inference can dramatically shrink computation requirements by reducing the number of parameters used in each forward pass. Existing methods tend to be heuristic (e.g., zeroing activations below fixed thresholds or retaining only the top-K activations); they do not optimize individual thresholds with gradient-based methods and suffer sharp performance degradation beyond 50% sparsity. This paper describes CWIC (Compute Where It Counts), a method that makes sparsity thresholds learnable and contextual. CWIC encourages conditional computation, allowing the model to assign different levels of sparsity to different inputs and layers. We also introduce "granular sparsity," which decomposes matrix columns into smaller "stripes" for more expressive sparsity patterns. Together, CWIC and granular sparsity enable distilling sparse models that are 2-6x more compute-efficient from Llama 3.2-1B and Llama 3.2-3B. Notably, CWIC models are found to allocate little compute to filler words or replicated text, and more compute to questions humans deem challenging.
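To make the abstract's two ideas concrete, here is a minimal, hypothetical PyTorch sketch of a gated MLP with learnable, contextual sparsity thresholds and stripe-level (granular) gating. The class and parameter names (StripedSparseMLP, n_stripes, the sigmoid surrogate gate) are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch, not the authors' code: contextual sparsity with learnable,
# per-stripe thresholds. Each output "stripe" of the down-projection can skip a
# different subset of hidden units, giving finer sparsity than whole-column gating.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StripedSparseMLP(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_stripes: int = 4):
        super().__init__()
        assert d_model % n_stripes == 0
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)
        self.n_stripes = n_stripes
        # One learnable threshold per (stripe, hidden unit), trained end to end.
        self.thresholds = nn.Parameter(torch.zeros(n_stripes, d_hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.silu(self.up(x))  # (batch, d_hidden)
        # Contextual gate: compare each activation's magnitude to its stripe's
        # learned threshold. A sigmoid surrogate keeps the gate differentiable;
        # at inference it could be hardened to 0/1 and gated stripes skipped.
        gate = torch.sigmoid((h.abs().unsqueeze(1) - self.thresholds) * 10.0)
        h_striped = h.unsqueeze(1) * gate  # (batch, n_stripes, d_hidden)

        # Split the down-projection's output rows into stripes and apply each
        # stripe's independently gated activations.
        w_stripes = self.down.weight.chunk(self.n_stripes, dim=0)
        outs = [h_striped[:, s] @ w.T for s, w in enumerate(w_stripes)]
        return torch.cat(outs, dim=-1)  # (batch, d_model)


# Toy usage: different inputs produce different gate patterns, so compute
# (after hardening the gates) varies with the input, as the abstract describes.
if __name__ == "__main__":
    mlp = StripedSparseMLP(d_model=64, d_hidden=256)
    y = mlp(torch.randn(2, 64))
    print(y.shape)  # torch.Size([2, 64])
```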
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 15621