Research Area: Compute efficient LMs, Inference algorithms for LMs
Keywords: efficient inference, sparsity, context-aware inference
TL;DR: We make LLM inference faster and more efficient by inducing activation sparsity; our prescription for doing so applies to many LLMs.
Abstract: The dramatic improvements in Large Language Models (LLMs) come at the cost of increased computational resources for inference. Recent studies mitigate the computational costs of LLMs by increasing their activation sparsity, but these methods incur significant performance degradation on downstream tasks.
In this work, we introduce a new framework for sparsifying the activations of LLMs and reducing inference costs, dubbed $\underline{C}$ontextually $\underline{A}$ware $\underline{T}$hresholding for $\underline{S}$parsity (CATS).
CATS is simple to implement and highly effective.
At the heart of our framework is a new non-linear activation function.
We demonstrate that CATS can be applied to various models, including Mistral-7B and Llama2-7B \& 13B, and outperforms existing sparsification techniques across multiple tasks.
More precisely, CATS-based models retain $\sim$99\% of their base models' downstream task performance at activation sparsity levels of 50\%, even without any fine-tuning.
Moreover, with fine-tuning that targets only 1\% of the parameters, CATS-based models not only converge faster but also achieve better task performance than competing techniques.
Finally, we develop a custom GPU kernel that translates the activation sparsity of CATS into real wall-clock speedups, improving the inference latency of token generation by $\sim$15\%. We release our code, experiments, and datasets at https://github.com/ScalingIntelligence/CATS.
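To make the abstract's idea concrete, below is a minimal sketch of a magnitude-thresholding activation of the kind CATS describes: activations whose magnitude falls below a cutoff are zeroed, and the cutoff is calibrated from an empirical activation distribution to hit a target sparsity level. The class name, calibration procedure, and use of SiLU gate outputs are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class MagnitudeThreshold(nn.Module):
    """Sketch of a contextually calibrated thresholding activation (assumed form).

    Entries with absolute value below `threshold` are zeroed, inducing
    activation sparsity; the threshold is picked from sample activations
    so that roughly `target_sparsity` of the entries are pruned.
    """

    def __init__(self, threshold: float = 0.0):
        super().__init__()
        self.threshold = threshold

    @torch.no_grad()
    def calibrate(self, activations: torch.Tensor, target_sparsity: float = 0.5):
        # Choose the cutoff as the target_sparsity-quantile of |activation|.
        self.threshold = torch.quantile(
            activations.abs().flatten().float(), target_sparsity
        ).item()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.where(x.abs() >= self.threshold, x, torch.zeros_like(x))


if __name__ == "__main__":
    # Hypothetical usage: calibrate on gate-activation samples, then threshold.
    act = nn.SiLU()
    cats_like = MagnitudeThreshold()
    sample = act(torch.randn(4, 16, 256))  # stand-in for gate-projection outputs
    cats_like.calibrate(sample, target_sparsity=0.5)
    sparse = cats_like(act(torch.randn(4, 16, 256)))
    print(f"observed sparsity: {(sparse == 0).float().mean():.2f}")
```

The wall-clock gains reported in the abstract additionally require a kernel that skips the zeroed columns/rows of the surrounding projections; the sketch above only illustrates how the sparsity itself could be induced.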
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 253