CATS: Context-Aware Thresholding for Sparsity in Large Language Models

Published: 10 Jul 2024, Last Modified: 26 Aug 2024 · COLM 2024 · CC BY 4.0
Research Area: Compute efficient LMs, Inference algorithms for LMs
Keywords: efficient inference, sparsity, context-aware inference
TL;DR: We make LLMs faster and more efficient by inducing sparsity; our prescription for doing so applies to many LLMs.
Abstract: The dramatic improvements in Large Language Models (LLMs) come at the cost of increased computational resources for inference. Recent studies ameliorate the computational costs of LLMs by increasing their activation sparsity, but suffer from significant performance degradation on downstream tasks. In this work, we introduce a new framework for sparsifying the activations of LLMs and reducing inference costs, dubbed $\underline{C}$ontextually $\underline{A}$ware $\underline{T}$hresholding for $\underline{S}$parsity (CATS). CATS is a relatively simple algorithm that is easy to implement and highly effective. At the heart of our framework is a new non-linear activation function. We demonstrate that CATS can be applied to various models, including Mistral-7B and Llama2-7B \& 13B, and outperforms existing sparsification techniques across multiple tasks. More precisely, CATS-based models retain $\sim$99\% of their base models' downstream task performance at activation sparsity levels of 50\%, even without any fine-tuning. Moreover, with fine-tuning that targets only 1\% of the parameters, CATS-based models not only converge faster but also achieve better task performance than competing techniques. Finally, we develop a custom GPU kernel that translates the activation sparsity of CATS into real wall-clock speedups, yielding a $\sim$15\% improvement in the wall-clock latency of token generation. We release our code, experiments, and datasets at https://github.com/ScalingIntelligence/CATS.
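For readers who want a concrete picture of the idea the abstract sketches, below is a minimal PyTorch illustration of a gated MLP (as used in Mistral/Llama-style models) whose gate activation is zeroed below a magnitude threshold. The module name, the placement of the threshold on the SiLU gate output, and the calibration procedure are assumptions made for illustration; they are not the authors' released implementation or kernel, which should be consulted at the repository above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ThresholdedGatedMLP(nn.Module):
    """Illustrative gated MLP with a magnitude threshold on the gate activation.

    Entries of SiLU(gate_proj(x)) whose magnitude falls below a calibrated
    threshold are zeroed; a sparse kernel could then skip the corresponding
    columns of up_proj and rows of down_proj. This dense reference only
    emulates the resulting sparsity pattern.
    """

    def __init__(self, hidden_size: int, intermediate_size: int, threshold: float):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        # Assumed calibration: pick `threshold` offline from the empirical
        # distribution of |SiLU(gate_proj(x))| so that a target fraction
        # of entries (e.g. 50%) is zeroed.
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.silu(self.gate_proj(x))
        # Zero low-magnitude gate entries; these positions need no further compute.
        gate = gate * (gate.abs() >= self.threshold)
        return self.down_proj(gate * self.up_proj(x))
```

In this sketch the thresholding is context-aware only in the sense that the mask depends on the current input's activations; the wall-clock gains reported in the abstract come from the custom GPU kernel, not from the dense emulation shown here.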
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 253