Abstract: Leveraging large language models (LLMs) to fuse heterogeneous knowledge is an exciting emerging field. However, with billions of parameters, these pretrained language models are prohibitively expensive at inference time. Token sparsification methods can proactively accelerate inference by selecting important features from the sequence, but they often require task-dependent retraining. To address this, we propose Bilevel Token prUniNg wiTh Infused kNowledGe (Bunting), an interpretable token pruning method that leverages task-level knowledge encoded in prefixes to guide token sparsification, eliminating the need for task-specific retraining. Bunting performs bilevel token sparsification: the inner loop learns a joint representation to perform the task, while the outer loop learns adaptive attention masks for sparse representations, pruning redundant tokens layer by layer without compromising the pretrained abilities of LLMs. Additionally, we introduce an innovative antiphrasis evaluation protocol to test model adaptivity on rhetorical relations. Furthermore, we demonstrate that precomputed prefixes can effectively guide token sparsification across knowledge-intensive tasks, preserving task-level knowledge to identify important tokens and reducing the finetuning burden. Experimental results show that our method achieves over a $0.3\times$ wall-clock speed-up with only $0.14\times$ the learnable parameters on knowledge-intensive tasks. Our findings suggest that token pruning can improve out-of-distribution detection, with sarcasm being more challenging to detect than immorality.
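To make the bilevel structure concrete, a generic sketch (not the paper's exact formulation) is given below; the symbols $\phi$ (mask parameters), $\theta$ (joint-representation parameters), $\mathcal{L}_{\text{out}}$ and $\mathcal{L}_{\text{in}}$ (outer and inner losses), $\Omega$ (sparsity regularizer), and $\lambda$ are assumed names, not notation taken from the paper:
$$\min_{\phi}\ \mathcal{L}_{\text{out}}\!\left(\theta^{*}(\phi),\,\phi\right) + \lambda\,\Omega(\phi) \quad \text{s.t.}\quad \theta^{*}(\phi) = \arg\min_{\theta}\ \mathcal{L}_{\text{in}}\!\left(\theta;\,\phi\right),$$
where the outer level learns the layer-wise attention masks $\phi$ that induce sparsity, and the inner level learns the joint representation $\theta$ that performs the task under the current masks.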