Attribution-Driven Adaptive Token Pruning for Transformers

Published: 18 Sept 2025, Last Modified: 29 Oct 2025
NeurIPS 2025 poster
License: CC BY 4.0
Keywords: Token Pruning, Attribution-Driven, Adaptive Inference, Integrated Gradients
Abstract: Transformers have been widely adopted in natural language processing, computer vision, and other domains owing to their strong performance across a variety of tasks. However, their computational cost is prohibitively high, particularly for long input sequences, which significantly increases both training and inference time. Although various token pruning methods have been proposed to reduce this burden, most overlook differences in sequence length and complexity across inputs, leading to suboptimal compression efficiency. In this paper, we propose AD-TP, an Attribution-Driven Adaptive Token Pruning method designed to retain only the most informative tokens. We analyze the use of accumulated attention values as a measure of token importance and find that attention values do not accurately reflect each token's actual contribution to text understanding. We also observe substantial variation in sequence length and complexity within a dataset. Based on these insights, we adopt Integrated Gradients to evaluate token importance and introduce a lightweight adaptive token retainer module that dynamically generates a pruning configuration for each input sequence. In addition, we incorporate both teacher supervision and self-supervised learning objectives to improve training efficiency, accuracy, and robustness. Experiments on GLUE, SQuAD, and 20News demonstrate that AD-TP outperforms state-of-the-art token pruning and model compression methods in both accuracy and computational efficiency. On GLUE, AD-TP reduces FLOPs by an average of 7.8× while improving performance by 0.6%.
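
For readers unfamiliar with attribution-based token scoring, the following is a minimal sketch of how per-token importance can be computed with Integrated Gradients. It is not the paper's implementation: the zero-embedding baseline, the Riemann approximation with a fixed number of steps, the HuggingFace-style `inputs_embeds`/`logits` interface, and the L2 collapse over the hidden dimension are all illustrative assumptions.

```python
# Minimal sketch: token-importance scoring via Integrated Gradients (IG).
# Assumptions (not from the paper): a PyTorch classifier that accepts input
# embeddings directly, a zero-embedding baseline, and a simple Riemann
# approximation of the IG path integral.
import torch

def token_importance_ig(model, embed, input_ids, attention_mask, target, steps=20):
    """Return one importance score per token via Integrated Gradients."""
    x = embed(input_ids).detach()             # (batch, seq_len, hidden)
    baseline = torch.zeros_like(x)            # zero-embedding baseline (assumed)
    total_grads = torch.zeros_like(x)

    for k in range(1, steps + 1):
        alpha = k / steps
        # Point on the straight-line path from the baseline to the input.
        interpolated = baseline + alpha * (x - baseline)
        interpolated.requires_grad_(True)
        logits = model(inputs_embeds=interpolated,
                       attention_mask=attention_mask).logits
        score = logits[:, target].sum()
        grads, = torch.autograd.grad(score, interpolated)
        total_grads += grads

    # IG attribution: (input - baseline) * average gradient along the path.
    attributions = (x - baseline) * total_grads / steps
    # Collapse the hidden dimension to one score per token (L2 norm here).
    return attributions.norm(dim=-1)          # (batch, seq_len)
```

In a pruning setting, such scores could rank tokens so that only the top-scoring ones are retained; how AD-TP turns these attributions into per-sequence pruning configurations is described in the paper itself.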
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 20198