FTP: A Fine-grained Token Pruner for Large Language Models via Token Routing

ACL ARR 2025 February Submission250 Authors

05 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: The substantial computational overhead of large language models (LLMs) often presents a major challenge for their deployment in industrial applications. Many works apply traditional compression approaches to speed up model inference; however, these methods typically incur additional training costs to restore performance by updating the LLM's weights. Pruning, in turn, often causes significant performance degradation relative to the original model when a specific level of acceleration is targeted. To address these issues, we propose a fine-grained token pruning approach for LLMs that uses a learnable router to adaptively identify less important tokens and skip them across model blocks, reducing computational cost during inference. To construct the router efficiently, we present a search-based sparsity scheduler for allocating pruning sparsity, and we train the router using four proposed low-dimensional factors as input together with three proposed losses. Furthermore, we introduce a one-pass learnable router designed for batch inference and enhanced acceleration. Extensive experiments across different benchmarks and LLMs demonstrate the superiority of our method. Our approach achieves state-of-the-art (SOTA) pruning results, surpassing existing pruning methods; for instance, it outperforms BlockPruner and ShortGPT by approximately 10 points in accuracy retention on both LLaMA2-7B and Qwen1.5-7B at comparable token sparsity levels.
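To make the routing idea in the abstract concrete, the sketch below is a minimal, hypothetical PyTorch illustration of a learnable token router: a small learned scorer over a few cheap per-token factors decides which tokens a block processes and which are copied through unchanged. The specific factors, the top-k keep rule, and all function names here are assumptions for illustration; the paper's actual four factors, three losses, sparsity scheduler, and one-pass batch-inference router are not reproduced.

```python
import torch
import torch.nn as nn


class TokenRouter(nn.Module):
    """Scores tokens with a tiny linear head over a few cheap per-token
    factors; low-scoring tokens are skipped by the block (illustrative only)."""

    def __init__(self, num_factors: int = 4):
        super().__init__()
        self.score = nn.Linear(num_factors, 1)

    def factors(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, hidden) -> (batch, seq, 4) cheap per-token statistics
        # (hypothetical choices, not the paper's factors).
        return torch.stack(
            [h.norm(dim=-1), h.mean(dim=-1), h.std(dim=-1), h.abs().max(dim=-1).values],
            dim=-1,
        )

    def forward(self, h: torch.Tensor, keep_ratio: float) -> torch.Tensor:
        # Returns a boolean keep-mask of shape (batch, seq).
        scores = self.score(self.factors(h)).squeeze(-1)  # (batch, seq)
        k = max(1, int(keep_ratio * scores.size(-1)))
        topk = scores.topk(k, dim=-1).indices
        mask = torch.zeros_like(scores, dtype=torch.bool)
        return mask.scatter(-1, topk, True)


def routed_block(block: nn.Module, router: TokenRouter,
                 h: torch.Tensor, keep_ratio: float = 0.7) -> torch.Tensor:
    """Apply `block` only to the tokens the router keeps; skipped tokens are
    copied through unchanged, saving the block's compute for them."""
    mask = router(h, keep_ratio)                   # (batch, seq)
    out = h.clone()
    out[mask] = block(h[mask])                     # process kept tokens only
    return out


# Toy usage with a per-token MLP sub-block (attention handling and router
# training via the paper's losses are omitted in this sketch).
hidden = 64
block = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
router = TokenRouter()
x = torch.randn(2, 16, hidden)
y = routed_block(block, router, x, keep_ratio=0.5)
```

Note that the hard top-k mask in this sketch is non-differentiable; training such a router end-to-end would require the kind of auxiliary losses the abstract mentions, which are beyond this illustration.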
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Large Language Models, Token Pruning, Model Optimization and Acceleration
Contribution Types: Approaches to low-resource settings, Approaches to low compute settings-efficiency
Languages Studied: English
Submission Number: 250