FTP: Efficient Prefilling for Long-Context LLM Inference via FFN Token Pruning

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Large language model, Inference acceleration, Token pruning, Long-context inference, Natural Language Processing
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various NLP tasks and have extended their capabilities to long-context scenarios. However, the increasing context length leads to longer inference time in both the prefilling and decoding stages. Existing token pruning methods primarily evict tokens to compress the KV cache and therefore accelerate only the decoding stage. Recent studies have extended token pruning to both stages, but they either yield only modest speedups during the prefilling stage or defer a portion of the computation to the decoding stage. Critically, these approaches prioritize the attention module and overlook the significant computation in the Feed-Forward Network (FFN) module. In this work, we focus on the prefilling stage and propose a novel token pruning method named FTP for long-context LLM inference. Our approach is based on the observation that the FFN module accounts for over 60\% of the inference time. FTP reduces this cost by pruning non-critical tokens before the FFN computation. The importance of each token, as well as the number of tokens to prune, is dynamically determined from the attention scores in each layer. Unlike previous token pruning methods, FTP preserves a substantial amount of the pruned tokens' information through the residual connection, thereby achieving a notable speedup with only a negligible drop in performance. Specifically, the Qwen2-7B-Instruct model with FTP achieves a 1.24$\times$ speedup in the prefilling stage with only a 1.30\% performance drop compared to the baseline model, and the speedup is further boosted to 1.39$\times$ on the Qwen1.5-32B-Chat model. Extensive experiments on long-context datasets across various tasks demonstrate the potential and effectiveness of FTP.
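The following is only a minimal PyTorch sketch of the pruning step as the abstract describes it, not the authors' implementation: the function name `ffn_token_pruning`, the policy `keep_ratio_fn`, and the choice of "attention received, averaged over heads and queries" as the importance score are all assumptions made for illustration.

```python
import torch

def ffn_token_pruning(hidden_states, attn_scores, ffn, keep_ratio_fn):
    """Sketch of pruning non-critical tokens before the FFN (assumed details).

    hidden_states: (batch, seq_len, dim) output of the attention block (post-residual)
    attn_scores:   (batch, heads, seq_len, seq_len) attention weights of the same layer
    ffn:           the layer's feed-forward module (layer norm omitted for brevity)
    keep_ratio_fn: hypothetical policy mapping attention statistics to a keep count
    """
    bsz, seq_len, dim = hidden_states.shape

    # Token importance: attention each token receives, averaged over heads and queries.
    importance = attn_scores.mean(dim=1).sum(dim=1)            # (batch, seq_len)

    # Dynamically decide how many tokens pass through the FFN in this layer.
    num_keep = keep_ratio_fn(importance, seq_len)

    # Gather the most important tokens; only these are fed to the FFN.
    keep_idx = importance.topk(num_keep, dim=-1).indices       # (batch, num_keep)
    keep_idx_exp = keep_idx.unsqueeze(-1).expand(-1, -1, dim)
    kept = hidden_states.gather(1, keep_idx_exp)                # (batch, num_keep, dim)

    # Pruned tokens keep their attention-block output unchanged via the residual path;
    # kept tokens additionally receive the FFN update.
    out = hidden_states.clone()
    out.scatter_(1, keep_idx_exp, kept + ffn(kept))
    return out

if __name__ == "__main__":
    # Toy example with arbitrary sizes (not the models evaluated in the paper).
    torch.manual_seed(0)
    dim, seq = 64, 16
    ffn = torch.nn.Sequential(torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(),
                              torch.nn.Linear(4 * dim, dim))
    h = torch.randn(1, seq, dim)
    attn = torch.softmax(torch.randn(1, 4, seq, seq), dim=-1)
    out = ffn_token_pruning(h, attn, ffn, keep_ratio_fn=lambda imp, n: max(1, int(0.7 * n)))
    print(out.shape)  # torch.Size([1, 16, 64])
```

The point of the sketch is the residual behavior: tokens skipped by the FFN are not dropped from the sequence, they simply carry their post-attention states forward, which is why pruning can save FFN compute with little information loss.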
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8789