CATP: Cross-Attention Token Pruning for Accuracy Preserved Multimodal Model Inference

Published: 01 Jan 2025 · Last Modified: 04 Nov 2025 · CAI 2025 · CC BY-SA 4.0
Abstract: In response to the growing interest in large multimodal models, we introduce Cross-Attention Token Pruning (CATP), an accuracy-preserving token pruning method. Our approach leverages the cross-attention layers in multimodal models, exemplified by BLIP-2, to extract information for determining token importance. CATP employs a refined voting strategy across model heads and layers. In evaluations, CATP achieves up to 12.1× higher accuracy than existing token pruning methods, addressing the trade-off between computational efficiency and model precision.
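The abstract describes ranking tokens by cross-attention and voting across heads and layers. Below is a minimal, hypothetical sketch of that idea; the function name, shapes, and the simple top-k voting rule are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def catp_importance(cross_attn, top_k):
    """Hypothetical sketch: rank image tokens by cross-attention votes.

    cross_attn: array of shape (layers, heads, queries, tokens) holding
    cross-attention probabilities from query tokens to image tokens.
    Each (layer, head) casts votes for its top-k attended tokens; the
    tokens with the most votes overall are kept (ties broken by index).
    """
    layers, heads, _, tokens = cross_attn.shape
    votes = np.zeros(tokens)
    for l in range(layers):
        for h in range(heads):
            # aggregate the attention mass each token receives from all queries
            score = cross_attn[l, h].sum(axis=0)
            # this (layer, head) votes for its top-k most-attended tokens
            for t in np.argsort(score)[-top_k:]:
                votes[t] += 1
    # keep the tokens with the most votes across all heads and layers
    return np.argsort(votes)[-top_k:]

# toy example: 2 layers, 2 heads, 4 query tokens, 8 image tokens
rng = np.random.default_rng(0)
attn = rng.random((2, 2, 4, 8))
attn /= attn.sum(axis=-1, keepdims=True)  # normalize to probabilities
kept = catp_importance(attn, top_k=3)
print(sorted(kept.tolist()))
```

Tokens not selected would be pruned before the expensive language-model stage, which is where the compute savings come from.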