BOLT: Fewer Tokens but More Performance Retention for Efficient Vision-Language Models Inference
Abstract: Vision-Language Models (VLMs) have achieved significant advances across various downstream tasks. However, as their performance improves, the increasing number of parameters results in slower prefilling speeds and longer inference times. Observing that most VLMs do not require a large number of image tokens for inference, we propose BOLT (Basis-Oriented Lightweight Token-Trimming), a training-free and cross-attention-free token compression method. Unlike existing approaches, BOLT addresses the challenge of insufficient visual cues in textual prompts by leveraging the internal data distribution of the tokens. We categorize tokens into three types: key tokens, proxy tokens, and remaining tokens. Then, by applying basis-space similarity, we merge and filter the remaining tokens with the proxy tokens to retain the most informative ones. To account for differences in VLM architectures and model sizes, we evaluate BOLT on LLaVA-Next-Llama3 and LLaVA-1.5 (7B and 13B). Our results show that BOLT achieves state-of-the-art performance: at a 90% token compression ratio, it yields a 3.3× increase in prefilling speed and a 1.5× improvement in inference speed, outperforming other methods.
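The abstract does not specify how tokens are scored or what "basis-space similarity" computes, so the following is only a rough illustrative sketch of the general key/proxy/remaining idea, not the BOLT algorithm itself. It makes three loudly labeled assumptions: token importance is approximated by the L2 norm, plain cosine similarity stands in for the basis-space similarity, and merging is a simple average (the filtering step is omitted). The function name `trim_tokens` and its parameters `num_key` and `num_proxy` are hypothetical.

```python
import math


def _norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def _cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = _norm(u), _norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def trim_tokens(tokens, num_key, num_proxy):
    """Reduce a list of token vectors to num_key + num_proxy vectors.

    ASSUMPTIONS (not taken from the paper):
      * importance is approximated by the L2 norm of each token;
      * cosine similarity is a stand-in for "basis-space similarity";
      * merging is an unweighted mean and no filtering is applied.
    """
    # Rank token indices from highest to lowest norm (ties keep input order).
    order = sorted(range(len(tokens)), key=lambda i: -_norm(tokens[i]))
    key_idx = order[:num_key]
    proxy_idx = order[num_key:num_key + num_proxy]
    rest_idx = order[num_key + num_proxy:]

    # Each proxy starts a group containing only itself.
    groups = {p: [tokens[p]] for p in proxy_idx}

    # Assign every remaining token to its most similar proxy.
    # With no proxies, remaining tokens are simply dropped.
    if proxy_idx:
        for r in rest_idx:
            best = max(proxy_idx, key=lambda p: _cosine(tokens[r], tokens[p]))
            groups[best].append(tokens[r])

    # Merge each group by averaging its vectors component-wise.
    merged = []
    for p in proxy_idx:
        group = groups[p]
        merged.append([sum(col) / len(group) for col in zip(*group)])

    # Key tokens are kept unchanged, followed by the merged proxy tokens.
    return [tokens[i] for i in key_idx] + merged


# Toy usage: five 2-D "tokens" reduced to one key token and two merged proxies.
toy = [[3.0, 0.0], [0.0, 2.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.0]]
reduced = trim_tokens(toy, num_key=1, num_proxy=2)
```

In this toy run the output length is fixed by `num_key + num_proxy`, so the compression ratio is set directly by those two hypothetical parameters; the actual method may choose token counts and similarity measures differently.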