Abstract: Large multimodal models excel in multimodal tasks but face significant computational challenges due to excessive visual tokens. Unlike token reduction methods that target token-level redundancy, we identify and study computation-level redundancy on vision tokens, which allows efficiency gains without information loss. Our key insight is that vision tokens from the pretrained vision encoder do not necessarily require all the heavy operations (e.g., self-attention, FFNs) in decoder-only LMMs and could be processed more lightly with proper designs. We design a series of experiments to discover and progressively squeeze out the vision-related computation redundancy. Based on our findings, we propose ProxyV, a novel approach that utilizes proxy vision tokens to alleviate the computational burden on original vision tokens. ProxyV enhances efficiency without compromising performance and can even yield notable performance gains in scenarios with more moderate efficiency improvements. Furthermore, the flexibility of ProxyV is demonstrated through its combination with token reduction methods to boost efficiency further.
Lay Summary: In this paper, we systematically study the computation-level redundancy on vision tokens in decoder-only LMMs and explore ways to progressively reduce it. We propose ProxyV, a novel design that introduces proxy tokens to carry out the heavy computations, effectively reducing computation while preserving performance. We extensively validate the effectiveness of ProxyV with different LLMs and show its flexibility by proposing a non-spatial variant that can be directly combined with token reduction methods.
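The sketch below illustrates the proxy-token idea described in the abstract in minimal PyTorch form. The module names, the average-pooling scheme for forming proxies, and the lightweight update MLP are illustrative assumptions for exposition only, not the actual ProxyV implementation; see the linked repository for the real code.

```python
# Minimal sketch (assumed design, not the authors' implementation): vision tokens skip
# the heavy decoder operations; a small set of proxy tokens joins the text tokens for
# self-attention + FFN, then lightly updates the original vision tokens.
import torch
import torch.nn as nn

class ProxyVLayerSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, proxy_stride: int = 4):
        super().__init__()
        # Heavy path: full self-attention + FFN, shared by proxy and text tokens.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Light path: a small MLP that updates each vision token from its proxy.
        self.light_update = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.proxy_stride = proxy_stride  # how many vision tokens share one proxy token

    def forward(self, vision: torch.Tensor, text: torch.Tensor):
        B, Nv, D = vision.shape
        assert Nv % self.proxy_stride == 0, "sketch assumes Nv divisible by proxy_stride"
        # 1) Summarize vision tokens into far fewer proxy tokens (here: average pooling).
        proxies = vision.view(B, Nv // self.proxy_stride, self.proxy_stride, D).mean(dim=2)
        # 2) Run the heavy operations only on [proxy tokens; text tokens].
        seq = torch.cat([proxies, text], dim=1)
        seq = seq + self.attn(seq, seq, seq, need_weights=False)[0]
        seq = seq + self.ffn(seq)
        proxies, text = seq[:, : proxies.shape[1]], seq[:, proxies.shape[1]:]
        # 3) Lightly update the original vision tokens from their contextualized proxies,
        #    so no vision tokens are dropped and no information is discarded.
        expanded = proxies.repeat_interleave(self.proxy_stride, dim=1)
        vision = vision + self.light_update(torch.cat([vision, expanded], dim=-1))
        return vision, text
```

In this sketch the quadratic attention cost and the FFN cost scale with the number of proxy plus text tokens rather than with the full vision-token count, which is where the computation-level savings come from while every original vision token is still retained and updated.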
Link To Code: https://github.com/penghao-wu/ProxyV
Primary Area: Deep Learning->Large Language Models
Keywords: Large Multimodal Model, Multimodal Large Language Model, Large Multimodal Model Acceleration
Submission Number: 2984