Keywords: video compression, Video Multimodal Large Language Models
Abstract: Recent Video Multimodal Large Language Models (VLLMs) have achieved significant progress in multimodal understanding. However, they suffer from high computational cost caused by the large number of video frames and the enormous number of visual tokens produced by video encoders. Conventional video processing often relies on uniform frame sampling, which, combined with the many tokens generated by visual encoders, leads to substantial redundancy in the visual information. To address this issue, we propose a joint pixel-token (P-T) compression strategy to reduce the computational burden. Specifically, for pixel-level compression, inter-frame similarity is evaluated by computing pixel-wise differences between consecutive frames; this enables the selection of more semantically informative frames for better video understanding. For token-level compression, redundancy in the visual information is reduced by measuring the cosine similarity of tokens at corresponding positions across frames, which allows redundant tokens to be eliminated. Our method is a plug-and-play module that can be easily integrated into different baselines. We conduct extensive experiments under both training-free and training settings and achieve significant improvements (notably, even after discarding 50\% of the visual tokens, our method yields a 0.9\% performance gain on the MVBench benchmark with the Qwen2.5-VL model), demonstrating the effectiveness of our approach.
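The abstract describes the two compression steps only at a high level; the snippet below is a minimal sketch of how they could be realized, not the authors' implementation. It assumes raw frames of shape (T, C, H, W) and per-frame visual tokens of shape (T, N, D); the function names and the keep_ratio and sim_threshold parameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def select_frames_by_pixel_diff(frames: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Pixel-level compression: keep the frames that change most w.r.t. their predecessor.

    frames: (T, C, H, W) video tensor.
    Returns sorted indices of retained frames (frame 0 is always kept).
    """
    if frames.shape[0] < 2:
        return torch.arange(frames.shape[0])
    # Mean absolute pixel-wise difference between consecutive frames.
    diffs = (frames[1:] - frames[:-1]).abs().mean(dim=(1, 2, 3))  # (T-1,)
    num_keep = max(1, int(keep_ratio * (frames.shape[0] - 1)))
    # Frames that differ strongly from the previous one are assumed to carry
    # more new semantic information.
    top = torch.topk(diffs, num_keep).indices + 1
    return torch.cat([torch.tensor([0]), torch.sort(top).values])


def prune_tokens_by_cross_frame_similarity(tokens: torch.Tensor,
                                            sim_threshold: float = 0.9) -> list:
    """Token-level compression: drop tokens nearly identical to the token at the
    same position in the previous frame.

    tokens: (T, N, D) visual tokens from the video encoder.
    Returns one index tensor per frame listing the tokens to keep.
    """
    kept = [torch.arange(tokens.shape[1])]  # keep all tokens of the first frame
    for t in range(1, tokens.shape[0]):
        # Cosine similarity of tokens at corresponding positions in consecutive frames.
        sim = F.cosine_similarity(tokens[t], tokens[t - 1], dim=-1)  # (N,)
        kept.append(torch.nonzero(sim < sim_threshold).squeeze(-1))  # discard redundant tokens
    return kept
```

In this sketch the two stages compose naturally: frame selection is applied to the raw pixels before encoding, and token pruning is applied to the encoder output of the retained frames, which matches the plug-and-play usage described in the abstract.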
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24220