TL;DR: We enhance vision-language models' understanding of long videos through hierarchical compression.
Abstract: Long-form video processing fundamentally challenges vision-language models (VLMs) due to the high computational costs of handling extended temporal sequences. Existing token pruning and feature merging methods often sacrifice critical temporal dependencies or dilute semantic information. We introduce differential distillation, a principled approach that systematically preserves task-relevant information while suppressing redundancy. Based on this principle, we develop ViLAMP, a hierarchical video-language model that processes hour-long videos at "mixed precision" through two key mechanisms: (1) differential keyframe selection that maximizes query relevance while maintaining temporal distinctiveness at the frame level and (2) differential feature merging that preserves query-salient features in non-keyframes at the patch level. Hence, ViLAMP retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Extensive experiments demonstrate ViLAMP's superior performance across five video understanding benchmarks, particularly on long-form content. Notably, ViLAMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU, achieving substantial computational efficiency while maintaining state-of-the-art performance.
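The differential keyframe selection described above balances query relevance against temporal redundancy. As a rough illustration only (the paper's exact objective and features may differ), this can be sketched as a greedy selection that scores each candidate frame by its query similarity minus its maximum similarity to frames already chosen; `alpha` is a hypothetical trade-off weight:

```python
import numpy as np

def select_keyframes(frame_feats, query_feat, k, alpha=0.5):
    """Greedy keyframe selection sketch: repeatedly pick the frame that
    maximizes query relevance while penalizing redundancy with frames
    already selected. Illustrative only, not ViLAMP's exact method."""
    # Normalize so dot products are cosine similarities.
    frame_feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    query_feat = query_feat / np.linalg.norm(query_feat)

    relevance = frame_feats @ query_feat  # query relevance per frame
    selected = []
    for _ in range(k):
        if selected:
            # Redundancy = max similarity to any already-selected frame.
            redundancy = (frame_feats @ frame_feats[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(frame_feats))
        score = alpha * relevance - (1 - alpha) * redundancy
        score[selected] = -np.inf  # never re-pick a frame
        selected.append(int(score.argmax()))
    return sorted(selected)
```

This mirrors maximal-marginal-relevance selection: the first pick is purely query-driven, and later picks trade relevance against distinctiveness from the existing keyframe set.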
Code and model are available at https://github.com/steven-ccq/ViLAMP.
Lay Summary: We developed ViLAMP, an AI model that can understand videos up to three hours long, all on a single GPU (NVIDIA A100). The idea is inspired by how humans watch videos: we pay close attention to important scenes and quickly skim through the rest.
ViLAMP does something similar using two techniques: (1) it identifies the most important moments in a video based on the task at hand, and (2) it summarizes less important parts without losing their meaning. With these techniques, ViLAMP not only reduces computing costs but also beats other models on five major video understanding benchmarks, making it a practical and accurate tool for analyzing long videos that strikes the right balance between detail and efficiency.
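The second technique, compressing less important frames while keeping their query-relevant content, can be sketched as collapsing a non-keyframe's patch tokens into one token, weighting each patch by its softmax-normalized similarity to the query. This is a minimal illustration under assumed cosine-style features, not the paper's exact formulation:

```python
import numpy as np

def merge_patches(patch_feats, query_feat):
    """Sketch of query-salient feature merging: reduce a non-keyframe's
    (n_patches, d) token matrix to a single (d,) token, weighting patches
    by their query salience. Illustrative only."""
    sims = patch_feats @ query_feat        # per-patch query salience
    weights = np.exp(sims - sims.max())    # stable softmax numerator
    weights /= weights.sum()               # softmax over patches
    return weights @ patch_feats           # weighted average token
```

Patches that align strongly with the query dominate the merged token, so the salient content of a skimmed frame survives even though its token count drops to one.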
Link To Code: https://github.com/steven-ccq/ViLAMP
Primary Area: Deep Learning->Foundation Models
Keywords: Vision Language Models, Video Understanding
Submission Number: 3118