Keywords: video understanding, multi-modal large language model, frame sampling
Abstract: Due to the limited context window of Multi-modal Large Language Models (MLLMs), processing an entire video is infeasible. As a result, most prevailing models sample a subset of frames to stand in for the full video as input.
However, existing sampling methods, from simple uniform sampling to more refined methods based on query-frame relevance, apply a single fixed strategy regardless of the input. They therefore fail to adapt to the diversity of queries: some require comprehension of the video in its entirety, while others focus on events within short temporal segments.
To address this limitation, we propose Hierarchical Adaptive Sampling (HAS), a two-stage frame sampling framework. In the first stage, Backbone Frame Construction, we apply a Determinantal Point Process (DPP) to sample frames that are both query-relevant and non-redundant. These selected frames form the backbone of the entire sampling set, providing the foundational structure for subsequent enrichment. In the second stage, Adaptive Contextual Enrichment, we analyze the temporal distribution of the backbone frames to infer the query type and adaptively allocate the remaining frame budget between Local and Global Context. The local context enriches the backbone with fine-grained temporal dynamics and short-range causal relations, whereas the global context provides a holistic view of the entire video to enhance broader contextual understanding.
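The two stages can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes a standard quality-diversity DPP kernel (L_ij = q_i S_ij q_j, with q the query-frame relevance and S the cosine similarity between frame features) selected by naive greedy MAP inference, and a purely hypothetical budget rule based on the temporal spread of the backbone indices. The function names `greedy_dpp` and `split_budget`, the kernel form, and the spread heuristic are all assumptions introduced for illustration.

```python
import numpy as np

def greedy_dpp(frame_feats, relevance, k):
    """Greedily pick up to k frames that approximately maximize det(L_S).

    Assumed kernel: L_ij = q_i * S_ij * q_j (relevance times cosine similarity).
    """
    frame_feats = np.asarray(frame_feats, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    # Normalize features so that sim holds cosine similarities.
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sim = feats @ feats.T
    kernel = relevance[:, None] * sim * relevance[None, :]

    selected = []
    for _ in range(min(k, len(relevance))):
        best, best_logdet = None, -np.inf
        for i in range(len(relevance)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            # A non-positive determinant means frame i is redundant
            # with the frames already selected.
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:  # every remaining frame is redundant
            break
        selected.append(best)
    return sorted(selected)

def split_budget(backbone, num_frames, remaining):
    """Hypothetical rule: a temporally concentrated backbone gets local context,
    a widely spread backbone gets global context. Returns (n_local, n_global)."""
    # Std of a uniform distribution over [0, num_frames] is num_frames / sqrt(12),
    # so spread is about 1.0 when the backbone covers the whole video evenly.
    spread = np.std(backbone) / (num_frames / np.sqrt(12))
    global_share = float(np.clip(spread, 0.0, 1.0))
    n_global = int(round(global_share * remaining))
    return remaining - n_global, n_global

# Frames 0 and 1 are duplicates, so the second pick skips frame 1.
backbone = greedy_dpp([[1, 0], [1, 0], [0, 1]], [1.0, 1.0, 0.9], k=2)
```

In this sketch, a backbone clustered in a short segment pushes the remaining budget toward local frames, while a backbone scattered across the video pushes it toward global frames. The greedy loop recomputes determinants from scratch, which is adequate for a few hundred candidate frames but would need an incremental update for longer videos.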
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10647