Keywords: video-language models, frame selection, long video question answering
TL;DR: We propose ClueVQA, a novel retrieval framework that enhances query-based frame retrieval for VideoQA by generating and integrating supplementary answer clues, leading to improved performance across long-form video benchmarks and various Video-LLMs.
Abstract: Video-language models have achieved notable success in understanding complex visual narratives and answering fine-grained questions about video content. However, the computational burden of processing long videos, coupled with the growing size of modern models, restricts most approaches to processing only a limited number of frames. A widely adopted strategy to address this limitation is query-based frame retrieval, where frames are selected based on their semantic similarity to the given query. While effective in many cases, such methods are primarily limited to surface-level relevance matching and can fail when faced with implicit, ambiguous, or reasoning-intensive queries, potentially overlooking critical evidence in the video. In this work, we introduce ClueVQA, a novel retrieval framework that improves upon standard query-based retrieval by generating supplementary answer clues and effectively utilizing them for frame selection. The answer clues are derived from the input query and a global scan of the video, and are used to produce a secondary scoring distribution over frames. This clue-based distribution is fused with the original query-based frame score distribution to yield a more informed frame selection. The final selected frames are passed to an off-the-shelf Video-LLM for answer generation. Extensive experiments on long-form VideoQA benchmarks, including MLVU, LongVideoBench, and VideoMME, show that our method considerably improves performance over a standard query-based retrieval method across different Video-LLMs.
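To make the pipeline concrete, the sketch below illustrates the clue-augmented frame selection described in the abstract. It assumes cosine-similarity scoring in a shared embedding space, a max-over-clues aggregation, and a convex-combination fusion with weight `alpha`; these choices, along with all function and parameter names, are illustrative assumptions rather than the paper's actual implementation.

```python
# Minimal sketch of clue-augmented frame selection (illustrative, not the
# paper's implementation). Assumes frames, the query, and the generated
# answer clues are already embedded into a shared space.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def select_frames(frame_embs: np.ndarray,  # (num_frames, dim)
                  query_emb: np.ndarray,   # (dim,)
                  clue_embs: np.ndarray,   # (num_clues, dim)
                  top_k: int = 8,
                  alpha: float = 0.5) -> np.ndarray:
    """Fuse query-based and clue-based frame scores; return top-k frame indices."""
    # Normalize so dot products are cosine similarities.
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    c = clue_embs / np.linalg.norm(clue_embs, axis=1, keepdims=True)

    # Primary distribution: similarity of each frame to the query.
    p_query = softmax(f @ q)
    # Secondary distribution: best similarity of each frame to any clue
    # (max-over-clues is one plausible aggregation, assumed here).
    p_clue = softmax((f @ c.T).max(axis=1))

    # Fuse the two distributions and keep the highest-scoring frames,
    # which would then be passed to an off-the-shelf Video-LLM.
    fused = alpha * p_query + (1.0 - alpha) * p_clue
    return np.argsort(fused)[::-1][:top_k]
```

A convex combination is only one way to realize the fusion step; a product of distributions or a learned gating weight would fit the same description in the abstract.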
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 22893