Critical Video-Language Understanding via Query-Guided Frame Selection and Visual-Query Transformation

12 Feb 2026 (modified: 20 Mar 2026) · Under review for TMLR · CC BY 4.0
Abstract: Language-model-based video understanding has advanced rapidly, driven in large part by the emergence of LLMs. Nevertheless, existing research has concentrated primarily on projection mechanisms that convert video features into tokens, an approach that is conceptually simplistic and performs poorly in practice. In this study, we propose VaQuitA, a new framework that more effectively unifies video representations and text inputs. At the data level, instead of sampling frames uniformly, we employ a CLIP-score-based scheme that selects frames more closely aligned with the query. At the feature level, we introduce a trainable Video Perceiver and a Visual-Query Transformer (VQ-Former), which together improve the interplay between the input question and the video features. In addition, we show that prepending a brief prompt, "Please be critical.", improves the LLM's ability to comprehend video content. Experiments across multiple benchmark datasets show that VaQuitA establishes a new state of the art for zero-shot video question answering, while also enabling high-quality multi-turn video-based dialogues with users. Code will be released.
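As a concrete illustration of the data-level step, the sketch below scores candidate frames against the question with CLIP and keeps the top-k, restoring temporal order. The abstract does not specify the exact selection rule, so the checkpoint name, the top-k rule, and the `select_frames` helper are assumptions for illustration, not the paper's implementation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the paper's CLIP variant is not stated in the abstract.
MODEL_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

@torch.no_grad()
def select_frames(frames: list[Image.Image], query: str, k: int = 8) -> list[Image.Image]:
    """Score each candidate frame against the text query with CLIP and
    keep the k highest-scoring frames, preserving temporal order."""
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    # logits_per_image has shape (num_frames, 1): one similarity per frame.
    sims = model(**inputs).logits_per_image.squeeze(-1)
    top = torch.topk(sims, k=min(k, len(frames))).indices
    keep = sorted(top.tolist())  # re-sort indices so frames stay in video order
    return [frames[i] for i in keep]
```

Re-sorting the selected indices keeps the chosen frames in their original temporal order, which matters because downstream video-language models generally assume chronologically ordered inputs.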
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Matthew_Walter1
Submission Number: 7477