Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering

ACL ARR 2025 May Submission 7079 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Video question answering benefits from the rich information available in videos, enabling a wide range of applications. However, the large volume of tokens generated from longer videos presents significant challenges to memory efficiency and model performance. To alleviate this issue, existing works compress video inputs, but they usually overlook the varying importance of static and dynamic information across different queries, leading to inefficient token usage within limited budgets. To tackle this, we propose a novel token selection strategy, \textsc{explore-then-select}, that adaptively adjusts the static and dynamic information retained based on question requirements. Our framework first explores different token allocations between key frames, which preserve spatial details, and delta frames, which capture temporal changes. It then employs a query-aware attention-based metric to select the optimal token combination without model updates. The proposed framework is plug-and-play and can be seamlessly integrated into diverse video-language models. Extensive experiments show that our method achieves significant performance improvements (up to 5.8\%) across various video question answering benchmarks. The code is accessible at the anonymous \href{https://anonymous.4open.science/r/Static\_Or\_Dynamic}{link}.
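For intuition, below is a minimal sketch of how such a training-free explore-then-select loop might look. The function name, the candidate allocation ratios, and the softmax cross-attention scoring metric are illustrative assumptions based on the abstract, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def explore_then_select(key_tokens, delta_tokens, query_emb, budget,
                        ratios=(0.25, 0.5, 0.75)):
    """Hypothetical sketch of an explore-then-select token selector.

    key_tokens:   (N_k, d) tokens from key frames (spatial detail)
    delta_tokens: (N_d, d) tokens from delta frames (temporal change)
    query_emb:    (q, d)   embedded question tokens
    budget:       total number of visual tokens to keep
    ratios:       assumed candidate fractions of the budget for key-frame tokens
    """
    best_score, best_tokens = float("-inf"), None
    for r in ratios:  # explore: try several static/dynamic allocations
        n_key = min(int(budget * r), key_tokens.size(0))
        n_delta = min(budget - n_key, delta_tokens.size(0))
        candidate = torch.cat([key_tokens[:n_key], delta_tokens[:n_delta]], dim=0)

        # select: an assumed query-aware score, here the mean of each query
        # token's strongest cross-attention weight over the candidate tokens
        attn = F.softmax(query_emb @ candidate.T / candidate.size(-1) ** 0.5, dim=-1)
        score = attn.max(dim=-1).values.mean().item()

        if score > best_score:  # keep the allocation the query attends to most
            best_score, best_tokens = score, candidate
    return best_tokens
```

Because the selection uses only a forward attention score and no parameter updates, a loop like this could in principle sit in front of any video-language model's visual token stream, which is consistent with the plug-and-play claim above.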
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, video processing, multimodality
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7079