End-to-End VideoQA with Frame Scoring Mechanisms and Adaptive Sampling

Published: 2025, Last Modified: 22 Jan 2026 | NLPCC (2) 2025 | CC BY-SA 4.0
Abstract: Video Question Answering (VideoQA) has emerged as a challenging frontier that requires intricate interaction between visual and textual modalities. Uniformly sampling frames or indiscriminately aggregating frame-level visual features often fails to capture the nuanced, question-relevant context that VideoQA needs. To mitigate these issues, we propose VidF4, a novel VideoQA framework equipped with a tailored frame selection strategy for effective and efficient VideoQA. We propose three frame-scoring mechanisms that consider both question relevance and inter-frame similarity to evaluate the importance of each frame for a given question about the video. Furthermore, we design a differentiable adaptive frame sampling mechanism that enables end-to-end training of the frame selector and the answer generator. Experimental results on three widely adopted benchmarks show that VidF4 consistently outperforms existing VideoQA methods.
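The abstract does not spell out the three scoring mechanisms or the sampling rule, so the following is only a minimal illustrative sketch of the general idea, not VidF4's method. It assumes frames and the question are already embedded as plain vectors, scores each frame as question relevance minus a redundancy penalty based on inter-frame similarity, and uses a temperature softmax as a smooth stand-in for hard frame selection. The names `score_frames`, `soft_selection_weights`, `redundancy_weight`, and `temperature` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_frames(frames, question, redundancy_weight=0.5):
    """Assumed scoring rule: relevance to the question minus mean similarity
    to the other frames, so near-duplicate frames are penalized."""
    scores = []
    for i, frame in enumerate(frames):
        relevance = cosine(frame, question)
        others = [cosine(frame, g) for j, g in enumerate(frames) if j != i]
        redundancy = sum(others) / len(others) if others else 0.0
        scores.append(relevance - redundancy_weight * redundancy)
    return scores


def soft_selection_weights(scores, temperature=0.1):
    """Temperature softmax over scores: a smooth relaxation of picking the
    best frame. Lower temperature moves the weights closer to a hard choice."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: three 2-D "frame embeddings" and one "question embedding".
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
question = [1.0, 0.0]
scores = score_frames(frames, question)
weights = soft_selection_weights(scores)
```

Because the weights are a smooth function of the scores, a gradient-based framework could in principle propagate the answer loss back into the scorer, which is the property the abstract attributes to its differentiable sampling mechanism. The actual relaxation used in the paper may differ.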