Abstract: Video question answering (Video-QA) has emerged as a core task in the vision-language domain, which requires models to understand a given video and answer textual questions about it. Compared to conventional image-language tasks, Video-QA is designed to improve models' capacity to memorize and integrate multi-frame temporal cues associated with the questions. While significant performance improvements have recently been witnessed on public benchmarks, in this work, we rethink whether these improvements truly stem from a better understanding of video temporal context, as expected. To this end, we build a strong single-frame baseline model trained with knowledge distillation. With this model, we surprisingly find that seeing only a single frame, without incorporating multi-frame or temporal information, is sufficient to achieve state-of-the-art (SOTA) performance on multiple mainstream benchmarks. This finding reveals the prevalence of single-frame bias in current benchmarks for the first time. Centered on this single-frame bias, we conduct an in-depth analysis of multiple popular benchmarks, which demonstrates that: (i) relying on merely one frame achieves performance comparable to SOTA temporal Video-QA models; (ii) simply ensembling the prediction scores of only three separate frames surpasses temporal SOTAs. Furthermore, we observe that most of the benchmarks are biased towards central segments, and even the latest benchmarks tailored for temporal reasoning still suffer from severe single-frame bias. In a case study, we find two key properties of low-bias instances: the question emphasizes temporal dependency and contextual understanding, and the associated video content exhibits significant variability in scenes, actions, or interactions. Through further analysis of compositional reasoning datasets, we find that constructing explicit object/event interactions from videos to fill well-designed temporal question templates can effectively reduce single-frame bias during annotation. We hope our analysis facilitates future efforts in the field toward mitigating static bias and highlighting temporal reasoning.
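To make finding (ii) concrete, the sketch below illustrates one plausible way to ensemble the prediction scores of a few separately processed frames with late fusion. The abstract does not specify the model interface, so `model(frame, question)` returning answer logits, uniform frame sampling, and mean-pooled softmax scores are all assumptions for illustration, not the paper's exact method.

```python
import torch
import torch.nn.functional as F


def ensemble_frame_predictions(model, frames, question, num_frames=3):
    """Average per-frame answer distributions from a single-frame Video-QA model.

    Assumptions (hypothetical interface, not from the paper):
      - `model(frame, question)` returns answer logits of shape [num_answers]
      - `frames` is a list of decoded frame tensors for one video clip
    """
    # Uniformly sample `num_frames` frames across the clip.
    indices = torch.linspace(0, len(frames) - 1, num_frames).long()

    probs = []
    for idx in indices:
        with torch.no_grad():
            logits = model(frames[idx], question)        # [num_answers]
            probs.append(F.softmax(logits, dim=-1))      # per-frame answer distribution

    # Late fusion: mean of the per-frame answer distributions.
    return torch.stack(probs).mean(dim=0)
```

Under this reading, each frame is scored independently (no temporal modeling), and only the resulting answer distributions are combined, which is what makes the reported gain a sign of static rather than temporal reasoning.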
External IDs: dblp:journals/tmm/LiangLHYZL25