Track: Long Paper Track (up to 9 pages)
Keywords: Video Large Language Models; Video Temporal Reasoning
TL;DR: We examine the consistency of Video-LLMs in understanding temporal moments within videos — a key factor for robust video-language comprehension — using various probes.
Abstract: Recent advancements in video large language models (Video-LLMs) have demonstrated capabilities for temporally grounding language queries and retrieving moments in videos. However, the robustness and trustworthiness of these capabilities have not been thoroughly verified. In this study, we explore the consistency of Video-LLMs in grasping temporal moments within videos — a critical indicator of robust and trustworthy video-language comprehension. Specifically, we devise probes in which Video-LLMs first predict temporal moments based on language queries, followed by verification questions assessing whether the predicted moments accurately reflect the queries. Our results show that current Video-LLMs respond unintuitively to such assessment: they often fail to provide consistent answers upon re-evaluation and can even perform near chance level. This reveals significant shortcomings in the current capabilities of Video-LLMs for reliable video temporal understanding, underscoring the need for further research and development in this area.
Supplementary Material: zip
Submission Number: 14