ReFineVQA: Iterative Refinement of Video Description via Feedback Generation for Video Question Answering
Abstract: Video question answering is a non-trivial task that demands joint understanding of visual content and linguistic questions as well as temporal reasoning across video frames. Recent agent-based approaches address this by conducting multi-step reasoning with large language models (LLMs) over frame-level captions generated by vision-language models, but they suffer from limited temporal coherence across frames. An alternative direction based on video language models (VideoLMs) directly captures temporal dynamics via video-level descriptions, but often lacks fine-grained visual cues due to a restricted number of input frames and a heavy dependence on input prompts. To tackle these challenges, we propose ReFineVQA, a training-free framework that can easily be plugged into existing VideoLMs to perform iterative, LLM-guided description refinement. Specifically, the VideoLM produces an initial description; LLM feedback then determines whether the description suffices for the question and guides further visual extraction, which in turn enhances the description quality while preserving temporal context. Plugged into state-of-the-art VideoLMs, ReFineVQA yields consistent gains across diverse benchmarks (NExT-QA, EgoSchema, VideoMME, ActivityNet, and StreamingBench), even with a small external LLM of 3.8B parameters.
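The refinement loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `videolm_describe` and `llm_feedback` are hypothetical stand-ins for a VideoLM and the external feedback LLM, and the stopping logic is an assumption.

```python
# Hypothetical sketch of the ReFineVQA loop: a VideoLM describes the
# video, an external LLM judges sufficiency and issues guidance, and
# the description is regenerated until it suffices (training-free).

def videolm_describe(video, question, guidance=None):
    # Placeholder for a VideoLM call: returns a video-level description,
    # optionally steered by LLM guidance toward extra visual cues.
    base = f"description of {video} for '{question}'"
    return base if guidance is None else base + f" [refined: {guidance}]"

def llm_feedback(description, question):
    # Placeholder for the feedback LLM: decides whether the description
    # suffices for the question and, if not, what to extract next.
    if "refined" in description:
        return True, None
    return False, "focus on the temporal ordering of key events"

def refine_vqa(video, question, max_iters=3):
    """Iteratively refine the video description until the external LLM
    deems it sufficient to answer the question."""
    description = videolm_describe(video, question)
    for _ in range(max_iters):
        sufficient, guidance = llm_feedback(description, question)
        if sufficient:
            break
        description = videolm_describe(video, question, guidance)
    return description
```

In practice the two placeholder functions would wrap actual model calls; the loop itself is the only fixed structure, which is what makes the framework pluggable into existing VideoLMs.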