Abstract: Human video comprehension demonstrates dynamic coordination between reasoning and visual attention, adaptively focusing on query-relevant details.
However, current long-form video question answering systems employ rigid pipelines that decouple reasoning from perception, leading to either information loss through premature visual abstraction or computational inefficiency through exhaustive processing.
The core limitation lies in the inability to adapt visual extraction to specific reasoning requirements—different queries demand fundamentally different visual evidence from the same video content.
In this work, we present CAVIA, a training-free framework that coordinates reasoning and perception for video understanding. Unlike conventional approaches, in which visual processing operates independently of reasoning, CAVIA forms a closed-loop system where reasoning continuously guides visual extraction based on identified information gaps.
CAVIA introduces three innovations: (1) hierarchical, reasoning-guided localization of query-relevant frames; (2) cross-modal semantic bridging for targeted information extraction; and (3) confidence-driven iterative synthesis.
CAVIA achieves state-of-the-art performance on challenging benchmarks: EgoSchema (65.7\%, +5.3\%), NExT-QA (76.1\%, +2.6\%), and IntentQA (73.8\%, +6.9\%), demonstrating that dynamic reasoning-perception coordination provides a scalable paradigm for video understanding.
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond; Question Answering
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: cited all datasets (EgoSchema, NExT-QA, IntentQA) and models (Qwen2.5-VL, GPT-4, etc.) - Section 3.1 and throughout
B2 Discuss The License For Artifacts: No
B2 Elaboration: All datasets used are publicly available for research purposes.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: using datasets for VideoQA evaluation
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 3.1 provides dataset descriptions
B6 Statistics For Data: Yes
B6 Elaboration: Section 3.1 reports dataset statistics (e.g., 250+ hours of video and 5,000 questions for EgoSchema).
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: We report the models used, including Qwen2.5-VL-7B and GPT models.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: We report the hyperparameters used in our experiments.
C3 Descriptive Statistics: Yes
C3 Elaboration: We report performance metrics and improvements over baselines in Section 3.2.
C4 Parameters For Packages: Yes
C4 Elaboration: We mention the use of Qwen2.5-VL.
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We used AI assistants solely for language polishing and grammatical improvements of our original content. According to ACL's Policy on Publication Ethics (updated June 2024), AI tools used purely for language assistance - including paraphrasing or polishing the author's original content - are treated equivalently to grammar checkers and spell checkers, which do not require disclosure. Our use falls strictly within this category as we only employed AI for improving clarity and correcting language errors in text we had already conceived and written, without any AI involvement in content generation, idea development, or research methodology.
Author Submission Checklist: Yes
Submission Number: 330