Keywords: Large Vision-Language Models, Egocentric, Exocentric, Visual Question Answering
TL;DR: We introduce E3VQA, a multi-view QA benchmark, and M3CoT, a prompting method that improves LVLM reasoning by combining egocentric and exocentric views.
Abstract: Large vision-language models (LVLMs) are increasingly deployed in interactive applications such as virtual and augmented reality, where a first-person (egocentric) view captured by head-mounted cameras serves as a key input.
While this view offers fine-grained cues about user attention and hand-object interactions, its narrow field of view and lack of global context often lead to failures on spatially or contextually demanding queries.
To address this, we introduce a framework that augments egocentric inputs with third-person (exocentric) views, providing complementary information such as global scene layout and object visibility to LVLMs.
We present E3VQA, the first benchmark for multi-view question answering with 4K high-quality question-answer pairs grounded in synchronized ego-exo image pairs.
Additionally, we propose M3CoT, a training-free prompting technique that constructs a unified scene representation by integrating scene graphs from three complementary perspectives.
M3CoT enables LVLMs to reason more effectively across views, yielding consistent performance gains (4.84% for GPT-4o and 5.94% for Gemini 2.0 Flash) over a recent CoT baseline.
Our extensive evaluation reveals key strengths and limitations of LVLMs in multi-view reasoning and highlights the value of leveraging both egocentric and exocentric inputs.
The dataset and source code are available at [https://github.com/Leeinsu1/Towards-Comprehensive-Scene-Understanding](https://github.com/Leeinsu1/Towards-Comprehensive-Scene-Understanding).
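To make the core idea concrete, the following is a minimal, hypothetical sketch of multi-view scene-graph prompting: per-view (subject, relation, object) triples are merged into a unified textual scene description that is prepended to the question before querying an LVLM. This is an illustration of the general approach described in the abstract, not the authors' M3CoT implementation; the helper names, the prompt template, and the use of only two example views are assumptions (the paper integrates scene graphs from three complementary perspectives).

```python
# Hypothetical sketch of multi-view scene-graph prompting.
# NOT the authors' M3CoT code; names and prompt wording are illustrative assumptions.

from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def merge_scene_graphs(view_graphs: Dict[str, List[Triple]]) -> List[str]:
    """Deduplicate triples across views and record which views support each one."""
    support: Dict[Triple, set] = {}
    for view, triples in view_graphs.items():
        for triple in triples:
            support.setdefault(triple, set()).add(view)
    # Render each unique triple once, annotated with its supporting views.
    return [
        f"{s} {r} {o} (seen in: {', '.join(sorted(views))})"
        for (s, r, o), views in support.items()
    ]

def build_prompt(view_graphs: Dict[str, List[Triple]], question: str) -> str:
    """Build a unified scene description plus the question as a single text prompt."""
    scene = "\n".join(f"- {line}" for line in merge_scene_graphs(view_graphs))
    return (
        "Unified scene description from egocentric and exocentric views:\n"
        f"{scene}\n\n"
        "Reason step by step across the views, then answer the question.\n"
        f"Question: {question}"
    )

if __name__ == "__main__":
    graphs = {
        "egocentric": [("hand", "holding", "mug"), ("mug", "on", "table")],
        "exocentric": [("mug", "on", "table"), ("person", "standing near", "counter")],
    }
    print(build_prompt(graphs, "Where is the mug relative to the counter?"))
```

The resulting prompt string would then be sent, together with the synchronized ego-exo images, to an LVLM such as GPT-4o or Gemini 2.0 Flash; the merging step is training-free, which matches the prompting-only nature of the proposed method.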
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 5305