EgoQuestions: Crafting Egocentric Questions for Egocentric Video Question Answering

06 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Egocentric Vision
Abstract: A thorough understanding of models’ egocentric capabilities is crucial for robotics, autonomous driving, smart glasses, and similar applications. Egocentric VideoQA aims to assess how well models understand first-person videos, but existing benchmarks often include questions that do not reliably probe recorder-centric reasoning. Training and evaluating models on such data can obscure their true capabilities and diminish the value of curated egocentric datasets. To address this, we define egocentric questions through three principles: a question should focus on the video recorder and their activities; it must avoid shortcut cues that allow answers via generic scene or action recognition (e.g., naming an action together with its object); and while intentions and attributes can act as shortcuts to actions and objects, those that require understanding the recorder’s perspective do not. Guided by these principles, we build a checking pipeline that filters existing QA pairs and a crafting pipeline that generates valid egocentric questions. We release EgoQuestions, a benchmark of 2,500 curated egocentric QA instances created with our pipelines, and evaluate several proprietary and open-source VLMs. The results reveal substantial room for improvement in current models’ egocentric capabilities and a clear performance gap (about 10%) between questions that adhere to our principles and flawed alternatives, indicating that existing egocentric benchmarks tend to overstate models’ first-person capabilities and underscoring the need for rigorously designed benchmarks that assess first-person vision accurately.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2515