EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models

Sijie Cheng

EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models

Sijie Cheng

Published: 02 Aug 2024, Last Modified: 12 Nov 2024WiNLP 2024EveryoneRevisionsBibTeXCC BY 4.0

Keywords: Vision-Language Model, First-person perspective

Abstract: Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. The capability of VLMs to ``think'' from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed using selected clips from egocentric videos, with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs, we evaluate twenty-one popular VLMs on EgoThink. Moreover, given the open-ended format of the answers, we use GPT-4 as the automatic judge to compute single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions, all evaluated VLMs still possess considerable potential for improvement in first-person perspective tasks.

Submission Number: 3

Loading