Keywords: Gaze Following, Vision-Language Model
TL;DR: We present GazeVQA, the first large-scale text-image dataset for VLM-based gaze following.
Abstract: Gaze following aims to infer human intention within scene images. Conventional methods typically rely on scene and face images to regress gaze point coordinates, which is unnatural and restrictive. Recently, vision-language models (VLMs) have attracted significant attention for their powerful reasoning abilities, raising an important question: can VLMs be leveraged to advance gaze following? In this work, we introduce GazeVQA, the first large-scale text-image dataset for VLM-based gaze following. GazeVQA is the first to provide accurate textual annotations for both observers and gaze targets, along with natural language question-answering (QA) pairs tailored for the gaze following task. The dataset contains 410K QA pairs across 102K scene images, offering rich supervision for training and evaluating VLMs. Building on GazeVQA, we establish the first benchmark for VLM-based gaze following. Experiments demonstrate that existing VLMs exhibit limited zero-shot performance on gaze following. However, with training on our dataset, their performance improves significantly, demonstrating the potential of GazeVQA to drive progress in this area. We will release the dataset and code to facilitate future research.
Primary Area: datasets and benchmarks
Submission Number: 25219