Keywords: Gaze Following, Vision-Language Model
TL;DR: We present GazeVQA, the first large-scale text-image dataset for VLM-based gaze following.
Abstract: Gaze following aims to infer human intention within scene images. Conventional methods typically rely on scene and face images to regress gaze point coordinates, which is unnatural and restrictive. Recently, vision-language models (VLMs) have attracted significant attention for their powerful reasoning abilities, raising an important question: can VLMs be leveraged to advance gaze following? In this work, we introduce GazeVQA, the first large-scale text-image dataset for VLM-based gaze following. GazeVQA is the first to provide accurate textual annotations for both observers and gaze targets, along with natural language question-answering (QA) pairs tailored for the gaze following task. The dataset contains 410K QA pairs across 102K scene images, offering rich supervision for training and evaluating VLMs. Building on GazeVQA, we establish the first benchmark for VLM-based gaze following. Experiments demonstrate that existing VLMs exhibit limited zero-shot performance on gaze following. However, with training on our dataset, their performance improves significantly, demonstrating the potential of GazeVQA to drive progress in this area. We will release the dataset and code to facilitate future research.
Primary Area: datasets and benchmarks
Submission Number: 25219