Keywords: gaze following, segmentation, gaze target prediction
Abstract: Following the gaze of other people and analyzing the targets they are looking at can help us understand what they are thinking and doing, and predict the actions that may follow. Existing methods for gaze following primarily focus on gaze points or heatmaps rather than objects, making it difficult to deliver clear semantics and an accurate scope for gaze-at targets. To address this shortcoming, we propose a novel gaze target prediction method named PixelGaze that effectively leverages the spatial visual field (FoV) of the person as guidance, enabling a progressive coarse-to-fine process for gaze target segmentation and recognition. Specifically, a prompt-based visual foundation model serves as the encoder, working in conjunction with three distinct decoding modules (i.e., FoV perception, heatmap generation, and segmentation) to form the framework for gaze target prediction. Then, with the head bounding box serving as the initial prompt, PixelGaze obtains the FoV map, heatmap, and segmentation map progressively, leading to a unified framework for multiple tasks (e.g., gaze direction estimation, gaze target segmentation, and recognition). In addition, to facilitate this research, we construct and release a new dataset built upon the GazeFollow dataset, comprising 72k images with pixel-level annotations and 270 categories of gaze targets. Quantitative evaluation shows that our approach achieves an mIoU of 34.9% for gaze target segmentation and 45.1% accuracy for recognition. Meanwhile, our approach also achieves state-of-the-art performance on the gaze-following task.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4261