TransGOP-R: Transformer-based Real-World Gaze Object Prediction

Guangyu Guo, Chenxi Guo, Zhaozhong Wang, Binglu Wang

Published: 01 Jan 2025, Last Modified: 08 Dec 2025. Venue: IEEE Transactions on Multimedia. License: CC BY-SA 4.0
Abstract: The goal of gaze object prediction (GOP) is to predict the objects a person is gazing at, together with their categories. However, existing methods either require additional head priors or filter their results before evaluation, which hinders real-world application. To this end, this paper proposes Transformer-based Gaze Object Prediction under a Real-world setting (TransGOP-R), which does not rely on any head prior as input and is evaluated end to end. We first design a head location module that generates human head location information from a head query. An error analysis then shows that the primary error source of existing GOP models is gaze estimation, owing to the difficulty of predicting gaze points by directly regressing heatmaps. We therefore introduce cone prediction into the model training stage, allowing the middle-layer features of the gaze regressor to build the relationship between the target human and objects before regressing the gaze point. An oriented gradient mechanism is proposed in this process to ensure that object detection performance is not affected by the cone information. Finally, extensive experiments on the GOO-Synth and GOO-Real datasets verify the superiority of our method. Our method also compares favorably with human-target gaze estimation methods on the GazeFollowing, VideoAttentionTarget, and ChildPlay datasets.
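To make the notion of a gaze cone concrete, the following is a minimal geometric sketch, not the TransGOP-R implementation. It assumes a 2D image plane, a known head position and gaze direction, and a fixed half-angle; the function names `in_gaze_cone` and `objects_in_cone`, the 30-degree default, and the use of box centres are all illustrative assumptions that do not come from the paper.

```python
import math


def in_gaze_cone(head, direction, point, half_angle_deg=30.0):
    """Return True if `point` lies inside the cone that starts at `head`,
    points along `direction`, and opens by `half_angle_deg` on each side.
    All names and the default angle are hypothetical, for illustration only."""
    dx, dy = point[0] - head[0], point[1] - head[1]
    dist = math.hypot(dx, dy)
    norm = math.hypot(direction[0], direction[1])
    if dist == 0.0 or norm == 0.0:
        # A degenerate point or direction has no defined angle.
        return False
    # Cosine of the angle between the gaze direction and the head-to-point vector.
    cos_angle = (dx * direction[0] + dy * direction[1]) / (dist * norm)
    return cos_angle >= math.cos(math.radians(half_angle_deg))


def objects_in_cone(head, direction, boxes, half_angle_deg=30.0):
    """Return the indices of boxes (x1, y1, x2, y2) whose centre is inside the cone."""
    kept = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        centre = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
        if in_gaze_cone(head, direction, centre, half_angle_deg):
            kept.append(i)
    return kept
```

In this sketch the cone only narrows the set of candidate objects lying along the gaze direction. According to the abstract, the model itself predicts the cone during training and uses it to shape the intermediate features of the gaze regressor, which this hand-written filter does not attempt to reproduce.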