Abstract: Predicting where people look is crucial for understanding human intentions. Gaze prediction, as a research hotspot, has evolved from estimating the gaze of a single person to simultaneously predicting the positions of all individuals in a scene and their corresponding gaze targets. However, the correlation between human detection and gaze prediction as two interdependent tasks has largely been neglected. In this paper, inspired by the concept of “mutualistic symbiosis” in ecology, we propose a novel multitask mutualistic Transformer (MMTR). MMTR captures paired dependencies by establishing information communication between different branches, thereby enabling comprehensive and interpretable gaze analysis for all individuals and gaze targets. Specifically, we first utilize a Transformer encoder to capture features common to all tasks. We then design a mutualistic attention mechanism (MAM) in the dual-branch Transformer decoder to establish cross-task information interaction. The MAM learns privileged information from other tasks that is helpful for the current task, thereby guiding the current branch to learn the most valuable and distinctive features. To the best of our knowledge, this is the first time privileged information has been introduced into the gaze estimation task. Furthermore, to more flexibly learn pixel locality and long-range semantic dependencies for different tasks, we construct a learnable global-local position encoding (GLPE) and embed it in the different branches of MMTR. Experiments demonstrate that the proposed MMTR guides the two branches to communicate through privileged information, effectively alleviates the information asymmetry between human detection and gaze prediction, and significantly outperforms state-of-the-art gaze prediction methods on two standard benchmark datasets, GazeFollowing and VideoAttentionTarget.
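The abstract describes a shared encoder followed by two task branches that exchange information through mutual cross-attention. The sketch below is a minimal illustration of that idea, not the authors' implementation: module names (MutualisticAttention, DualBranchDecoderLayer), the query/memory dimensions, and the use of standard PyTorch attention layers are all assumptions made for clarity.

```python
# Minimal sketch (assumed design, not the paper's code) of a dual-branch decoder
# layer in which a human-detection branch and a gaze-prediction branch exchange
# "privileged" information via mutual cross-attention.
import torch
import torch.nn as nn


class MutualisticAttention(nn.Module):
    """One branch attends to the partner branch's features (illustrative only)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, own: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # Queries come from the current branch; keys/values from the other branch.
        attended, _ = self.cross_attn(query=own, key=other, value=other)
        return self.norm(own + attended)


class DualBranchDecoderLayer(nn.Module):
    """Decoder layer with a human-detection branch and a gaze-prediction branch."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.human_decoder = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.gaze_decoder = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.human_from_gaze = MutualisticAttention(dim, num_heads)
        self.gaze_from_human = MutualisticAttention(dim, num_heads)

    def forward(self, human_q, gaze_q, memory):
        # Each branch first decodes against the shared encoder memory...
        human = self.human_decoder(human_q, memory)
        gaze = self.gaze_decoder(gaze_q, memory)
        # ...then exchanges information with the partner branch.
        human_out = self.human_from_gaze(human, gaze)
        gaze_out = self.gaze_from_human(gaze, human)
        return human_out, gaze_out


if __name__ == "__main__":
    dim, n_queries, n_tokens = 256, 20, 196
    layer = DualBranchDecoderLayer(dim)
    human_q = torch.randn(2, n_queries, dim)  # person queries
    gaze_q = torch.randn(2, n_queries, dim)   # gaze-target queries
    memory = torch.randn(2, n_tokens, dim)    # shared encoder features
    h, g = layer(human_q, gaze_q, memory)
    print(h.shape, g.shape)                   # both: torch.Size([2, 20, 256])
```

In this reading, the mutual cross-attention step is where each branch can pick up cues learned by the other (e.g., the gaze branch benefiting from person locations and vice versa); how the paper actually fuses the two branches may differ.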
External IDs: dblp:journals/tmm/ChenCWZYDHGZF25