Leveraging Multi-Modal Saliency and Fusion for Gaze Target Detection

Athul Mathew; Arshad Khan; Thariq Khalid; Faroq AL-Tam; Riad Souissi

Published: 27 Oct 2023, Last Modified: 18 Nov 2023

Venue: Gaze Meets ML 2023 Poster

Submission Type: Full Paper

Keywords: gaze target detection, gaze-following, 3D gaze, free-viewing, saliency, depth map, 3D projection, point cloud, multi-modal, fusion

Abstract: Gaze target detection (GTD) is the task of predicting where a person in an image is looking. The task is challenging because it requires understanding the relationship between the person's head, body, and eyes, as well as the surrounding environment. In this paper, we propose a novel method for GTD that fuses multiple pieces of information extracted from an image. First, we project the 2D image into a 3D representation using monocular depth estimation. We then extract a depth-infused saliency module map, which highlights the most salient ($\textit{attention-grabbing}$) regions in the image for the subject under consideration. We also extract face and depth modalities from the image, and finally fuse all the extracted modalities to identify the gaze target. We quantitatively evaluated our method, including an ablation analysis, on three publicly available datasets, namely VideoAttentionTarget, GazeFollow, and GOO-Real, and showed that it outperforms other state-of-the-art methods. This suggests that our method is a promising approach for GTD.
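
The first step of the abstract, projecting the 2D image into a 3D representation from an estimated depth map, can be illustrated with a standard pinhole back-projection. This is a minimal sketch and not the authors' implementation: the function name `depth_to_point_cloud`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the constant toy depth map are assumptions introduced here for illustration; in practice the depth values would come from a monocular depth estimator.

```python
import numpy as np


def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (H, W, 3) array of 3D points.

    Hypothetical helper: assumes a pinhole camera with focal lengths (fx, fy)
    and principal point (cx, cy), all expressed in pixels.
    """
    h, w = depth.shape
    # u holds the column index and v the row index of every pixel.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)


# Toy input: a flat surface 2 units away from the camera (assumed values).
depth = np.full((4, 6), 2.0)
points = depth_to_point_cloud(depth, fx=100.0, fy=100.0, cx=3.0, cy=2.0)
```

Each pixel keeps its depth as the z coordinate, while x and y grow with the pixel's offset from the principal point, so the pixel at the principal point lands on the optical axis.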

Supplementary Material: zip

Submission Number: 15
