Abstract: Gaze target detection aims at determining the image location where a person is looking.
While existing studies have made significant progress in this area by regressing accurate gaze heatmaps, these achievements have largely relied on access to extensive labeled datasets, which demands substantial human labor.
In this paper, our goal is to reduce the reliance on the size of labeled training data for gaze target detection. To achieve this, we propose AL-GTD, an innovative approach that integrates supervised and self-supervised losses within a novel sample acquisition function to perform active learning (AL).
Additionally, it utilizes pseudo-labeling to mitigate distribution shifts during training. AL-GTD achieves the best AUC results while using only 40-50% of the training data, whereas state-of-the-art (SOTA) gaze target detectors require the entire training dataset to reach the same performance.
Importantly, AL-GTD quickly reaches satisfactory performance with only 10-20% of the training data, demonstrating the effectiveness of our acquisition function in selecting the most informative samples.
We provide a comprehensive experimental analysis by adapting several AL methods for the task. AL-GTD outperforms AL competitors, simultaneously exhibiting superior performance compared to SOTA gaze target detectors when all are trained within a low-data regime.
Code is available at https://github.com/francescotonini/al-gtd.
Primary Subject Area: [Engagement] Emotional and Social Signals
Relevance To Conference: Our paper is relevant to the ACM MM conference, particularly within the primary subject area of Emotional and Social Signals. Our focus lies on gaze target detection. As established in existing literature, including previous ACM MM papers, the gaze signal (also called the visual focus of attention) serves as a crucial cue for social interactions. Our approach augments RGB scene processing by incorporating depth information, thus presenting a multimodal approach. The main contribution of our research lies in introducing a novel active learning (AL) strategy for multimodal gaze target detection that relies on both scene images and depth maps equally. Our AL method not only advances the field but also addresses the broader multimedia community's interest by showcasing a methodology that reduces data annotation effort. Remarkably, our proposed pipeline achieves SOTA performance while requiring substantially less training data, demonstrating its efficacy and practicality.
Supplementary Material: zip
Submission Number: 1949