Joint Attention Estimation during Multi-party Facilitation Using Multi-Modal Fusion

Published: 2024 · Last Modified: 13 Nov 2025 · HRI (Companion) 2024 · CC BY-SA 4.0
Abstract: This paper presents an enhanced framework for joint attention estimation. Visual attention is an important non-verbal cue that facilitates human-human social interaction. For example, it is natural for humans to look at the person who is speaking and to make eye contact to signal interest in a conversation. This capability enables effective social interaction between humans and is desirable in many intelligent systems for realizing natural human-robot interaction. However, replicating humans' social gaze saliency on agents or robots is difficult because it depends heavily on the dynamic exchanges between humans and the immediate physical environment. Existing off-the-shelf commercial social robots typically implement gaze behavior with heuristic rules, such as looking at a human face or at whoever is speaking. Unfortunately, these methods are unlikely to generalize well to real-world environments, where social interaction is complicated by environmental context and social dynamics. In this paper, we aim to address these limitations by developing a multi-modal social gaze saliency method trained on human-human interaction data from a multiparty facilitation scenario. We integrate talking activity, upper-body body language, gaze angle, and head position to estimate joint attention. Such a multi-modal fusion model is expected to generalize better to different robotic platforms and to applications in the wild.
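To make the fusion idea concrete, below is a minimal sketch of one plausible late-fusion design over the four cues named in the abstract. The paper's actual architecture, feature dimensions, and output space are not specified here, so the encoder sizes, the number of candidate targets, and the class `JointAttentionFusion` are all illustrative assumptions, not the authors' method.

```python
# Hypothetical late-fusion model: everything below (input dimensions,
# layer sizes, target set) is an assumption for illustration only.
import torch
import torch.nn as nn

class JointAttentionFusion(nn.Module):
    """Fuses per-person cues (talking activity, upper-body pose, gaze
    angle, head position) into a joint-attention estimate over a set of
    candidate targets (e.g., participants or objects in the scene)."""

    def __init__(self, num_targets: int = 8, hidden: int = 64):
        super().__init__()
        # One small encoder per modality; input sizes are assumptions:
        # talking: 1-d voice-activity score, pose: 10 upper-body keypoint
        # features, gaze: 2-d (yaw, pitch), head: 3-d position.
        self.talk_enc = nn.Sequential(nn.Linear(1, hidden), nn.ReLU())
        self.pose_enc = nn.Sequential(nn.Linear(10, hidden), nn.ReLU())
        self.gaze_enc = nn.Sequential(nn.Linear(2, hidden), nn.ReLU())
        self.head_enc = nn.Sequential(nn.Linear(3, hidden), nn.ReLU())
        # Late fusion: concatenate modality embeddings, then classify
        # the attended target.
        self.classifier = nn.Linear(4 * hidden, num_targets)

    def forward(self, talk, pose, gaze, head):
        fused = torch.cat(
            [self.talk_enc(talk), self.pose_enc(pose),
             self.gaze_enc(gaze), self.head_enc(head)], dim=-1)
        return self.classifier(fused)  # logits over candidate targets

# Example: cues for a batch of 4 participants -> per-person target logits.
model = JointAttentionFusion()
logits = model(torch.rand(4, 1), torch.rand(4, 10),
               torch.rand(4, 2), torch.rand(4, 3))
print(logits.shape)  # torch.Size([4, 8])
```

A late-fusion layout like this keeps each cue's encoder swappable, which is one way such a model could transfer across robotic platforms with different sensors; whether the paper uses late fusion, attention-based fusion, or another scheme is not stated in the abstract.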