Multimodal Engagement Prediction in Human-Robot Interaction Using Transformer Neural Networks

Published: 01 Jan 2025 · Last Modified: 09 Apr 2025 · MMM (5) 2025 · CC BY-SA 4.0
Abstract: Engagement estimation in human-robot interaction (HRI) remains a challenging task, yet it is essential for maintaining continuous, spontaneous communication between humans and social robots by gauging user engagement levels. Users can, to some extent, pretend to be engaged in a conversation with a robot, so it is crucial to analyse other viable cues obtainable from video. Some recent studies have used only a single modality to estimate user engagement, particularly audio or visual ones. Meanwhile, the use of emotions has not been extensively explored, even though they may provide critical information, such as behavioural patterns and facial expressions, that allows for a better understanding of engagement levels. In this paper, we propose a framework that utilises Transformer-based models to demonstrate the effectiveness of a multimodal architecture for engagement prediction in HRI. Experimentation on the UE-HRI dataset, a real-life dataset of users communicating spontaneously with a social robot in a dynamic environment, demonstrated the efficacy of a fully Transformer-based architecture compared with other standard models described in the existing literature. An online-mode assessment showed the feasibility of predicting user engagement in real-time HRI scenarios.
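To make the idea of multimodal Transformer-based engagement prediction concrete, the following is a minimal PyTorch sketch. The fusion strategy (token-level concatenation of modality streams), the feature dimensions, the emotion-feature input, and the number of engagement classes are all assumptions for illustration; the abstract does not specify the paper's actual architecture.

```python
# Minimal sketch of a multimodal Transformer for engagement prediction.
# All hyperparameters and the fusion scheme are illustrative assumptions,
# not the architecture described in the paper.
import torch
import torch.nn as nn

class MultimodalEngagementTransformer(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=512, emotion_dim=32,
                 d_model=256, n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        # Project each modality's per-frame features into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.emotion_proj = nn.Linear(emotion_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, audio, visual, emotion):
        # Each input: (batch, time, feature_dim). Concatenate the projected
        # modality streams along the sequence axis so self-attention can
        # fuse information across modalities and time.
        tokens = torch.cat([self.audio_proj(audio),
                            self.visual_proj(visual),
                            self.emotion_proj(emotion)], dim=1)
        fused = self.encoder(tokens)
        # Mean-pool over the sequence, then predict the engagement level.
        return self.classifier(fused.mean(dim=1))

# Usage with dummy feature sequences (batch of 2, 50 frames each).
model = MultimodalEngagementTransformer()
a = torch.randn(2, 50, 128)   # audio features
v = torch.randn(2, 50, 512)   # visual features
e = torch.randn(2, 50, 32)    # emotion features
logits = model(a, v, e)       # shape: (2, n_classes)
```

For online (real-time) assessment, such a model would be run on a sliding window of recent frames, trading window length against prediction latency.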