A Cross-Multi-modal Fusion Approach for Enhanced Engagement Recognition

Published: 01 Jan 2024, Last Modified: 28 Mar 2025. SPECOM (2) 2024. License: CC BY-SA 4.0.
Abstract: Engagement recognition enables more natural and responsive human-computer interaction by allowing systems to monitor and adapt to a person’s engagement level. However, developing efficient real-time engagement recognition systems remains challenging. This research proposes a multi-modal engagement recognition approach enhanced with affective embeddings to address current limitations. Several computationally efficient deep learning models are developed to process facial, body, and emotional cues from video. Additionally, a novel cross-multi-modal fusion approach combines the modalities using a cross-attention mechanism. Extensive experiments on two datasets analyze the impact of temporal context, showing that longer input sequences significantly improve recognition performance. Furthermore, the results demonstrate that the proposed multi-modal approach achieves notably higher recognition performance than the individual modalities and outperforms modern engagement recognition frameworks, with performance comparable to the winner of the Multimediate’23 challenge. Thus, by appropriately modeling visual engagement dynamics, the introduced multi-modal framework enhances real-time engagement recognition and advances human-computer interaction.
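
The abstract does not detail the fusion architecture. As a rough, hypothetical sketch of the kind of cross-attention fusion it describes, the PyTorch snippet below lets one modality's token sequence (here, facial embeddings) attend to the concatenated embeddings of the other modalities before pooling for classification; all class names, dimensions, and the choice of query modality are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical cross-attention fusion of per-modality embeddings.

    A sketch only: the paper's actual architecture, dimensions, and
    query/context assignment are not specified in the abstract.
    """

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 2):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, face: torch.Tensor, body: torch.Tensor,
                affect: torch.Tensor) -> torch.Tensor:
        # face/body/affect: (batch, seq_len, dim) outputs of per-modality encoders
        context = torch.cat([body, affect], dim=1)              # keys/values: other modalities
        attended, _ = self.cross_attn(face, context, context)   # face tokens query the context
        fused = self.norm(face + attended)                      # residual connection + layer norm
        pooled = fused.mean(dim=1)                              # average over the temporal axis
        return self.head(pooled)                                # engagement logits

# Usage with random stand-in embeddings: a batch of 8 clips, 32 frames each.
if __name__ == "__main__":
    model = CrossModalFusion()
    face = torch.randn(8, 32, 256)
    body = torch.randn(8, 32, 256)
    affect = torch.randn(8, 32, 256)
    print(model(face, body, affect).shape)  # torch.Size([8, 2])
```

Longer input sequences, as the experiments on temporal context suggest, would simply increase `seq_len` in this sketch, giving the attention mechanism more frames of engagement dynamics to draw on.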