Towards Engagement Prediction: A Cross-Modality Dual-Pipeline Approach using Visual and Audio Features

Published: 2024 · Last Modified: 13 Nov 2024 · ACM Multimedia 2024 · CC BY-SA 4.0
Abstract: Engagement estimation is crucial for advancing natural human-computer interaction, allowing artificial agents to dynamically adjust their responses to user engagement levels and creating more intuitive and immersive experiences. Despite advances in automating real-time engagement estimation, challenges persist in real-world scenarios because of the complex nature of multi-modal human social signals. This paper proposes a novel cross-modality fusion-based methodology that addresses these challenges by leveraging multi-modal data. Our approach integrates visual and audio features, such as facial motion, acoustic characteristics, Contrastive Language-Image Pretraining (CLIP) embeddings, and semantic embeddings. Each feature stream first passes through a transformer encoder; the encoded streams are then combined through a cross-modal fusion mechanism that ensures robust integration, and the fused features are used to predict engagement scores. This hierarchical and self-normalizing approach improves the accuracy of engagement estimation by effectively capturing dependencies within and between modalities. Experiments conducted on the MultiMediate challenge's NoXi and MPIIGroupInteraction datasets demonstrate competitive performance in estimating engagement levels, addressing the complex, context-dependent nature of human engagement. Specifically, our approach achieves a global Concordance Correlation Coefficient (CCC) score approximately 56.1% higher than the baseline. This work contributes to developing more intelligent and responsive artificial systems, enhancing user experiences across various interactive applications.
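To make the described pipeline concrete, the sketch below shows one way the dual-pipeline, cross-modal fusion architecture could be realized in PyTorch. The class name, feature dimensions, layer counts, the bidirectional cross-attention design, and the mean-pooling step are all illustrative assumptions based on the abstract, not the authors' released implementation.

```python
# Minimal sketch of a dual-pipeline cross-modal fusion model, assuming
# PyTorch. All dimensions and design choices here are hypothetical.
import torch
import torch.nn as nn

class CrossModalEngagementModel(nn.Module):
    def __init__(self, visual_dim=512, audio_dim=128, d_model=256, n_heads=4):
        super().__init__()
        # Per-modality projections into a shared model dimension.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        # One transformer encoder per modality (the "dual pipeline").
        def enc_layer():
            return nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads, batch_first=True)
        self.visual_enc = nn.TransformerEncoder(enc_layer(), num_layers=2)
        self.audio_enc = nn.TransformerEncoder(enc_layer(), num_layers=2)
        # Cross-modal fusion: each modality attends to the other.
        self.v2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Regression head mapping fused features to an engagement score.
        self.head = nn.Sequential(nn.Linear(2 * d_model, d_model),
                                  nn.ReLU(), nn.Linear(d_model, 1))

    def forward(self, visual, audio):
        # visual: (B, T_v, visual_dim); audio: (B, T_a, audio_dim)
        v = self.visual_enc(self.visual_proj(visual))
        a = self.audio_enc(self.audio_proj(audio))
        # Cross-attention in both directions, then pool over time.
        v_att, _ = self.v2a(v, a, a)   # visual queries attend to audio
        a_att, _ = self.a2v(a, v, v)   # audio queries attend to visual
        fused = torch.cat([v_att.mean(dim=1), a_att.mean(dim=1)], dim=-1)
        return self.head(fused).squeeze(-1)  # (B,) engagement scores
```

Under these assumptions, a forward pass with visual features of shape (batch, frames, 512) and audio features of shape (batch, steps, 128) yields one continuous engagement score per sequence.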
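For reference, the Concordance Correlation Coefficient used as the evaluation metric has a standard closed form; for predictions $\hat{y}$ and labels $y$:

```latex
\mathrm{CCC}(y, \hat{y}) =
  \frac{2\,\rho\,\sigma_{y}\sigma_{\hat{y}}}
       {\sigma_{y}^{2} + \sigma_{\hat{y}}^{2} + (\mu_{y} - \mu_{\hat{y}})^{2}}
```

where $\rho$ is the Pearson correlation, $\mu$ the means, and $\sigma^2$ the variances; CCC reaches 1 only under perfect agreement, penalizing both low correlation and systematic bias in scale or location.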