Online Multimodal End-of-Turn Prediction for Three-party Conversations

Meng-Chen Lee, Zhigang Deng

Published: 04 Nov 2024, Last Modified: 12 Nov 2025, Venue: OpenReview Archive Direct Upload, Readers: Everyone, License: CC BY 4.0

Abstract: Predicting end-of-turn in multiparty conversations is crucial for improving the usability and natural flow of spoken dialogue systems and conversational agents. We present a novel window-based method that predicts end-of-turn moments in multiparty conversations in real time, leveraging pre-trained language models (PLMs) and recurrent neural networks (RNNs). Our method fuses the DistilBERT language model with a Gated Recurrent Unit (GRU) to accurately predict end-of-turn points in an online fashion. Our approach can significantly outperform conventional Inter-Pausal Unit (IPU)-based prediction methods, which often overlook the nuances of overlap and interruption in dynamic conversations. Potential applications include virtual agents and human-robot interaction, where our online end-of-turn prediction model can be used to enhance the user experience by making these systems more natural and better integrated into real-world conversations.