Multimodal evaluation of customer satisfaction from voicemails using speech and language representations

Published: 01 Jan 2025, Last Modified: 20 May 2025, Digit. Signal Process. 2025, License: CC BY-SA 4.0
Abstract: Customer satisfaction (CS) evaluation in call centers is essential for assessing service quality but commonly relies on human evaluators. Automatic evaluation systems can perform CS analyses at scale, enabling the assessment of much larger datasets. This paper addresses CS analysis through a multimodal approach that employs speech and language representations derived from real-world voicemails. Additionally, given the similarity between evaluating a provided service (which may elicit different emotions in customers) and automatic emotion recognition from speech, we also explore emotion recognition on the well-known IEMOCAP corpus, which comprises four classes corresponding to different emotional states. We incorporate a language representation with word embeddings based on a CNN-LSTM model, together with three self-supervised learning (SSL) speech encoders, namely Wav2Vec2.0, HuBERT, and WavLM. A bidirectional alignment network based on attention mechanisms synchronizes the speech and language representations, and three different fusion strategies are explored. According to our results, the GGF model outperforms both unimodal and other multimodal methods on the 4-class emotion recognition task on the IEMOCAP dataset and on the binary CS classification task on the KONECTADB dataset. The study also demonstrates the superior performance of our methodology compared to previous works on KONECTADB in both unimodal and multimodal approaches.
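The abstract describes the bidirectional alignment between speech and language representations only at a high level. The following is a minimal sketch of one plausible realization, dot-product cross-attention applied in both directions followed by fusion by concatenation; all dimensions, the fusion choice, and the function names are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_alignment(speech, text):
    """Align speech frames and text tokens via dot-product attention in both directions.

    speech: (T_s, d) frame-level speech embeddings (e.g. from an SSL encoder)
    text:   (T_t, d) token-level word embeddings (e.g. from a CNN-LSTM)
    """
    scores = speech @ text.T                              # (T_s, T_t) similarity matrix
    text_per_frame = softmax(scores, axis=1) @ text       # text summary for each speech frame
    speech_per_token = softmax(scores.T, axis=1) @ speech # speech summary for each text token
    # Fuse each modality with its cross-modal summary by concatenation.
    speech_fused = np.concatenate([speech, text_per_frame], axis=1)   # (T_s, 2d)
    text_fused = np.concatenate([text, speech_per_token], axis=1)     # (T_t, 2d)
    return speech_fused, text_fused

rng = np.random.default_rng(0)
s = rng.normal(size=(50, 8))   # e.g. 50 speech frames, embedding dim 8
t = rng.normal(size=(12, 8))   # e.g. 12 word tokens, embedding dim 8
sf, tf = bidirectional_alignment(s, t)
print(sf.shape, tf.shape)      # (50, 16) (12, 16)
```

In a full system, the fused sequences would typically be pooled and passed to a classifier for the emotion or CS label; gated fusion variants would replace the plain concatenation with learned gates weighting each modality's contribution.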