Call Support Copilot: A Reproducible Multimodal System for Speech Emotion Recognition, Intent Understanding, and Agent Assistance

Khoshimov Rakhmatillokhon; Dmitry Rudshin; YanYang Luo

13 Mar 2026 (modified: 19 May 2026) | SwissText 2026 Conference Submission | Readers: Everyone | License: CC BY 4.0

Track: Scientific Track

Keywords: speech recognition, speech emotion recognition, spoken language understanding, machine translation, reproducible NLP, customer support, multimodal systems, benchmarking

TL;DR: A reproducible customer-support copilot combines ASR, emotion recognition, translation, intent understanding, and client retrieval with strong benchmark performance and practical runtime.

Abstract: We present Call Support Copilot, a reproducible multimodal system that integrates automatic speech recognition (ASR), speech emotion recognition, machine translation, spoken language understanding, and client knowledge retrieval in a single dashboard for customer support agents. Built from publicly accessible pretrained models and standard benchmarks, the system transcribes speech with Whisper-family ASR, detects caller affect along valence, arousal, and dominance dimensions, classifies intents from a banking-domain inventory of 77 categories, and retrieves client records from a database. Component-level evaluation yields 6.6\% word error rate on LibriSpeech, 91.7\% macro-F1 on SUPERB ER session1 (IEMOCAP subset, $n$=6), 42.98 BLEU for German--English translation, and 87.0\% accuracy on BANKING77 intent classification. In end-to-end benchmarks, the core pipeline runs faster than real time, with a mean real-time factor of 0.67--0.71. All model identifiers, configurations, and evaluation scripts are documented in an accompanying repository, supporting reproducibility in line with the SwissText 2026 theme.
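
For readers unfamiliar with two of the metrics quoted in the abstract, the following is a minimal sketch of how word error rate and real-time factor are conventionally defined. It is not taken from the accompanying repository: the function names, the whitespace tokenization, and the absence of text normalization are all assumptions made for illustration, and the authors' evaluation scripts may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: plain whitespace tokenization with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds
```

Under this definition, a real-time factor of 0.67 means that 10 seconds of audio take about 6.7 seconds to process.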

Submission Number: 13
