Optimizing Thai-English Spoken Question Answering Interaction for Open Environments with Limited Resources

Published: 2025, Last Modified: 03 Dec 2025, Venue: ICDAR (3) 2025, License: CC BY-SA 4.0
Abstract: This paper introduces TamMe, a Thai-English spoken question answering (QA) system designed for robust real-time interaction in open environments under constrained computational resources. Unlike conventional text-based QA systems, TamMe supports natural speech interaction by integrating bilingual automatic speech recognition (ASR), semantic FAQ retrieval, text-to-speech (TTS) synthesis, and multimodal avatar-driven responses. TamMe improves efficiency and accuracy through techniques including Faster-Whisper accelerated with pre-ASR language identification, reducing GPU memory usage by approximately 60% while achieving word error rates of 5.72% for Thai and 8.54% for English. To address noisy real-world conditions in open environments, TamMe employs spatial-cue preserving speech enhancement and multi-stage adaptive noise suppression, achieving a retrieval Precision@1 of 89.7%. Semantic retrieval is optimized via multilingual embedding-based indexing, which supports accurate responses to paraphrased and multilingual spoken queries. Additionally, speaker-adaptive TTS and precomputed SadTalker-based 3D avatar animations enable visually expressive, synchronized multimodal interaction without compromising real-time responsiveness. Experimental evaluations in a realistic kiosk setup demonstrate that TamMe balances accuracy, computational efficiency, and user engagement.
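For readers unfamiliar with the two metrics quoted in the abstract, the sketch below shows how word error rate and Precision@1 are conventionally computed. It is a minimal illustration of the standard definitions, not the authors' evaluation code. The function names `word_error_rate` and `precision_at_1` are hypothetical, and the inputs are assumed to be already tokenized; Thai is written without spaces between words, so the choice of word segmenter (not specified in the abstract) affects the resulting number.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word-level edit distance divided by the number of reference tokens.

    Assumes both inputs are pre-tokenized word lists (hypothetical interface).
    """
    if not reference:
        raise ValueError("reference must contain at least one token")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis tokens.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(reference)


def precision_at_1(ranked_ids: Sequence[Sequence[str]], gold_ids: Sequence[str]) -> float:
    """Fraction of queries whose top-ranked FAQ entry is the gold entry."""
    if not gold_ids or len(ranked_ids) != len(gold_ids):
        raise ValueError("need one ranked list per gold id, and at least one query")
    hits = sum(
        1 for ranked, gold in zip(ranked_ids, gold_ids)
        if ranked and ranked[0] == gold
    )
    return hits / len(gold_ids)
```

As a usage illustration with made-up inputs, `word_error_rate(["a", "b", "c", "d"], ["a", "x", "c"])` counts one substitution and one deletion against four reference tokens, and `precision_at_1([["f1", "f2"], ["f3"], []], ["f1", "f2", "f9"])` counts one hit out of three queries.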