Natural Language Dialog System Considering Speaker's Emotion Calculated from Acoustic Features

Takumi Takahashi, Kazuya Mera, Tang Ba Nhat, Yoshiaki Kurosawa, Toshiyuki Takezawa

Published: 2016, Last Modified: 29 Jul 2025, IWSDS 2016, CC BY-SA 4.0

Abstract: With the development of Interactive Voice Response (IVR) systems, people can not only operate computer systems through task-oriented conversation but also enjoy non-task-oriented conversation with the computer. When an IVR system generates a response, it usually refers only to the verbal information in the user’s utterance. However, when a person gloomily says “I’m fine,” people respond not with “That’s wonderful” but with “Really?” or “Are you OK?” because they consider both verbal and non-verbal information, such as tone of voice, facial expressions, and gestures. In this article, we propose an intelligent IVR system that considers not only verbal but also non-verbal information. To estimate a speaker’s emotion (positive, negative, or neutral), 384 acoustic features extracted from the speaker’s utterance are used as input to a machine learning classifier (a support vector machine, SVM). Artificial Intelligence Markup Language (AIML)-based response generation rules are extended so that they can take the speaker’s emotion into account. In the experiment, subjects felt that the proposed dialog system was more likable and enjoyable and that it did not give machine-like reactions.
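
The idea of conditioning a response rule on the estimated emotion can be illustrated with a minimal sketch. It is not the authors' AIML rule set: the rule table, the `respond` and `normalize` helpers, the fallback reply, and the "Good to hear." reply are all hypothetical, and the emotion label is assumed to arrive from a separate classifier (the SVM in the abstract) that is not reproduced here.

```python
from typing import Dict, Tuple

# Hypothetical rule table keyed by (normalized utterance, emotion label).
# The three labels follow the abstract; patterns and replies are illustrative only.
RULES: Dict[Tuple[str, str], str] = {
    ("I'M FINE", "positive"): "That's wonderful.",
    ("I'M FINE", "neutral"): "Good to hear.",
    ("I'M FINE", "negative"): "Really? Are you OK?",
}

# Assumed default reply when no rule matches the utterance at all.
FALLBACK = "I see."


def normalize(utterance: str) -> str:
    """Uppercase the text, unify apostrophes, and drop trailing punctuation."""
    text = utterance.replace("\u2019", "'").strip()
    return text.rstrip(".!?").strip().upper()


def respond(utterance: str, emotion: str) -> str:
    """Pick a reply using both the words and the estimated emotion.

    If no rule exists for the given emotion, fall back to the neutral rule
    for the same utterance, and finally to a generic reply.
    """
    pattern = normalize(utterance)
    if (pattern, emotion) in RULES:
        return RULES[(pattern, emotion)]
    return RULES.get((pattern, "neutral"), FALLBACK)


if __name__ == "__main__":
    # The same words yield different replies depending on the emotion label.
    print(respond("I'm fine.", "positive"))
    print(respond("I'm fine.", "negative"))
```

Keying the lookup on the pair of utterance and emotion keeps the verbal pattern matching unchanged while letting the non-verbal estimate select among alternative replies, which is the behavior the "I'm fine" example describes.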