Speech Intention Classification with Multimodal Deep Learning

Yue Gu, Xinyu Li, Shuhong Chen, Jianyu Zhang, Ivan Marsic

Published: 01 Jan 2017, Last Modified: 12 May 2023 | Canadian Conference on AI 2017 | Readers: Everyone

Abstract: We present a novel multimodal deep learning structure that automatically extracts features from textual-acoustic data for sentence-level speech classification. Textual and acoustic features are first extracted by two independent convolutional neural networks, then combined into a joint representation, and finally fed into a softmax decision layer. We tested the proposed model in an actual medical setting, using speech recordings and their transcripts. Our model achieved 83.10% average accuracy in detecting 6 different intentions. We also found that our model, which uses automatically extracted features for intention classification, outperformed existing models that use manually engineered features.
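
The pipeline described in the abstract (two independent convolutional branches, a joint representation, a softmax decision layer) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the abstract does not give filter sizes, pooling strategy, input features, or the fusion method, so the 1D convolution with max-over-time pooling, the fusion by concatenation, the random untrained weights, and every name and dimension below are assumptions chosen only to make the data flow concrete.

```python
import math
import random


def conv1d_maxpool(seq, filters):
    """Apply each filter over time, then ReLU and max-over-time pooling.

    seq:     list of T feature vectors (one per token or audio frame)
    filters: list of filters, each a list of k weight vectors
    Returns one pooled activation per filter.
    """
    feats = []
    for filt in filters:
        k = len(filt)
        best = 0.0  # ReLU floor: activations below zero are clipped
        for t in range(len(seq) - k + 1):
            window = seq[t:t + k]
            act = sum(w * x
                      for w_row, x_row in zip(filt, window)
                      for w, x in zip(w_row, x_row))
            best = max(best, act)
        feats.append(best)
    return feats


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(text_seq, audio_seq, params):
    """Two independent branches -> joint representation -> softmax."""
    text_feat = conv1d_maxpool(text_seq, params["text_filters"])
    audio_feat = conv1d_maxpool(audio_seq, params["audio_filters"])
    joint = text_feat + audio_feat  # fusion by concatenation (assumed)
    logits = [sum(w * f for w, f in zip(row, joint)) + b
              for row, b in zip(params["W"], params["b"])]
    return softmax(logits)


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes; only NUM_CLASSES = 6 comes from the abstract.
NUM_CLASSES = 6
TEXT_DIM, AUDIO_DIM = 8, 5
N_TEXT_FILTERS, N_AUDIO_FILTERS, KERNEL = 4, 3, 2

rng = random.Random(0)
params = {
    "text_filters": [rand_matrix(KERNEL, TEXT_DIM, rng)
                     for _ in range(N_TEXT_FILTERS)],
    "audio_filters": [rand_matrix(KERNEL, AUDIO_DIM, rng)
                      for _ in range(N_AUDIO_FILTERS)],
    "W": rand_matrix(NUM_CLASSES, N_TEXT_FILTERS + N_AUDIO_FILTERS, rng),
    "b": [0.0] * NUM_CLASSES,
}

text_seq = rand_matrix(7, TEXT_DIM, rng)     # stand-in for 7 word embeddings
audio_seq = rand_matrix(20, AUDIO_DIM, rng)  # stand-in for 20 acoustic frames
probs = classify(text_seq, audio_seq, params)
print(len(probs), round(sum(probs), 6))
```

Because the weights are random and untrained, the output probabilities carry no meaning; the sketch only shows how features from the two modalities stay separate until the joint layer. A trained version would learn the filters and the final layer jointly from labeled utterances.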
