Applying Segment-Level Attention on Bi-Modal Transformer Encoder for Audio-Visual Emotion Recognition

Published: 01 Jan 2023, Last Modified: 12 May 2024. IEEE Trans. Affect. Comput. 2023. License: CC BY-SA 4.0
Abstract: Emotions can be expressed through multiple complementary modalities. This study uses speech and facial expressions as the modalities for recognizing emotion. Current audiovisual emotion recognition models perform supervised learning on signal-level inputs and are presumed to characterize the temporal relationships within the signals. In this study, supervised learning was instead performed on segment-level signals, which are more granular than whole-signal inputs, to train an emotion recognition model more precisely. Effectively fusing multimodal signals is challenging. Sequential segments of the audiovisual signals were obtained, and the extracted features were used by a neural tensor network to estimate segment-level attention weights according to the emotional consistency of the two modalities. The proposed bimodal Transformer encoder was trained with both signal-level and segment-level emotion labels, incorporating temporal context into the signals to improve on existing emotion recognition models. In bimodal emotion recognition, the experimental results demonstrated that the proposed method achieved 74.31% accuracy (3.05% higher than the method of fusing correlation features) on the audiovisual emotion dataset BAUM-1 under fivefold cross-validation, and 76.81% accuracy (2.57% higher than the Multimodal Transformer Encoder) on the multimodal emotion dataset CMU-MOSEI, which provides fixed training, validation, and test splits.
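The abstract names two components: a neural tensor network (NTN) that scores the emotional consistency of paired audio and visual segment features to produce segment-level attention weights, and a bimodal Transformer encoder supervised with both segment-level and signal-level labels. The sketch below is a minimal, hypothetical illustration of that pipeline, not the authors' implementation; all dimensions, layer sizes, and the fusion-by-concatenation choice are assumptions for illustration.

```python
# Hypothetical sketch of segment-level attention via a neural tensor network (NTN)
# feeding a bimodal Transformer encoder. Shapes and hyperparameters are assumed.
import torch
import torch.nn as nn


class NeuralTensorNetwork(nn.Module):
    """Bilinear NTN score: u^T tanh(a^T W_k v + V [a; v] + b)."""
    def __init__(self, dim: int, k: int = 4):
        super().__init__()
        self.W = nn.Parameter(torch.randn(k, dim, dim) * 0.01)  # bilinear tensor slices
        self.V = nn.Linear(2 * dim, k)                           # linear term on [a; v]
        self.u = nn.Linear(k, 1)                                 # scoring vector

    def forward(self, a, v):
        # a, v: (batch, segments, dim) segment-level audio / visual features
        bilinear = torch.einsum('bsd,kde,bse->bsk', a, self.W, v)
        linear = self.V(torch.cat([a, v], dim=-1))
        return self.u(torch.tanh(bilinear + linear)).squeeze(-1)  # (batch, segments)


class BimodalSegmentTransformer(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, layers: int = 2, classes: int = 7):
        super().__init__()
        self.ntn = NeuralTensorNetwork(dim)
        layer = nn.TransformerEncoderLayer(d_model=2 * dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.segment_head = nn.Linear(2 * dim, classes)  # segment-level emotion labels
        self.signal_head = nn.Linear(2 * dim, classes)   # signal-level emotion label

    def forward(self, audio_segs, visual_segs):
        # Segment-level attention weights from audio-visual consistency scores.
        alpha = torch.softmax(self.ntn(audio_segs, visual_segs), dim=1)      # (B, S)
        fused = torch.cat([audio_segs, visual_segs], dim=-1) * alpha.unsqueeze(-1)
        ctx = self.encoder(fused)                          # temporal context over segments
        segment_logits = self.segment_head(ctx)            # per-segment supervision
        signal_logits = self.signal_head(ctx.mean(dim=1))  # whole-signal prediction
        return signal_logits, segment_logits
```

Under this reading, the two classification heads allow a joint loss over segment-level and signal-level labels, which is how the abstract describes incorporating finer-grained supervision alongside the whole-signal objective; the actual weighting of the two losses is not specified in the abstract.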