Abstract: Spatio-temporal features are unique to video, compared with still images, and are an important source of information for video content analysis. Introducing spatio-temporal features has significantly changed affective computing, which is no longer limited to single images and no longer ignores temporal changes in affect. However, many existing affective computing models still do not use spatio-temporal features; for example, facial emotion classification typically takes a single image as model input. We therefore propose a novel neural network that takes video as input and extracts feature information containing spatio-temporal features from it. The extracted features cover two modalities, audio spatio-temporal features and image spatio-temporal features, which are combined by decision-level fusion to perform emotion classification. A VGG-based feature extractor serves as the audio emotion feature extraction module; it extracts spectrogram features while attending to their temporal structure. An I3D-based feature extractor, built from 3D convolutional blocks, serves as the image emotion feature extraction module. Compared with 2D convolutional blocks, 3D blocks convolve over both the spatial and the temporal dimensions of the frames, yielding image features that contain spatio-temporal information. By fusing multiple spatio-temporal feature classifiers in the best-performing configuration, our model achieves an accuracy of 62.41%. The experimental results show that features containing spatio-temporal information are more effective for emotion classification than features derived from images alone.
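The decision-level fusion described above can be illustrated with a minimal sketch: each modality's classifier produces class scores independently, and only those scores are combined. The weighted average below (and the `w_audio` weight) is an assumed fusion rule for illustration; the abstract does not specify the exact combination used.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def decision_level_fusion(audio_logits, image_logits, w_audio=0.5):
    # Decision-level fusion: each modality classifier runs to completion,
    # then their class probabilities are combined. The weighted average
    # and w_audio are hypothetical choices, not the paper's stated rule.
    p_audio = softmax(audio_logits)
    p_image = softmax(image_logits)
    fused = w_audio * p_audio + (1.0 - w_audio) * p_image
    return fused.argmax(axis=-1)

# Toy example: two clips, three emotion classes; in the second clip
# the two modalities disagree and the fused score decides.
audio = np.array([[2.0, 0.5, 0.1], [0.2, 0.1, 1.5]])
image = np.array([[1.8, 0.3, 0.2], [0.1, 2.2, 1.4]])
print(decision_level_fusion(audio, image))
```

Because fusion happens after each classifier, the audio and image branches can be trained and tuned independently, which is the usual motivation for decision-level over feature-level fusion.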