Abstract: In this study we investigate the effectiveness of deep neural networks in predicting valence and arousal solely from the visual information of video sequences. Several recent Convolutional Neural Network (CNN) and Transformer architectures are used as the backbone of the proposed model. We also assess the impact of pretraining on model performance by comparing models trained from scratch against pre-trained ones. Experimental results on the One-Minute Gradual-Emotion Recognition Challenge dataset suggest that pre-training on emotion recognition datasets is beneficial for most models. A comparison with the state of the art reveals similar performance in valence Concordance Correlation Coefficient (CCC) and lower performance in arousal CCC; however, the predictions in our experiments are not statistically different in most cases. The study concludes by emphasizing the complexity of video emotion recognition and the need for further research to enhance the robustness and accuracy of emotion recognition models. The source code used for the experiments is made publicly available.
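For reference, the CCC mentioned above is the standard concordance measure between a prediction sequence $x$ and ground-truth annotations $y$; the formulation below is the commonly used definition and is not restated from this abstract itself:

$$\mathrm{CCC}(x, y) = \frac{2\,\rho\,\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}$$

where $\rho$ is the Pearson correlation between $x$ and $y$, $\mu_x, \mu_y$ are their means, and $\sigma_x^2, \sigma_y^2$ their variances. Unlike Pearson correlation alone, CCC also penalizes differences in mean and scale, which is why it is the usual metric for continuous valence and arousal prediction.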