Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning
Abstract: Highlights
• A multimodal emotion recognition framework with various self-attention mechanisms.
• An audio-video fusion strategy which uses cross-attention.
• A learnable emotional metric that extends the traditional triplet loss function.
• An extensive objective evaluation is performed on RAVDESS and CREMA-D datasets.
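
For reference, the traditional triplet loss that the learnable emotional metric extends is commonly written as below. This is the standard textbook form, not the paper's learnable metric, which the highlights do not specify; the symbols $f$, $a$, $p$, $n$, $d$ and $m$ are notation assumed here for illustration.

\[
\mathcal{L}_{\text{triplet}}(a, p, n) = \max\bigl(0,\; d(f(a), f(p)) - d(f(a), f(n)) + m\bigr)
\]

Here $f$ is an embedding network, $a$ is an anchor sample, $p$ is a sample with the same emotion label as the anchor, $n$ is a sample with a different label, $d$ is a fixed distance such as the Euclidean distance, and $m > 0$ is a margin. The loss is zero once the negative is farther from the anchor than the positive by at least $m$.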