Improving Multi-Modal Emotion Recognition Using Entropy-Based Fusion and Pruning-Based Network Architecture Optimization
Abstract: In this study, we aim to improve our recent hierarchical information fusion system for the Multimodal Emotion Recognition Challenge (MER 2023) in both efficiency and performance. Specifically, we extract robust acoustic and visual representations from pre-trained models and fuse them in several different structures. We then propose an entropy-based fusion approach that derives the final emotion and valence predictions from the multi-label predictions of all feature fusion structures. Furthermore, to reduce network redundancy and improve model generalization under low-resource multi-modal data conditions, we propose a novel approach that progressively optimizes the network structure via structured pruning and learning-rate rewinding. When tested on the MER 2023 dataset, the optimized network structure with entropy-based fusion yields consistent and significant improvements, outperforming the champion system of the MER-MULTI sub-challenge.
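The abstract does not specify the exact entropy-based fusion rule, so the following is a minimal sketch of one plausible scheme: each fusion branch's class-probability output is weighted by the inverse of its prediction entropy, so that more confident branches contribute more to the final prediction. The function names and the branch probabilities are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p))

def entropy_weighted_fusion(branch_probs):
    """Fuse per-branch class probabilities, weighting each branch by the
    inverse of its prediction entropy (lower entropy -> more confident
    -> larger weight). Weights are normalized to sum to one."""
    weights = np.array([1.0 / (entropy(p) + 1e-12) for p in branch_probs])
    weights /= weights.sum()
    fused = np.zeros_like(branch_probs[0])
    for w, p in zip(weights, branch_probs):
        fused += w * p
    return fused

# Three hypothetical fusion branches predicting over 4 emotion classes
branches = [
    np.array([0.70, 0.10, 0.10, 0.10]),  # confident branch
    np.array([0.40, 0.30, 0.20, 0.10]),  # less confident branch
    np.array([0.25, 0.25, 0.25, 0.25]),  # maximally uncertain branch
]
fused = entropy_weighted_fusion(branches)
print(fused.argmax())  # index of the fused emotion prediction
```

Under this scheme a maximally uncertain branch (uniform probabilities) receives the smallest weight, which is the intuitive behavior one would want from confidence-aware fusion; other variants (e.g., entropy-thresholded branch selection) would fit the same description in the abstract.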