A More Efficient Inference Model for Multimodal Emotion Recognition

Published: 05 Sept 2024, Last Modified: 16 Oct 2024, ACML 2024 Conference Track, CC BY 4.0
Keywords: emotion analysis, multimodal, efficient, inference
Abstract: With the widespread adoption of the Internet, and the mobile Internet in particular, a growing number of people express their emotions on short-video platforms. Contemporary multimodal emotion analysis techniques recognize and understand emotions more comprehensively by analyzing multiple data sources, including text, facial expressions, audio, and hand gestures; consequently, sentiment analysis is becoming increasingly important. However, existing research indicates that most emotion analysis techniques are not fast or efficient enough to keep pace with the exponential growth of short-video content. In addition, most sentiment analysis models exhibit large disparities in the contribution of each modality, with the text and visual modalities typically exerting greater influence than the audio modality. Furthermore, in pursuit of higher accuracy, some models are designed to be exceedingly complex, while others prioritize fast inference at the expense of accuracy. This paper proposes a more efficient multimodal sentiment analysis model with three distinct advantages. First, we propose residual-free connectivity modules that extract 3-D attention weights to process visual features, maintaining accuracy while improving inference efficiency. Second, we adopt multi-scale hierarchical context aggregation (aggregation followed by interaction) for the audio modality, capturing coarse- and fine-grained audio context through multilevel aggregation, thereby enriching audio features and reducing the disparity between modality contributions. Finally, the model attains a superior balance between accuracy and speed, improving its suitability for the fast-paced short-video environment and meeting the growing demand for video content processing.
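The abstract does not give the exact formulation of the residual-free 3-D attention module, so the following is only a minimal illustrative sketch of one well-known parameter-free approach to producing 3-D (per-channel, per-position) attention weights from an energy-based measure of each activation's distinctiveness, in the style of SimAM. The function name, shapes, and constants are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def parameter_free_3d_attention(x, eps=1e-4):
    """Compute 3-D attention weights for a feature map x of shape (C, H, W).

    Each activation is weighted by how much it deviates from its channel's
    mean (an energy-based saliency score), passed through a sigmoid gate.
    No learnable parameters or residual connections are involved.
    """
    C, H, W = x.shape
    n = H * W - 1
    mu = x.mean(axis=(1, 2), keepdims=True)       # per-channel mean
    d = (x - mu) ** 2                             # squared deviation from mean
    v = d.sum(axis=(1, 2), keepdims=True) / n     # per-channel variance estimate
    e_inv = d / (4.0 * (v + eps)) + 0.5           # inverse-energy saliency score
    return 1.0 / (1.0 + np.exp(-e_inv))           # sigmoid gate in (0, 1)

# Refine visual features by element-wise gating with the 3-D weights.
x = np.random.rand(8, 16, 16).astype(np.float32)
w = parameter_free_3d_attention(x)
y = x * w  # same shape as x: (8, 16, 16)
```

Because the weights are computed analytically from the feature statistics rather than by extra learned layers, such a module adds almost no inference cost, which is consistent with the paper's stated goal of maintaining accuracy while improving efficiency.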
Primary Area: Applications (bioinformatics, biomedical informatics, climate science, collaborative filtering, computer vision, healthcare, human activity recognition, information retrieval, natural language processing, social networks, etc.)
Student Author: No
Submission Number: 25