Enhancing Emotion Reasoning for Image Multi-Emotion Prediction

Published: 01 Jan 2025, Last Modified: 01 Aug 2025 · ICASSP 2025 · CC BY-SA 4.0
Abstract: Image multi-emotion prediction aims to identify the set of emotions an image evokes in viewers. In the real world, individual cognitive differences mean that different viewers may experience different emotions from the same image. Most existing research focuses primarily on analyzing image features, which remain at the perceptual level and yield only a superficial understanding of emotion. To address this gap, we propose an Emotional Reasoning Chain (EReC) built on a multimodal large language model, which learns both perception and reasoning abilities for multi-emotion prediction. Specifically, we design a parameter-efficient fine-tuning paradigm with three task instructions: a perception, a reasoning, and a prediction instruction. The paradigm applies these targeted instructions progressively within a domain, optimizing the model's capabilities in image perception, cognitive reasoning, and multi-emotion prediction in turn. Furthermore, to alleviate hallucinations in large models, a Reason-level Alignment Score (RAS) is introduced to guide the model toward closer alignment with human cognition. Experiments on four datasets demonstrate that EReC, which emulates the human cognitive process of viewing and interpreting images and uses progressive tuning to integrate the perceptual, reasoning, and predictive capabilities of large models, achieves superior performance.
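The progressive three-instruction paradigm described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the instruction templates, the `model_update` placeholder, and the strict stage ordering are hypothetical, since the abstract does not specify the actual prompts, model, or fine-tuning procedure.

```python
# Hypothetical sketch: three fine-tuning stages applied progressively,
# one per capability (perception -> reasoning -> prediction).
STAGES = ["perception", "reasoning", "prediction"]

def build_instruction(stage, image_id):
    """Return an illustrative stage-specific instruction for one image."""
    templates = {
        "perception": f"Describe the visual content of image {image_id}.",
        "reasoning":  f"Explain which emotions image {image_id} could evoke and why.",
        "prediction": f"List all emotions evoked by image {image_id}.",
    }
    return templates[stage]

def progressive_finetune(model_update, images):
    """Run the stages in order; `model_update` stands in for one
    parameter-efficient fine-tuning step on a single instruction."""
    history = []
    for stage in STAGES:          # stages are trained successively, not jointly
        for img in images:
            model_update(stage, build_instruction(stage, img))
            history.append((stage, img))
    return history

# usage: each stage is completed over the data before the next begins
history = progressive_finetune(lambda stage, instr: None, ["img_0", "img_1"])
```

The point of the sketch is the curriculum structure: each capability is tuned to convergence before the next instruction set is introduced, rather than mixing all three instruction types in a single training run.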