Catch Your Emotion: Sharpening Emotion Perception in Multimodal Large Language Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Spotlight Poster · CC BY 4.0
TL;DR: We propose SEPM to enhance MLLMs' emotion recognition by refining classification through a two-stage inference process and reducing visual redundancy, offering a scalable, resource-efficient solution.
Abstract: Multimodal large language models (MLLMs) have achieved impressive progress in tasks such as visual question answering and visual understanding, but they still face significant challenges in emotional reasoning. Current methods for enhancing emotional understanding typically rely on fine-tuning or manual annotations, which are resource-intensive and limit scalability. In this work, we focus on improving the ability of MLLMs to capture emotions during the inference phase. Specifically, MLLMs encounter two main issues: they struggle to distinguish between semantically similar emotions, leading to misclassification, and they are overwhelmed by redundant or irrelevant visual information, which distracts them from key emotional cues. To address these issues, we propose Sharpening Emotion Perception in MLLMs (SEPM), which incorporates a Confidence-Guided Coarse-to-Fine Inference framework that refines emotion classification by guiding the model through simpler tasks. Additionally, SEPM employs Focus-on-Emotion Visual Augmentation to reduce visual redundancy by directing the model's attention to the emotionally relevant cues in an image. Experimental results demonstrate that SEPM significantly improves MLLM performance on emotion-related tasks, providing a resource-efficient and scalable solution for emotion recognition.
Lay Summary: Although AI models have become more powerful at understanding images and language together, they still struggle to recognize human emotions accurately. For example, these models often confuse similar emotions like "joy" and "excitement," or get distracted by irrelevant parts of an image. To solve this, we propose a new method called SEPM that helps AI focus better on emotional content—without needing more training or labeled data. First, our method breaks the emotion detection task into simpler steps, asking the model to classify whether the emotion is positive or negative before identifying the exact feeling. Then, we refine the image the AI sees by removing distracting details and guiding its attention to the parts that matter most for understanding emotion. Our approach makes these AI models better at detecting emotions, faster to use, and easier to apply in real-world situations such as mental health monitoring or content moderation. A minimal illustrative sketch of the coarse-to-fine idea appears below.
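To make the coarse-to-fine idea concrete, the sketch below shows a minimal, training-free two-stage query loop in the spirit of the description above: a coarse positive/negative decision first, then a fine-grained choice restricted to that polarity when the coarse stage is confident. The function `mllm_classify`, the label groupings, and the confidence threshold are hypothetical placeholders for illustration only; they are not the authors' SEPM implementation (see the linked repository for the real code).

```python
# Illustrative sketch only: two-stage (coarse-to-fine) emotion querying.
# All names, labels, and thresholds below are assumptions, not SEPM's code.

COARSE_LABELS = ["positive", "negative"]          # assumed coarse polarity split
FINE_LABELS = {
    "positive": ["joy", "excitement", "contentment", "awe"],
    "negative": ["sadness", "anger", "fear", "disgust"],
}
CONFIDENCE_THRESHOLD = 0.6  # assumed cutoff for trusting the coarse stage


def mllm_classify(image, prompt, candidates):
    """Placeholder for an MLLM call that scores each candidate label.

    Expected to return a dict mapping each candidate label to a probability,
    e.g. derived from the model's token log-probabilities."""
    raise NotImplementedError


def coarse_to_fine_emotion(image):
    # Stage 1 (coarse): is the overall emotion positive or negative?
    coarse_scores = mllm_classify(
        image, "Is the emotion in this image positive or negative?", COARSE_LABELS
    )
    polarity, confidence = max(coarse_scores.items(), key=lambda kv: kv[1])

    # If the coarse decision is confident, restrict stage 2 to that polarity's
    # fine-grained labels; otherwise fall back to the full label set.
    if confidence >= CONFIDENCE_THRESHOLD:
        candidates = FINE_LABELS[polarity]
    else:
        candidates = FINE_LABELS["positive"] + FINE_LABELS["negative"]

    # Stage 2 (fine): pick the exact emotion from the narrowed candidate set.
    fine_scores = mllm_classify(
        image, "Which emotion best describes this image?", candidates
    )
    return max(fine_scores.items(), key=lambda kv: kv[1])[0]
```

Narrowing the candidate set before the fine-grained query is what reduces confusion between semantically similar emotions, since the model only has to separate emotions within one polarity group at a time.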
Link To Code: https://github.com/fuyyyyy/SEPM
Primary Area: Deep Learning->Large Language Models
Keywords: Multimodal Large Language Models, Emotion Recognition, Training-Free
Submission Number: 2164