Omni-Perception Policy Optimization for Multimodal Emotion Reasoning

Zhiyuan Han; Beier Zhu; Wenwen Tong; Pengyang Shao; Peipei Song; Xinyi Wang; Jiangnan Chen; Lewei Lu; Xun Yang

Published: 30 Apr 2026; Last Modified: 24 Jun 2026; ICML 2026 regular; CC BY 4.0

TL;DR: We improve multimodal emotion reasoning by teaching omni-modal AI models to use visual and acoustic evidence more completely and faithfully, reducing cue underuse and cross-modal hallucination.

Abstract: We find that current emotion-oriented Omni-MLLMs still lack *reliable omni-modal perception*: they (i) underutilize multimodal cues in their reasoning trajectories and (ii) exhibit unfaithful behavior, often hallucinating modality-specific statements from other modalities. Building on these insights, we propose **OPPO** (**O**mni-**P**erception **P**olicy **O**ptimization), a reinforcement learning framework that explicitly optimizes multimodal perception. First, an Omni-Perception Reward decomposes ground-truth reasoning into fine-grained visual, acoustic, and emotion cues and rewards trajectories that semantically recover these cues. Second, an Omni-Perception Loss compares the policy under full and unimodally masked inputs, applying a KL penalty only to modality-specific evidence tokens to suppress cross-modal hallucination. We further introduce *MEP-Bench*, a diagnostic benchmark that quantifies *utilization* and *faithfulness*. Experiments show that OPPO achieves state-of-the-art performance on MER-UniBench and substantially improves utilization and faithfulness scores on MEP-Bench, highlighting the importance of sufficient and faithful omni perception for multimodal emotion reasoning.
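The two components named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the function names (`cue_recovery_reward`, `masked_kl_penalty`), the substring matcher standing in for semantic matching, the toy probability vectors, and the choice of KL direction are all assumptions made for illustration. How the KL term enters the training objective (its sign and weight) is not specified in the abstract and is left open here.

```python
import math

def cue_recovery_reward(gt_cues, recovered):
    """Fraction of ground-truth cues that a reasoning trajectory recovers.

    gt_cues:   dict mapping a cue type ("visual", "acoustic", "emotion")
               to a list of cue strings (hypothetical data layout).
    recovered: callable(cue) -> bool; a stand-in for a semantic matcher.
    """
    cues = [cue for group in gt_cues.values() for cue in group]
    if not cues:
        return 0.0
    return sum(1 for cue in cues if recovered(cue)) / len(cues)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def masked_kl_penalty(full_probs, masked_probs, evidence_mask):
    """Mean KL(full || masked) restricted to modality-specific evidence tokens.

    full_probs / masked_probs: per-token next-token distributions from the
    policy under the full input and under an input with one modality masked.
    evidence_mask: per-token booleans, True where the token states
    modality-specific evidence. Tokens with False are ignored.
    """
    terms = [
        kl_divergence(p, q)
        for p, q, is_evidence in zip(full_probs, masked_probs, evidence_mask)
        if is_evidence
    ]
    return sum(terms) / len(terms) if terms else 0.0

# Toy usage with made-up cues and distributions (assumptions, not paper data).
gt = {
    "visual": ["furrowed brow"],
    "acoustic": ["raised voice"],
    "emotion": ["anger"],
}
trajectory = "the speaker has a furrowed brow and shows anger"
reward = cue_recovery_reward(gt, lambda cue: cue in trajectory)

full = [[0.9, 0.1], [0.5, 0.5]]
masked = [[0.5, 0.5], [0.5, 0.5]]
penalty = masked_kl_penalty(full, masked, [True, False])
```

In this toy case the trajectory recovers two of the three cues, and only the first token contributes to the KL term because it is the only one flagged as evidence. A real system would replace the substring check with a learned or LLM-based semantic matcher and would obtain the per-token distributions from the policy model.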

Lay Summary: AI systems that understand videos, voices, and text can be used to recognize human emotions, but they may still make unreliable judgments. For example, a model may focus only on tone of voice while ignoring facial expressions, or claim that a visual cue exists simply because the audio sounds emotional. This paper studies these issues and introduces a benchmark to measure whether models use relevant visual and audio evidence faithfully. We then propose OPPO, a training method that encourages models to cover more emotion-related cues and reduces unsupported claims about missing or unclear modalities. Experiments show that OPPO improves both emotion recognition and the reliability of model explanations.

Primary Area: Applications

Keywords: Multimodal Emotion Reasoning, Multimodal Large Language Model, Reinforcement Learning

Submission Number: 926
