Beyond Modality Collapse: Taming Guided Modality Entropy for Omni-modal Emotion Reasoning

ACL ARR 2026 January Submission8732 Authors

06 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Omni-modal Large Language Models, Emotion Recognition and Reasoning
Abstract: Omni-modal Large Language Models (OLLMs) excel at diverse tasks but struggle with complex emotional reasoning, which requires integrating textual, visual, and acoustic signals. We attribute this limitation to modality collapse, where models over-rely on a dominant modality while neglecting complementary cues. To address this issue, we introduce OmniCoT, a data paradigm that interleaves Guided Tokens (GTs, e.g., [vision], [audio]) into reasoning traces to enforce structured evidence extraction. To further internalize the reasoning behaviors instilled by OmniCoT and facilitate adaptive modality prioritization, we propose Dynamic Modality-Entropy GRPO (DyME-GRPO), which uses entropy-based uncertainty estimates over GTs to regulate modality usage, thereby mitigating both collapse and informational redundancy. By applying supervised fine-tuning with OmniCoT followed by DyME-GRPO, we develop EmoOmni on the Qwen2.5-Omni-7B backbone. Extensive experiments demonstrate that EmoOmni achieves state-of-the-art performance on multiple emotion recognition and reasoning benchmarks while preserving the general capabilities of the base model. These findings highlight the potential of our approach for omni-modal reasoning across a broader range of complex tasks.
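The core signal behind DyME-GRPO, as described above, is an entropy-based uncertainty estimate over Guided Tokens. A minimal sketch of that idea follows; the token names, the normalization, and the penalty shape are illustrative assumptions, not the paper's actual DyME-GRPO formulation.

```python
import math

# Hypothetical set of Guided Tokens; the paper's abstract names [vision]
# and [audio] as examples, [text] is assumed here for illustration.
GUIDED_TOKENS = ["[vision]", "[audio]", "[text]"]

def guided_token_entropy(probs):
    """Shannon entropy (nats) of a distribution over Guided Tokens."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)

def collapse_penalty(probs, weight=1.0):
    """Sketch of an entropy-based regularizer (assumed, not the paper's).

    Low entropy means one modality dominates (modality collapse), so the
    penalty grows; it is normalized to lie in [0, weight].
    """
    max_entropy = math.log(len(probs))  # entropy of the uniform case
    return weight * (1.0 - guided_token_entropy(probs) / max_entropy)

# A collapsed distribution (vision dominates) is penalized far more than
# a balanced one.
collapsed = {"[vision]": 0.98, "[audio]": 0.01, "[text]": 0.01}
balanced = {t: 1.0 / 3.0 for t in GUIDED_TOKENS}
print(collapse_penalty(collapsed) > collapse_penalty(balanced))  # True
```

In a GRPO-style setup, such a penalty would be one term in the sequence-level reward, discouraging policies that route all reasoning through a single modality's Guided Tokens.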
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality, Cross-modal Application, Speech and Vision
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 8732