GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning

ICLR 2026 Conference Submission 14868 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: multimodal large language model; video understanding; post-training
Abstract: Recent reinforcement learning (RL) approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) remains underexplored. Progress has been further limited by the lack of evaluation settings that jointly test perception and reasoning under controlled generalization challenges. To enable such analysis, we reorganize prior benchmarks featuring complex real-world videos that demand intricate visual understanding and commonsense planning into **SEED-Bench-R1**, a structured testbed with large-scale training data and hierarchical evaluation across in-distribution, cross-environment, and cross-environment-task scenarios. Using this setting, we conduct a systematic experimental analysis of post-training methods, which reveals a key limitation of outcome-supervised GRPO: while it improves answer accuracy, it often compromises the logical coherence between reasoning and final answers, yielding only a 57.9% consistency rate. This stems from optimizing exclusively for final-answer rewards, which encourages shortcuts, and from rigid KL divergence penalties, which overly constrain adaptive reasoning. To address these issues, we propose **GRPO-CARE**, a novel consistency-aware RL framework that jointly optimizes correctness and coherence without requiring explicit process supervision. GRPO-CARE introduces a two-tiered reward: (1) a base reward for accuracy, and (2) an adaptive consistency bonus derived from a slowly evolving reference model that calibrates reasoning-to-answer likelihoods within peer groups. This mechanism rewards reasoning paths that are both correct and logically consistent, while removing the rigid constraints imposed by KL penalties. Experiments on SEED-Bench-R1 show that GRPO-CARE consistently outperforms standard GRPO, achieving a 6.7% gain on the hardest evaluation level and a 24.5% increase in reasoning consistency. Moreover, models trained with GRPO-CARE transfer effectively to diverse video understanding and even language-only reasoning benchmarks, highlighting its robustness and generality.
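
To make the two-tiered reward described in the abstract concrete, the sketch below illustrates one plausible reading: a base reward for answer correctness plus a consistency bonus that a slowly evolving reference model assigns to correct samples whose reasoning makes the answer comparatively likely within the sampled peer group. This is a minimal illustration only, not the authors' implementation; all names (`Sample`, `care_rewards`, `ref_likelihood`, `bonus_weight`) are hypothetical, and the group-median calibration stands in for whatever calibration rule the paper actually uses.

```python
# Hypothetical sketch of a GRPO-CARE-style two-tiered reward, inferred from the
# abstract alone. Names and the calibration rule are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    reasoning: str   # chain-of-thought produced by the policy
    answer: str      # final answer extracted from the response
    correct: bool    # whether the answer matches the ground truth


def care_rewards(
    group: List[Sample],
    ref_likelihood: Callable[[str, str], float],
    bonus_weight: float = 0.5,
) -> List[float]:
    """Two-tiered reward: base accuracy plus a group-calibrated consistency bonus.

    `ref_likelihood(reasoning, answer)` is assumed to return the slowly evolving
    reference model's likelihood of the answer conditioned on the reasoning.
    """
    # Tier 1: base reward for final-answer correctness.
    base = [1.0 if s.correct else 0.0 for s in group]

    # Reference-model likelihood of each sample's answer given its own reasoning.
    liks = [ref_likelihood(s.reasoning, s.answer) for s in group]

    # Tier 2: calibrate within the peer group. Here (as an assumption) a bonus
    # goes to correct samples whose reasoning makes the answer more likely than
    # the group median, i.e. reasoning that is unusually consistent with the answer.
    median = sorted(liks)[len(liks) // 2]
    bonus = [
        bonus_weight if (s.correct and lik > median) else 0.0
        for s, lik in zip(group, liks)
    ]

    return [b + c for b, c in zip(base, bonus)]
```

Under this reading, the bonus can only raise the reward of already-correct samples, so coherence is encouraged without trading off accuracy, and the calibration against peers replaces a fixed KL penalty as the mechanism that keeps reasoning anchored to the reference model.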
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 14868