Calibrated Self-Verification for Multimodal LLMs via Advantage-Decoupled Preference Optimization

15 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: multimodal, reinforcement learning
Abstract: Recent advances in multimodal large language models (LLMs) have been driven by serial inference-time scaling, which generates longer reasoning traces at test time but encounters performance bottlenecks on multimodal tasks such as visual grounding and GUI agents. Parallel inference-time scaling has therefore emerged as an alternative: multiple candidate solutions are generated in parallel and the best one is selected. However, existing methods focus on training either generators or verifiers in isolation, which limits performance improvements. We propose \textbf{ADPO}, Advantage-Decoupled Preference Optimization, an RL framework that trains a unified policy to generate answers and self-verify their quality via preference rewards and decoupled advantages, simultaneously improving the model's generation and verification capabilities. To strengthen verification, we introduce preference rewards that use discrete group-adaptive ranking for binary outcomes and margin-based pairwise comparisons for continuous signals, yielding more stable learning and better-calibrated confidence scores. We find that jointly training generation and verification creates gradient interference, leading to suboptimal performance on both tasks. To address this conflict, we introduce decoupled optimization with separate advantages and cross-task loss masking, which improves both generation and verification; ablation studies show a \textbf{+0.03} average improvement in verification AUC/AP metrics. Benefiting from our preference rewards and decoupled optimization, our method achieves superior performance on multimodal math reasoning, image grounding, and GUI agent tasks, with improvements of \textbf{+2.6\%/+3.9\%} on MathVista/MMMU, \textbf{+1.8\%/+2.0\%} gIoU/cIoU on ReasonSeg, and \textbf{+2.3\%/+0.7\%} grounding accuracy with \textbf{+1.4\%/+0.7\%} task success on AndroidControl/GUIOdyssey.
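The sketch below is not from the submission; it is a minimal illustration, under assumed reward and advantage formulas, of the three ingredients the abstract names: a group-adaptive ranking reward for binary outcomes, a margin-based pairwise reward for continuous signals, and per-task (decoupled) advantage normalization with cross-task loss masking. All function names, the toy data, and the exact formulas are hypothetical stand-ins.

```python
# Hypothetical sketch of ADPO-style rewards and decoupled advantages.
# None of the formulas below are taken from the paper; they only mirror the
# abstract's description at a plausible level of detail.
import torch

def binary_ranking_reward(correct: torch.Tensor, conf: torch.Tensor) -> torch.Tensor:
    """Assumed group-adaptive ranking reward for binary outcomes: credit each
    sample's self-reported confidence by how well it ranks correct answers
    above incorrect ones within the sampled group."""
    pos, neg = conf[correct.bool()], conf[~correct.bool()]
    r = torch.zeros_like(conf)
    if len(pos) == 0 or len(neg) == 0:          # degenerate group: no ranking signal
        return r
    # fraction of the opposite class each sample out-ranks (AUC-style credit)
    r[correct.bool()] = (pos[:, None] > neg[None, :]).float().mean(dim=1)
    r[~correct.bool()] = (neg[:, None] < pos[None, :]).float().mean(dim=1)
    return r

def margin_pairwise_reward(score: torch.Tensor, conf: torch.Tensor,
                           margin: float = 0.1) -> torch.Tensor:
    """Assumed margin-based pairwise reward for continuous quality signals:
    credit a sample when its confidence ordering agrees with the score
    ordering by more than `margin`."""
    agree = (score[:, None] - score[None, :]) * (conf[:, None] - conf[None, :]) > margin
    return agree.float().mean(dim=1)

def decoupled_advantages(gen_reward: torch.Tensor, ver_reward: torch.Tensor):
    """Normalize generation and verification rewards separately within the
    group, so neither task's reward scale dominates the other's gradients."""
    norm = lambda r: (r - r.mean()) / (r.std() + 1e-6)
    return norm(gen_reward), norm(ver_reward)

# Toy group of 6 sampled responses: binary correctness, continuous quality,
# and the policy's self-reported verification confidence for each.
correct = torch.tensor([1., 0., 1., 0., 1., 0.])
quality = torch.tensor([0.9, 0.2, 0.8, 0.4, 0.7, 0.1])
conf    = torch.tensor([0.8, 0.3, 0.9, 0.6, 0.5, 0.2])

ver_reward = binary_ranking_reward(correct, conf)        # verification reward (binary case)
ver_reward_cont = margin_pairwise_reward(quality, conf)  # alternative for continuous signals
gen_reward = quality                                     # e.g. task reward for generation
adv_gen, adv_ver = decoupled_advantages(gen_reward, ver_reward)

# Cross-task loss masking (assumed form): generation tokens are weighted only by
# adv_gen and verification tokens only by adv_ver, e.g.
#   loss_i = -(gen_mask * adv_gen[i] + ver_mask * adv_ver[i]) * logp_tokens
print(adv_gen, adv_ver)
```

A design point this sketch tries to make concrete: because the two reward streams are normalized and applied to disjoint token spans, a gradient step on verification confidence does not rescale or cancel the generation advantage, which is one plausible way to realize the "decoupled advantages with cross-task loss masking" the abstract describes.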
Primary Area: reinforcement learning
Submission Number: 6074