Keywords: Omni Reward Modeling, Reinforcement Learning
TL;DR: This paper presents Omni-RRM, a reasoning-augmented reward model, and Omni-Preference, a preference dataset built by a fully automated pipeline; together they enable state-of-the-art preference modeling across text, images, video, and audio without manual annotation.
Abstract: Multimodal large language models (MLLMs) have shown remarkable capabilities, but their safe deployment is hindered by alignment failures. A critical bottleneck is the lack of effective reward models (RMs): existing RMs are typically limited to vision, provide only opaque scalar scores, and depend on costly human annotations. To address these challenges, we introduce Omni-RRM, the first open-source, reasoning-driven reward model that provides explainable preference judgments across text, image, video, and audio. At the core of our approach is Omni-Preference, a novel, large-scale dataset constructed through a fully automated pipeline: we generate preference pairs by contrasting models of varying capabilities and then enrich them with multi-criteria, chain-of-thought rationales from powerful teacher models, completely eliminating the need for human labeling. Omni-RRM is trained in a two-stage process: supervised fine-tuning to instill the ability to generate structured rationales, followed by reinforcement learning to sharpen its judgment on difficult, low-contrast examples. Comprehensive evaluations show that Omni-RRM achieves state-of-the-art accuracy on video (80.2% on ShareGPT-V) and audio (66.8% on Audio-HH) benchmarks, and significantly outperforms existing open-source RMs on image tasks, improving on its base model by 17.7% overall. Furthermore, Omni-RRM demonstrates strong generalization, boosting downstream task performance via Best-of-N selection and even improving accuracy on text-only preference benchmarks. Our data, code, and models are available at https://anonymous.4open.science/r/Omni-RRM-CC08
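To make the automated data pipeline described above concrete, here is a minimal Python sketch under assumed interfaces: `strong_model`, `weak_model`, `teacher_rationale`, and `PreferencePair` are hypothetical placeholders introduced for illustration only, not the released Omni-Preference tooling.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # response from the more capable model
    rejected: str    # response from the less capable model
    rationale: str   # teacher-written, multi-criteria chain-of-thought


def strong_model(prompt: str) -> str:
    """Placeholder for a more capable response generator."""
    return f"[strong response to] {prompt}"


def weak_model(prompt: str) -> str:
    """Placeholder for a less capable response generator."""
    return f"[weak response to] {prompt}"


def teacher_rationale(prompt: str, chosen: str, rejected: str) -> str:
    """Placeholder for a teacher model that explains why chosen > rejected."""
    return "criteria: relevance, faithfulness, completeness; chosen wins on all."


def build_pair(prompt: str) -> PreferencePair:
    """Contrast models of differing capability, then attach a teacher rationale."""
    chosen, rejected = strong_model(prompt), weak_model(prompt)
    rationale = teacher_rationale(prompt, chosen, rejected)
    return PreferencePair(prompt, chosen, rejected, rationale)
```

The key design choice, per the abstract, is that the capability gap between the two generators supplies the preference label for free, so no human annotator is ever in the loop.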
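Similarly, a rough sketch of the Best-of-N selection mentioned among the downstream results: `policy_generate` and `rm_prefers_first` are hypothetical stand-ins for the policy sampler and the pairwise judgment exposed by a reward model like Omni-RRM (rationale output omitted), and the sequential knockout below is just one simple way to reduce N candidates with a pairwise judge, not necessarily the authors' procedure.

```python
import random


def policy_generate(prompt: str) -> str:
    """Placeholder for sampling one response from the policy model."""
    return f"response-{random.randint(0, 1_000_000)} to: {prompt}"


def rm_prefers_first(prompt: str, a: str, b: str) -> bool:
    """Placeholder for a pairwise preference judgment between responses a and b."""
    return random.random() < 0.5


def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n candidates and keep the winner of sequential pairwise comparisons."""
    candidates = [policy_generate(prompt) for _ in range(n)]
    winner = candidates[0]
    for challenger in candidates[1:]:
        # Keep whichever response the reward model judges as better.
        if not rm_prefers_first(prompt, winner, challenger):
            winner = challenger
    return winner


if __name__ == "__main__":
    print(best_of_n("Describe the image."))
```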
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10055