MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We introduce MM-RLHF, a dataset of 120k fine-grained human preference pairs, and propose novel methods to significantly improve multimodal large language model alignment, achieving consistent performance gains across 10 evaluation dimensions.
Abstract:

Existing efforts to align multimodal large language models (MLLMs) with human preferences have achieved progress only in narrow areas, such as hallucination reduction, and remain limited in practical applicability and generalizability. To this end, we introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce the Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering greater interpretability and more informative feedback than traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across 10 distinct dimensions, encompassing 27 benchmarks, with results demonstrating significant and consistent improvements in model performance (Figure 1).
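As a rough illustration of the Dynamic Reward Scaling idea described above, the sketch below weights each preference pair's per-sample DPO loss by a bounded function of the reward model's score margin. The function name `mm_dpo_loss`, the `tanh`-based scaling, and the hyperparameter `k` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mm_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                reward_margin, beta=0.1, k=1.0):
    """Sketch of a reward-scaled DPO loss (illustrative, not the exact MM-DPO objective).

    reward_margin: detached score gap (chosen minus rejected) assigned by the
    critique-based reward model to each preference pair, shape (batch,).
    """
    # Standard DPO preference logits from policy / reference log-probabilities.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = pi_logratios - ref_logratios

    # Per-sample DPO loss.
    losses = -F.logsigmoid(beta * logits)

    # Dynamic Reward Scaling (illustrative): pairs with a larger reward margin
    # get a larger, but bounded, weight so high-quality comparisons contribute
    # more to the update.
    weights = 1.0 + k * torch.tanh(reward_margin.detach())
    return (weights * losses).mean()
```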

Lay Summary:

We are proud to open-source MM-RLHF, a comprehensive project for aligning Multimodal Large Language Models (MLLMs) with human preferences. This release includes:

  • A high-quality MLLM alignment dataset (120K samples, created by over 50 experts over two months, with ratings and manual annotations across eight dimensions).
  • A strong Critique-Based MLLM reward model, trained on human annotations, that achieves state-of-the-art (SOTA) performance on public benchmarks (a sketch of the critique-then-score flow appears after this summary).
  • A novel alignment algorithm, MM-DPO, that effectively integrates reward signals to improve the data efficiency of DPO training.
  • Two new benchmarks, one for reward models and one for multimodal safety, addressing gaps in existing evaluations of these areas.

Our dataset and algorithms enable consistent performance improvements across 10 dimensions and 27 benchmarks for open-source MLLMs.
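For readers curious how the critique-based reward model is applied at scoring time, the snippet below sketches the critique-then-score flow mentioned in the summary above. The method names `generate_critique` and `score_from_critique` are hypothetical placeholders, not the released model's actual API.

```python
def score_response(reward_model, image, question, response):
    """Illustrative two-stage scoring with a critique-based reward model.

    The method names below are placeholders; consult the released MM-RLHF
    reward model for its actual interface.
    """
    # Stage 1: generate a free-form critique of the candidate response,
    # conditioned on the visual input and the question.
    critique = reward_model.generate_critique(image, question, response)

    # Stage 2: produce a scalar score conditioned on that critique, so the
    # final reward comes with a human-readable justification.
    score = reward_model.score_from_critique(image, question, response, critique)
    return critique, score
```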

Primary Area: Deep Learning->Foundation Models
Keywords: multimodal large language models, human preferences, alignment with human preference
Submission Number: 11692