Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs

Published: 18 Sept 2025, Last Modified: 29 Oct 2025. NeurIPS 2025 poster. License: CC BY 4.0
Keywords: Multi-modal Large Language Models (MLLMs), Direct Preference Optimization, Hallucination Mitigation
Abstract: Multi-modal Large Language Models (MLLMs) excel at single-image tasks but struggle with multi-image understanding due to cross-modal misalignment, leading to hallucinations (context omission, conflation, and misinterpretation). Existing methods based on Direct Preference Optimization (DPO) constrain optimization to a single image reference within the input sequence, neglecting holistic context modeling. To address this, we propose Context-to-Cue Direct Preference Optimization (CcDPO), a multi-level preference optimization framework that enhances per-image perception in multi-image settings by zooming in on visual cues, from sequential context to local details. Our approach features two sequentially dependent components: (i) Context-Level Optimization: by introducing low-cost sequence preference pairs, we optimize the model to distinguish between complete and disrupted multi-image contexts, thereby correcting cognitive biases in MLLMs' multi-image understanding. (ii) Needle-Level Optimization: by integrating region-specific visual prompts with multimodal preference supervision, we direct the model's attention to critical visual details, effectively suppressing perceptual biases toward fine-grained visual information. To support scalable optimization, we also construct MultiScope-42k, an automatically generated multi-image dataset with hierarchical preference pairs. Experiments show that CcDPO significantly reduces hallucinations and yields consistent performance gains across general single- and multi-image tasks. Code is available at https://github.com/LXDxmu/CcDPO.
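To make the shape of the objective concrete, below is a minimal sketch assuming the standard DPO loss and a simple additive combination of the two levels. The function `ccdpo_loss`, the weight `lam`, and the random toy inputs are illustrative assumptions, not the paper's exact formulation; see the linked repository for the authors' code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO objective: push the policy's log-prob margin between the
    # chosen and rejected responses above the frozen reference model's margin.
    # Each argument is a 1-D tensor of summed token log-probabilities.
    margin = (pol_chosen - pol_rejected) - (ref_chosen - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

def ccdpo_loss(ctx_pairs, needle_pairs, beta=0.1, lam=1.0):
    # Hypothetical two-level combination: a context-level DPO term over
    # complete vs. disrupted image-sequence contexts, plus a needle-level
    # term over region-prompted fine-grained pairs. The additive weighting
    # via `lam` is an assumption for illustration only.
    return dpo_loss(*ctx_pairs, beta=beta) + lam * dpo_loss(*needle_pairs, beta=beta)

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
ctx_pairs = tuple(torch.randn(4) for _ in range(4))
needle_pairs = tuple(torch.randn(4) for _ in range(4))
print(ccdpo_loss(ctx_pairs, needle_pairs).item())
```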
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 6220