Keywords: Survey, Alignment, MLLM, DPO
Abstract: Multimodal large language models (MLLMs) have demonstrated substantial capability on complex tasks that integrate visual, auditory, and textual modalities. Nevertheless, they still show notable shortcomings in truthfulness, safety, and alignment with human preferences, which has motivated intensive research on alignment algorithms tailored specifically to MLLMs.
This paper provides a systematic and comprehensive survey of MLLM alignment, structured along three principal dimensions: (i) the algorithmic pipeline and technical procedures underlying alignment methods, (ii) the application domains and usage scenarios for which these methods are designed, and (iii) the evaluation methodologies employed to measure alignment quality. The objective of this work is to furnish researchers with a coherent framework for situating recent advances in MLLM alignment and to facilitate the development of more robust, reliable, and human-aligned multimodal systems.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: DPO, MLLM, Alignment, Survey, Alignment with Human Preference
Contribution Types: Surveys, Theory
Languages Studied: English, Chinese
Submission Number: 692