Multimodal LLM Alignment: Challenges, Solutions, and Research Opportunities

ACL ARR 2026 January Submission 692 Authors

24 Dec 2025 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Survey, Alignment, MLLM, DPO

Abstract: Multimodal large language models (MLLMs) have demonstrated substantial capability on complex tasks that integrate visual, auditory, and textual modalities. Nevertheless, they still show notable shortcomings in truthfulness, safety, and alignment with human preferences, which has motivated intensive research on alignment algorithms tailored specifically to MLLMs. This paper presents a systematic survey of MLLM alignment, structured along three principal dimensions: (i) the algorithmic pipeline and technical procedures underlying alignment methods, (ii) the application domains and usage scenarios for which these methods are designed, and (iii) the evaluation methodologies used to measure alignment quality. The goal is to give researchers a coherent framework for situating recent advances in MLLM alignment and to support the development of more robust, reliable, and human-aligned multimodal systems.

Paper Type: Long

Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond

Research Area Keywords: DPO, MLLM, Alignment, Survey, Alignment with Human Preference

Contribution Types: Surveys, Theory

Languages Studied: English, Chinese

Submission Number: 692
