Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive potential in handling complex tasks involving visual, auditory, and textual data. However, critical issues related to truthfulness, safety, and alignment with human preferences remain insufficiently addressed. This gap has spurred the emergence of various alignment algorithms, which recent studies have shown to be a powerful approach to these challenges. In this paper, we provide a comprehensive and systematic review of MLLM alignment algorithms. Specifically, we address four critical questions: (1) What application scenarios do existing alignment algorithms cover? (2) How are alignment datasets constructed? (3) How are alignment algorithms evaluated? (4) What are the future directions for the development of alignment algorithms? This work seeks to help researchers organize current advancements in the field and inspire better alignment methods.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodal Large Language Model, MLLM Alignment, Survey, Alignment with Human Preference
Contribution Types: Surveys
Languages Studied: English, Chinese
Submission Number: 268