MLLM-CTBench: A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis

ACL ARR 2025 May Submission 7886 Authors

20 May 2025 (modified: 03 Jul 2025) · CC BY 4.0
Abstract: Multimodal large language models (MLLMs) rely on continual instruction tuning to adapt to evolving real-world demands, yet progress is hindered by the lack of rigorous benchmarks. We present \textbf{MLLM-CTBench}, a comprehensive benchmark with three novel contributions: 1) \textbf{Competence-driven task curation} spanning six challenging domains and comprising about 70K examples from 17 datasets; 2) \textbf{Multidimensional evaluation} combining final-answer accuracy with granular Chain-of-Thought (CoT) reasoning analysis, enabled by a specially trained CoT evaluator; 3) \textbf{Comprehensive algorithmic investigations} covering eight continual learning algorithms across four categories, as well as the reinforcement learning and supervised fine-tuning paradigms. Key findings include: i) Pre-training quality inversely correlates with forgetting susceptibility (e.g., LLaVA-1.5 shows 2× higher forgetting than Qwen2.5-VL); ii) Reasoning chains degrade more slowly than final answers, supporting the hierarchical forgetting hypothesis; iii) Replay excels for weaker MLLMs but plateaus for robust models. MLLM-CTBench establishes rigorous standards for continual instruction tuning of MLLMs, offering actionable insights for algorithm design and evaluation.
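The forgetting comparison in finding (i) presupposes a forgetting metric for models trained sequentially on tasks. The abstract does not define MLLM-CTBench's exact formula, so the sketch below is only an illustrative assumption: it computes the standard average-forgetting measure from continual-learning evaluation (drop from each task's best earlier accuracy to its accuracy after the final task).

```python
# Minimal sketch of a standard average-forgetting metric for sequential
# (continual) instruction tuning. Assumption: MLLM-CTBench's actual metric
# is not specified in the abstract; this is the common textbook definition.
from typing import List

def average_forgetting(acc_matrix: List[List[float]]) -> float:
    """acc_matrix[i][j] = accuracy on task j after training on task i.

    Forgetting for task j is the drop from its best accuracy at any
    earlier training stage to its accuracy after the final task; the
    metric averages over all tasks except the last one.
    """
    num_tasks = len(acc_matrix)
    final_row = acc_matrix[-1]
    drops = []
    for j in range(num_tasks - 1):
        best_earlier = max(acc_matrix[i][j] for i in range(j, num_tasks - 1))
        drops.append(best_earlier - final_row[j])
    return sum(drops) / len(drops)

# Hypothetical example: three tasks learned in sequence; task 0 degrades.
acc = [
    [0.80, 0.00, 0.00],
    [0.70, 0.75, 0.00],
    [0.60, 0.72, 0.78],
]
print(f"Average forgetting: {average_forgetting(acc):.3f}")  # 0.115
```

Under this definition, a "2× higher forgetting" claim would correspond to one model's average drop being twice the other's over the same task sequence.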
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding, Resources and Evaluation
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 7886