MLLM-CTBench: A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis

ACL ARR 2025 May Submission 7886 Authors

20 May 2025 (modified: 03 Jul 2025) · CC BY 4.0
Abstract: Multimodal large language models (MLLMs) rely on continual instruction tuning to adapt to evolving real-world demands, yet progress is hindered by the lack of rigorous benchmarks. We present \textbf{MLLM-CTBench}, a comprehensive benchmark with three novel contributions: 1) \textbf{Competence-driven task curation} spanning six challenging domains and comprising about 70K examples from 17 datasets; 2) \textbf{Multidimensional evaluation} combining final-answer accuracy with granular Chain-of-Thought (CoT) reasoning analysis, enabled by a specially trained CoT evaluator; 3) \textbf{Comprehensive algorithmic investigations} covering eight continual learning algorithms across four categories, as well as the reinforcement learning and supervised fine-tuning paradigms. Key findings include: i) Pre-training quality inversely correlates with forgetting susceptibility (e.g., LLaVA-1.5 shows 2× higher forgetting than Qwen2.5-VL); ii) Reasoning chains degrade more slowly than final answers, supporting the hierarchical forgetting hypothesis; iii) Replay excels for weaker MLLMs but plateaus for robust models. MLLM-CTBench establishes rigorous standards for continual instruction tuning of MLLMs, offering actionable insights for algorithm design and evaluation.
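The forgetting comparison in finding (i) presupposes a forgetting metric for models trained sequentially on tasks. The abstract does not define MLLM-CTBench's exact formula, so the sketch below is only an illustrative assumption: it computes the standard average-forgetting measure from continual-learning evaluation (drop from each task's best earlier accuracy to its accuracy after the final task).

```python
# Minimal sketch of a standard average-forgetting metric for sequential
# (continual) instruction tuning. Assumption: MLLM-CTBench's actual metric
# is not specified in the abstract; this is the common textbook definition.
from typing import List

def average_forgetting(acc_matrix: List[List[float]]) -> float:
    """acc_matrix[i][j] = accuracy on task j after training on task i.

    Forgetting for task j is the drop from its best accuracy at any
    earlier training stage to its accuracy after the final task; the
    metric averages over all tasks except the last one.
    """
    num_tasks = len(acc_matrix)
    final_row = acc_matrix[-1]
    drops = []
    for j in range(num_tasks - 1):
        best_earlier = max(acc_matrix[i][j] for i in range(j, num_tasks - 1))
        drops.append(best_earlier - final_row[j])
    return sum(drops) / len(drops)

# Hypothetical example: three tasks learned in sequence; task 0 degrades.
acc = [
    [0.80, 0.00, 0.00],
    [0.70, 0.75, 0.00],
    [0.60, 0.72, 0.78],
]
print(f"Average forgetting: {average_forgetting(acc):.3f}")  # 0.115
```

Under this definition, a "2× higher forgetting" claim would correspond to one model's average drop being twice the other's over the same task sequence.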
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding, Resources and Evaluation
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 7886