Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLMs

Published: 10 Jun 2025 · Last Modified: 13 Jul 2025 · DIG-BUG Oral · CC BY 4.0
Keywords: Data Contamination, Data Memorization
TL;DR: We conducted the first systematic study on data contamination in multimodal LLMs.
Abstract: The rapid advancement of multimodal large language models (MLLMs) has significantly enhanced performance across benchmarks. However, data contamination, the unintentional memorization of benchmark data during model training, poses critical challenges for fair evaluation. Existing detection methods for unimodal large language models (LLMs) are inadequate for MLLMs because of the complexity of multimodal data and the multi-phase training pipeline. We systematically analyze multimodal data contamination using our analytical framework, MM-DETECT, which defines two contamination categories, unimodal and cross-modal, and quantifies contamination severity on multiple-choice and caption-based Visual Question Answering tasks. Evaluations of twelve MLLMs on five benchmarks reveal significant contamination, particularly in proprietary models and on older benchmarks. Crucially, contamination sometimes originates during unimodal pre-training rather than solely from multimodal fine-tuning. These findings refine the understanding of contamination, inform evaluation practice, and improve the reliability of multimodal models.
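To make the multiple-choice probe concrete, below is a minimal sketch of a perturbation-based contamination test in the spirit of the framework described above; it is an illustration, not the paper's released code. The `query_model` callable, the item fields, and the `n_shuffles` parameter are all hypothetical. The intuition: a model that answers correctly only under the benchmark's original option order has likely memorized the item rather than solved it.

```python
import random
from typing import Callable, Dict, List


def option_order_sensitivity(
    image: object,
    question: str,
    options: List[str],
    correct_idx: int,
    query_model: Callable[[object, str, List[str]], int],
    n_shuffles: int = 5,
    seed: int = 0,
) -> Dict[str, float]:
    """Contamination probe for one multiple-choice VQA item.

    Compares accuracy under the benchmark's original option order
    against accuracy under random permutations of the options. A model
    that is correct only in the original order may have memorized the
    item rather than reasoned about it.
    """
    rng = random.Random(seed)

    # Answer with the original (possibly memorized) option order.
    original_correct = query_model(image, question, options) == correct_idx

    # Answer under several random permutations of the options.
    shuffled_hits = 0
    for _ in range(n_shuffles):
        perm = list(range(len(options)))
        rng.shuffle(perm)
        shuffled_options = [options[i] for i in perm]
        new_correct_idx = perm.index(correct_idx)  # where the answer moved
        if query_model(image, question, shuffled_options) == new_correct_idx:
            shuffled_hits += 1

    shuffled_acc = shuffled_hits / n_shuffles
    return {
        "original_correct": float(original_correct),
        "shuffled_accuracy": shuffled_acc,
        # A positive gap (original order helps) is one contamination signal.
        "sensitivity_gap": float(original_correct) - shuffled_acc,
    }
```

In a full pipeline, the per-item sensitivity gap would be aggregated across a benchmark; a systematically positive gap is one contamination signal, to be read alongside analogous checks for caption-based tasks and for the image modality.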
Submission Number: 21