Keywords: Image–Caption Evaluation; MLLMs; CoT; Data Filtering; Semantic Unit Decomposition
TL;DR: We propose Chain-of-Atoms (CoA), a fine-grained framework for evaluating and filtering multimodal datasets, along with a corresponding data generation method that enables more efficient and robust MLLM training on a smaller but cleaner dataset.
Abstract: In recent years, Multimodal Large Language Models (MLLMs) have achieved remarkable progress across a wide range of domains, largely benefiting from the availability of large-scale multimodal datasets, particularly image-caption corpora. Nevertheless, the community has long lacked a universal, standardized data quality assessment framework designed specifically for such corpora. In this paper, we propose the Chain-of-Atoms (CoA) evaluation framework along with a corresponding Bottom2Up data sampling strategy. CoA decomposes both captions and images into minimal information units and computes precision and recall as objective sub-metrics. By dynamically reweighting these sub-metrics, we introduce a style-adaptive $F_1$ (SAF1) metric that correlates more closely with human preference. To strengthen semantic decomposition capability, we apply the proposed Bottom2Up strategy to construct a balanced, large-scale training dataset. We also establish CoA Bench, a standardized benchmark for fine-grained image-caption evaluation. Experimental results on CoA Bench and other downstream tasks demonstrate that CoA effectively filters noisy training samples and significantly improves the robustness and training efficiency of MLLMs. Specifically, CoA-based data filtering during MLLM pre-training reduces the training data by 81.5% without performance degradation.
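For intuition, SAF1 can be read as a weighted F-measure over atom-level precision and recall. The sketch below assumes the standard $F_\beta$ form with a style-dependent weight; the symbols $\mathcal{A}_c$, $\mathcal{A}_v$, and $\beta(s)$ are illustrative notation, not the paper's exact formulation.

```latex
% Illustrative sketch only: assumes SAF1 takes the standard weighted
% F_beta form; the paper's actual dynamic reweighting may differ.
% A_c: atomic information units decomposed from the caption.
% A_v: atomic information units decomposed from the image (visual).
\[
  P = \frac{|\mathcal{A}_c \cap \mathcal{A}_v|}{|\mathcal{A}_c|},
  \qquad
  R = \frac{|\mathcal{A}_c \cap \mathcal{A}_v|}{|\mathcal{A}_v|}
\]
% beta(s) adapts the precision-recall trade-off to the caption style s
% (e.g., beta(s) > 1 favors recall for dense, descriptive captions).
\[
  \mathrm{SAF1} = \frac{\bigl(1 + \beta(s)^{2}\bigr)\, P\, R}{\beta(s)^{2}\, P + R}
\]
```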
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3628