Keywords: high-quality dataset, multimodal dataset, interleaved image-text synergy, interleaved evaluation
Abstract: Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation.
However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality, and instructional richness of current training datasets.
To address this, we introduce \textbf{InterSyn}, a dataset that features:
(1) large scale, comprising 1.8M multimodal samples;
(2) high quality, supported by our proposed \textbf{Self-Evaluation with Iterative Refinement (SEIR)} method for rigorous automated quality refinement;
(3) rich instructional diversity, ensured through well-designed question templates grounded in human preferences and covering a 3,500-topic hierarchy.
These characteristics make InterSyn particularly well-suited for training LMMs in interleaved image–text generation.
To evaluate these capabilities, we propose \textbf{SynJudge}, a reliable automatic evaluator that aligns closely with human judgments and outputs four interpretable scores: Text Content Completeness (TCC), Image Content Completeness (ICC), Image Quality (IQ), and Image–Text Synergy (ITS).
These scores are complementary, covering content completeness, image quality, and cross-modal interaction, and together form a comprehensive evaluation framework.
Experimental results on InterSyn subsets of up to 200K samples show that subsets of 25K–50K samples already yield substantial improvements, while scaling to 100K and 200K samples brings further gains in TCC, ICC, and especially ITS, highlighting InterSyn’s:
(1) scalability, as performance consistently improves with more data;
(2) efficiency, as significant gains are achievable even with smaller subsets, making it accessible to researchers with varying computational resources.
Primary Area: datasets and benchmarks
Submission Number: 17265