Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development

Daoyuan Chen; Haibin Wang; Yilun Huang; Ce Ge; Yaliang Li; Bolin Ding; Jingren Zhou

27 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: Data Processing, Multimodal Generative Models, Sandbox

TL;DR: We introduce a sandbox suite for data-model co-development that boosts multimodal generative models and provides fruitful insights into data impact on model performance.

Abstract: The emergence of large-scale multimodal generative models has drastically advanced artificial intelligence, introducing unprecedented levels of performance and functionality. However, optimizing these models remains challenging because model-centric and data-centric development have historically followed isolated paths, leading to suboptimal outcomes and inefficient resource utilization. In response, we present a novel sandbox suite tailored for integrated data-model co-development. This sandbox provides a comprehensive experimental platform, enabling rapid iteration and insight-driven refinement of both data and models. Our proposed "Probe-Analyze-Refine" workflow, validated through applications on image-to-text and text-to-video tasks with state-of-the-art LLaVA-like and DiT-based models, yields significant performance boosts, such as topping the VBench leaderboard. We also uncover fruitful insights gleaned from exhaustive benchmarks, shedding light on the critical interplay between data quality, diversity, and model behavior. All code, datasets, and models are openly accessible and continuously maintained to foster future progress.

Primary Area: other topics in machine learning (i.e., none of the above)

Submission Number: 8902
