TL;DR: A new middleware that links data processing with model feedback, achieving high performance at low cost, verified across a broad range of multimodal tasks.
Abstract: The emergence of multimodal large models has advanced artificial intelligence, introducing unprecedented levels of performance and functionality. However, optimizing these models remains challenging due to the historically isolated paths of model-centric and data-centric development, leading to suboptimal outcomes and inefficient resource utilization. In response, we present a new sandbox suite tailored for integrated data-model co-development. This sandbox provides a feedback-driven experimental platform, enabling cost-effective iteration and guided refinement of both data and models. Our proposed "Probe-Analyze-Refine" workflow, validated through practical use cases on multimodal tasks such as image-text pre-training with CLIP, image-to-text generation with LLaVA-like models, and text-to-video generation with DiT-based models, yields transferable and notable performance boosts, such as topping the VBench leaderboard. A comprehensive set of over 100 experiments demonstrates the suite's usability and extensibility, while also uncovering insights into the interplay among data quality, data diversity, model behavior, and computational cost. All code, datasets, and models are open-sourced to foster future research and applications that would otherwise be infeasible due to the lack of a dedicated co-development infrastructure.
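To make the feedback loop concrete, below is a minimal, self-contained Python sketch of a "Probe-Analyze-Refine" style iteration over data recipes. This is a toy illustration, not the Data-Juicer Sandbox API: the names `Recipe`, `probe`, `analyze`, and `refine`, along with the scoring heuristic, are all hypothetical placeholders. In the actual suite, the probe step would train and evaluate a small proxy model on the processed data rather than compute a toy score.

```python
# Hypothetical sketch of a Probe-Analyze-Refine loop over data recipes.
# All names and the scoring heuristic are placeholders for illustration only;
# they are not the Data-Juicer Sandbox API.

import random
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Recipe:
    """A toy 'data recipe': per-operator filter thresholds."""
    thresholds: Dict[str, float] = field(
        default_factory=lambda: {"text_quality": 0.3, "image_text_match": 0.3}
    )


def probe(recipe: Recipe, pool: List[Dict]) -> float:
    """Stand-in for training a small proxy model on the filtered data
    and returning a cheap feedback signal (higher is better)."""
    kept = [
        s for s in pool
        if all(s[k] >= t for k, t in recipe.thresholds.items())
    ]
    if not kept:
        return 0.0
    quality = sum(s["text_quality"] for s in kept) / len(kept)
    coverage = len(kept) / len(pool)       # crude proxy for data diversity
    return 0.5 * quality + 0.5 * coverage  # toy quality/diversity trade-off


def analyze(history: List[Dict]) -> Recipe:
    """Pick the best-scoring recipe seen so far (a stand-in for the richer
    analysis the sandbox would perform on accumulated probe feedback)."""
    return max(history, key=lambda h: h["score"])["recipe"]


def refine(best: Recipe) -> Recipe:
    """Perturb the best recipe's thresholds to propose a new candidate."""
    return Recipe({
        k: min(max(v + random.uniform(-0.1, 0.1), 0.0), 1.0)
        for k, v in best.thresholds.items()
    })


if __name__ == "__main__":
    random.seed(0)
    # Toy data pool with pre-computed per-sample statistics.
    pool = [
        {"text_quality": random.random(), "image_text_match": random.random()}
        for _ in range(1000)
    ]
    history = [{"recipe": Recipe(), "score": probe(Recipe(), pool)}]
    for _ in range(10):  # a few cheap probe iterations
        candidate = refine(analyze(history))
        history.append({"recipe": candidate, "score": probe(candidate, pool)})
    best = max(history, key=lambda h: h["score"])
    print(f"best score: {best['score']:.3f}, "
          f"thresholds: {best['recipe'].thresholds}")
```

The sketch highlights the key design point of the workflow: each iteration is cheap, so many candidate data recipes can be probed with small-scale feedback before committing compute to a full-scale training run.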
Lay Summary: Developing powerful AI models that understand different types of information, like images and text, is challenging. Currently, the way we improve these models is often separated from how we prepare the data they learn from. This separation makes the process inefficient and hinders their full potential. To solve this, we created a specialized digital 'sandbox' called Data-Juicer Sandbox. This platform brings together model development and data preparation, allowing researchers to refine both simultaneously. It uses a 'Probe-Analyze-Refine' approach, where we test, evaluate, and then intelligently adjust both the AI model and its training data based on feedback. We've shown this approach works across various complex AI tasks, from understanding images and text together to generating videos, leading to significant performance improvements and even ranking first on a key benchmark. Our extensive experiments reveal crucial insights into how data quality, model behavior, and computational costs interact. By open-sourcing our tools, we aim to make this integrated development process accessible to everyone, speeding up future AI advancements.
Link To Code: https://modelscope.github.io/data-juicer/en/main/docs/Sandbox.html
Primary Area: Deep Learning->Foundation Models
Keywords: Data Processing, Multimodal Large Models, Sandbox
Submission Number: 8414