MagicGen: A Universal Multimodal Data Synthesis Agent for Domain-Specific Vision-Language Model Tuning

13 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Multimodal Data Synthesis, Vision-Language Model, Agent
Abstract: Vision–language models (VLMs) need large, domain-aligned multimodal data, yet high-quality collection is costly and slow, especially in specialized domains with privacy, expertise, and distribution-shift constraints. Current synthesis methods are narrow, labor-intensive, or lack rigorous QC, yielding brittle pipelines and noisy supervision. We introduce MagicGen, a universal agent that composes end-to-end, domain-specific data pipelines from natural-language prompts. Using unified interfaces, MagicGen selects and chains tools for image synthesis, text generation, augmentation, and modality transformation, enabling modular and scalable composition. The agent is trained with hybrid supervision: expert-authored reference pipelines plus LLM-generated candidates iteratively verified by humans for robust cross-domain generalization. We also propose an automated hierarchical evaluation pipeline: Image Validation (aesthetic + technical metrics) and Annotation Validation (multi-model discriminator with iterative decisions) for reliable quality control. Across diverse VLM tuning scenarios, MagicGen boosts data quality, reduces manual effort, and accelerates scalable dataset construction. It outperforms strong baselines on downstream tasks with less human oversight, and ablations confirm the importance of curated tool modularity, hierarchical evaluation, and hybrid training.
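The abstract describes tools exposed through unified interfaces that the agent chains into pipelines, followed by hierarchical quality control (image validation, then annotation validation). A minimal sketch of that composition pattern, with all tool names, functions, and checks being illustrative assumptions rather than the paper's actual implementation:

```python
from typing import Callable, Dict, List

# Hypothetical unified interface: every tool maps a sample dict to a sample
# dict, so the agent can freely chain tools selected from a registry.
Tool = Callable[[dict], dict]

TOOL_REGISTRY: Dict[str, Tool] = {
    "image_synthesis": lambda s: {**s, "image": f"img<{s['prompt']}>"},
    "text_generation": lambda s: {**s, "caption": f"caption for {s['image']}"},
    "augmentation":    lambda s: {**s, "image": s["image"] + "+aug"},
}

def run_pipeline(steps: List[str], sample: dict) -> dict:
    """Chain the named tools over one sample, in order."""
    for name in steps:
        sample = TOOL_REGISTRY[name](sample)
    return sample

def validate(sample: dict, min_caption_len: int = 5) -> bool:
    """Toy hierarchical QC: image validation first, then annotation validation."""
    if "image" not in sample:                              # image stage
        return False
    return len(sample.get("caption", "")) >= min_caption_len  # annotation stage

out = run_pipeline(["image_synthesis", "text_generation", "augmentation"],
                   {"prompt": "chest x-ray"})
print(out["image"], validate(out))  # → img<chest x-ray>+aug True
```

The dict-in/dict-out contract is what makes the composition modular: the agent only needs to pick an ordered list of tool names, and any tool (synthesis, augmentation, or modality transformation) can slot into any position.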
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4601