Analyze, Generate, Improve: Failure-Based Data Generation for Large Multimodal Models

Published: 06 May 2025, Last Modified: 06 May 2025 · SynData4CV · CC BY 4.0
Keywords: Synthetic data, model failures, visual instruction tuning
Abstract: Training models on synthetic data is an effective strategy for improving large multimodal models (LMMs), given the scarcity of high-quality paired image-text data. Existing methods generate multimodal datasets but do not address specific reasoning deficiencies in LMMs. In contrast, humans learn efficiently by focusing on past failures. Inspired by this, we propose a synthetic data generation approach that analyzes an LMM's reasoning failures and uses frontier models to generate and filter high-quality examples targeting those failures. Our method produces a 553k-example multimodal instruction tuning dataset that improves LMM performance, even surpassing models trained on an equivalent amount of real data, demonstrating the high value of synthetic data targeted at specific reasoning failure modes in LMMs.
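The abstract describes a three-stage pipeline: analyze the target LMM's failures, generate similar examples with a frontier model, and filter them for quality. A minimal sketch of that loop is below; all function names, data structures, and the stand-in generation/filtering logic are illustrative assumptions, not the authors' actual implementation (which relies on frontier models for the generate and filter stages).

```python
from dataclasses import dataclass

@dataclass
class Example:
    image_desc: str   # textual stand-in for an image
    question: str
    answer: str

def analyze_failures(examples, model_answers, gold_answers):
    """Stage 1: collect the examples the target LMM got wrong."""
    return [ex for ex, pred, gold in zip(examples, model_answers, gold_answers)
            if pred != gold]

def generate_variant(failure):
    """Stage 2 (hypothetical): a frontier model would synthesize new
    examples probing the same failure mode; here we emit a templated
    variant as a placeholder."""
    return Example(failure.image_desc,
                   f"(variant) {failure.question}",
                   failure.answer)

def quality_filter(candidates):
    """Stage 3 (hypothetical): a frontier model would score and filter
    candidates; here we simply keep candidates with a non-empty answer."""
    return [c for c in candidates if c.answer]

def build_dataset(examples, model_answers, gold_answers):
    failures = analyze_failures(examples, model_answers, gold_answers)
    candidates = [generate_variant(f) for f in failures]
    return quality_filter(candidates)
```

In the paper's setting, the generated examples accumulate into an instruction-tuning set (553k examples) used to retrain the LMM; the sketch above only shows the per-example data flow.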
Submission Number: 45