CycleAug: Cycle-Consistent Visual Augmentation for Large Multimodal Models

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: large multimodal models; synthetic data; data augmentation
TL;DR: We introduce CycleAug, a novel framework that efficiently generates diverse synthetic images from existing IQA triplets through cycle-consistency, augmenting the visual instruction tuning of MLLMs.
Abstract:

Training multimodal large language models (MLLMs) requires high-quality image-question-answer (IQA) triplets, which are labour-intensive to curate and often lack diversity. We propose a novel data augmentation framework for visual instruction tuning that efficiently generates diverse synthetic images from existing IQA anchor triplets. To ensure that the generated images align with their associated QA pairs, we propose CycleAug --- cycle-consistent visual augmentation, which synthesizes images from text (text $\rightarrow$ image) and then performs a verification step confirming that the answers derived from the synthetic images match the original answers (image $\rightarrow$ text), ensuring consistency across images, questions, and answers. By combining synthetic images with high-quality real data during training, we demonstrate that these synthetic triplets act as an implicit regularizer, improving the robustness of MLLMs and enabling analogical reasoning. Extensive experiments show that our approach improves model performance on multiple visual question-answering benchmarks without additional real-world data. This work highlights the potential of leveraging visual foundation models to enhance visual instruction tuning in MLLMs.
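The abstract's generate-then-verify loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: `text_to_image` and `answer_question` are hypothetical placeholders standing in for a text-to-image model and an MLLM's VQA interface, and the string-equality check is a simplification of whatever answer-matching criterion the paper actually uses.

```python
# Sketch of a cycle-consistency filter over IQA anchor triplets.
# `text_to_image` and `answer_question` are hypothetical stand-ins for a
# diffusion model and an MLLM; a real pipeline would call those models.

def text_to_image(caption: str) -> str:
    # Placeholder (text -> image): returns a string tag instead of pixels.
    return f"<image generated from: {caption}>"

def answer_question(image: str, question: str) -> str:
    # Placeholder (image -> text): a toy VQA rule for demonstration only.
    return "yes" if "red" in image else "no"

def cycle_consistent_augment(anchors):
    """Keep only synthetic images whose VQA answer matches the anchor answer."""
    kept = []
    for caption, question, answer in anchors:
        image = text_to_image(caption)                 # synthesize image
        predicted = answer_question(image, question)   # verify via VQA
        if predicted.strip().lower() == answer.strip().lower():
            kept.append((image, question, answer))     # cycle-consistent triplet
    return kept

anchors = [
    ("a red apple on a table", "Is the apple red?", "yes"),
    ("a green apple on a table", "Is the apple red?", "no"),
]
augmented = cycle_consistent_augment(anchors)
```

Synthetic triplets passing the check would then be mixed with the real IQA data during instruction tuning, as described above.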

Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11706