Vision-Language Models (VLMs) have demonstrated remarkable generalization across tasks by aligning visual and linguistic representations. High-quality visual instruction data is critical for enhancing their performance. However, current visual instruction tuning datasets, which are primarily derived from past visual tasks, have several limitations. For instance, the range of question types is often restricted and closely tied to the original visual tasks. Furthermore, image diversity is limited, as images collected for specialized vision tasks do not adequately represent real-world user queries. Additionally, previous instruction datasets tend to lack complexity, focusing on single tasks such as captioning or OCR, which makes it difficult to train models for more complex, multi-skill scenarios. To address these limitations, we propose a novel paradigm called strategy-centric synthesis: automatically synthesizing high-quality instruction data from large-scale image-text pairs. First, we employ an efficient heuristic method to select high-quality, complex images from DataComp-1B image-text pairs. Carefully crafted prompts and these images are fed to VLMs to extract high-quality query strategies and generate corresponding image descriptions. These descriptions are subsequently used to retrieve images aligned with specific questioning strategies. Finally, the retrieved images and their matching strategies are used to synthesize high-quality instruction data. Our experiments indicate that with continued instruction fine-tuning via LoRA on only 3,000 newly synthesized data samples (0.45% of the LLaVA-1.5 instruction tuning dataset), the model significantly outperforms the original LLaVA-1.5-7B across multiple benchmarks, demonstrating the effectiveness of our approach.
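
To make the described pipeline concrete, below is a minimal sketch of the four stages (heuristic image selection, strategy extraction, description-based retrieval, and instruction synthesis). All helper functions (`score_image_complexity`, `call_vlm`, `retrieve_images`) are placeholders we assume for illustration; this is not the authors' implementation and the actual heuristics, prompts, and retrieval method are defined in the paper.

```python
"""Illustrative sketch of strategy-centric synthesis; all helpers are stand-ins."""
from dataclasses import dataclass


@dataclass
class Strategy:
    """A questioning strategy plus a description of the images it applies to."""
    name: str
    image_description: str


def score_image_complexity(caption: str) -> float:
    # Placeholder heuristic: caption length as a rough proxy for how
    # information-rich (and thus complex) the paired image is.
    return float(len(caption.split()))


def call_vlm(prompt: str, image: str) -> str:
    # Placeholder for querying a VLM (API or local model) with an image.
    return f"[VLM output for {image!r}]"


def retrieve_images(description: str, pool: list, k: int = 5) -> list:
    # Placeholder for description-to-image retrieval (e.g., embedding similarity).
    return pool[:k]


def synthesize_instructions(pairs, image_pool, top_n=100):
    # 1) Heuristically select high-quality, complex images from image-text pairs.
    selected = sorted(pairs, key=lambda p: score_image_complexity(p[1]),
                      reverse=True)[:top_n]

    instruction_data = []
    for image, _caption in selected:
        # 2) Prompt a VLM to extract a query strategy and an image description.
        extracted = call_vlm("Extract a questioning strategy and describe "
                             "the kind of image it suits.", image)
        strategy = Strategy(name=extracted, image_description=extracted)

        # 3) Retrieve images that match the strategy's image description.
        matched = retrieve_images(strategy.image_description, image_pool)

        # 4) Synthesize instruction-response pairs for each retrieved image.
        for img in matched:
            qa = call_vlm(f"Following this strategy, write a question and "
                          f"its answer: {strategy.name}", img)
            instruction_data.append({"image": img,
                                     "strategy": strategy.name,
                                     "qa": qa})
    return instruction_data


if __name__ == "__main__":
    demo_pairs = [("img_001.jpg", "a crowded street market with vendors and signs"),
                  ("img_002.jpg", "a cat")]
    print(synthesize_instructions(demo_pairs, image_pool=["img_010.jpg", "img_011.jpg"]))
```

The resulting instruction-response pairs would then be used for continued LoRA fine-tuning of the base VLM, as reported in the experiments.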