Keywords: Multimodal learning; Knowledge integration; Synthetic datasets; Zero-shot classification
TL;DR: We introduce a framework that improves the quality of synthetic image-text pairs for multimodal models such as CLIP by explicitly integrating real-world knowledge into the text-generation process.
Abstract: In this paper, we introduce a framework to enhance the quality of synthetic image-text pairs for multimodal models such as CLIP. Our approach, named KnowData, explicitly integrates real-world knowledge into the generation of text descriptions. It combines structured knowledge from knowledge graphs such as ConceptNet with unstructured knowledge extracted from Wikipedia to ensure that the generated text descriptions are both contextually rich and accurately reflective of real-world knowledge. Additionally, we leverage Large Language Models to expand, summarize, and refine the text descriptions so that they remain coherent. These enriched texts are then used to generate images with advanced text-to-image models such as Stable Diffusion and DALL-E 3. CLIP models are subsequently fine-tuned on these synthetic image-text pairs for zero-shot classification tasks. Our experiments across 9 datasets demonstrate that CLIP models fine-tuned with our knowledge-guided synthetic datasets outperform state-of-the-art (SOTA) zero-shot CLIP methods (e.g., +11.23% on DTD and +4% on EuroSAT with the ViT-B/16 model; +11.47% on CIFAR-100 and +7.99% on DTD with the ResNet-50 model). These results showcase the improved out-of-distribution robustness and adaptability of our approach across a diverse set of data domains. We further substantiate the design of KnowData through ablation studies, which reveal that knowledge integration not only enhances zero-shot performance but also improves the reliability, diversity, and level of detail of the generated synthetic images, thereby offering better data scaling behavior for model performance.
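The pipeline described in the abstract can be illustrated with a minimal sketch (not the authors' implementation). It assumes access to the public ConceptNet REST API and the `wikipedia` and `diffusers` Python packages, and it replaces the LLM expansion/summarization/refinement step with a hypothetical `expand_with_llm` placeholder that simply fuses the retrieved knowledge into a caption template.

```python
# Minimal sketch of a KnowData-style generation loop (illustrative, not the authors' code).
# Assumptions: internet access to the public ConceptNet REST API; the `wikipedia`
# and `diffusers` packages installed; a GPU for Stable Diffusion. `expand_with_llm`
# is a simplified stand-in for the LLM expansion/summarization/refinement step.
import requests
import wikipedia
from diffusers import StableDiffusionPipeline


def conceptnet_facts(concept: str, limit: int = 5) -> list[str]:
    """Fetch a few surface-form relations (structured knowledge) for a class name."""
    resp = requests.get(f"http://api.conceptnet.io/c/en/{concept}", params={"limit": limit})
    edges = resp.json().get("edges", [])
    return [e["surfaceText"] for e in edges if e.get("surfaceText")]


def wikipedia_context(concept: str) -> str:
    """Fetch a short unstructured summary (encyclopedic knowledge) for the class name."""
    try:
        return wikipedia.summary(concept, sentences=2)
    except wikipedia.exceptions.WikipediaException:
        return ""


def expand_with_llm(class_name: str, facts: list[str], context: str) -> str:
    """Placeholder for the LLM step: fuse the knowledge into a caption template.

    In the paper, this expansion/summarization/refinement is performed by an LLM.
    """
    fact_str = "; ".join(facts)
    return f"A photo of a {class_name}. {context} Notable properties: {fact_str}."


# Text-to-image generation with an off-the-shelf Stable Diffusion checkpoint.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")


def make_synthetic_pair(class_name: str):
    """Produce one (image, caption) pair for later CLIP fine-tuning."""
    caption = expand_with_llm(
        class_name, conceptnet_facts(class_name), wikipedia_context(class_name)
    )
    image = pipe(caption).images[0]
    return image, caption
```

In the full framework, the composed captions would additionally be refined and filtered by an LLM, and the resulting image-text pairs would be used to fine-tune CLIP before zero-shot evaluation on the downstream classification benchmarks.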
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7341