Keywords: vision-language, image-text, contrastive learning, CLIP, dataset, knowledge graph, wordnet, wikidata, entities, attributes, image search, imagenet, inaturalist, cub, ovad, attribute classification, animals, plants, small data, efficiency, open vocabulary
TL;DR: We train a CLIP model on the LivingThings dataset (8.9M images, 12.2M texts) focused on animals and plants, achieving performance comparable to or better than larger CLIP models on tasks like iNaturalist, using data derived from WordNet, Wikidata, and LLMs.
Abstract: Vision-language contrastive learning based on the CLIP method has been instrumental in driving recent advancements in computer vision. However, high-quality CLIP models rely on very large datasets. This makes them expensive to train and hampers the scientific analysis of these models. We show how to train a base-size CLIP model efficiently for a broad domain on a much smaller amount of data. We demonstrate this specifically with the automated creation of a dataset named LivingThings with 8.9M images of animals and plants and 12.2M texts. The dataset is obtained via focused image-search queries of three kinds: entity queries (e.g., "eagle"), entity-attribute queries (e.g., "bushy tail of a fox"), and type-attribute queries (e.g., "insect on a leaf"). The entities and types, as well as some of the texts, are derived from the WordNet and Wikidata knowledge graphs; the attributes are obtained via LLMs. We train a CLIP model from scratch on LivingThings and evaluate it on ImageNet, iNaturalist, and CUB for object classification and on OVAD and CUB for attribute classification. On the broad target domain of animals and plants, our model achieves performance comparable to, and sometimes much better than, that of models with orders of magnitude more parameters or training data. For instance, our ViT-B-32 model improves over much larger state-of-the-art CLIP models on the iNaturalist 21 object classification task. We will publicly release our code and dataset.
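The abstract describes three kinds of image-search queries composed from knowledge-graph entities/types and LLM-generated attributes. Below is a minimal illustrative sketch of how such query strings could be assembled; the entity, type, and attribute lists are placeholder examples, not the authors' actual data or code.

```python
# Sketch of the three query kinds: entity, entity-attribute, and type-attribute.
# Entities/types would come from WordNet/Wikidata and attributes from an LLM;
# the values below are illustrative placeholders only.

entities = ["eagle", "fox"]
types = ["insect", "bird"]
entity_attributes = {"fox": ["bushy tail"]}
type_attributes = {"insect": ["on a leaf"]}

def entity_queries(entities):
    # Entity queries, e.g., "eagle"
    return list(entities)

def entity_attribute_queries(entity_attributes):
    # Entity-attribute queries, e.g., "bushy tail of a fox"
    return [f"{attr} of a {ent}"
            for ent, attrs in entity_attributes.items()
            for attr in attrs]

def type_attribute_queries(type_attributes):
    # Type-attribute queries, e.g., "insect on a leaf"
    return [f"{typ} {attr}"
            for typ, attrs in type_attributes.items()
            for attr in attrs]

queries = (entity_queries(entities)
           + entity_attribute_queries(entity_attributes)
           + type_attribute_queries(type_attributes))
print(queries)  # ['eagle', 'fox', 'bushy tail of a fox', 'insect on a leaf']
```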
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10007