When and How Does CLIP Enable Domain and Compositional Generalization?

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 Spotlight Poster · CC BY 4.0
TL;DR: We studied CLIP's domain and compositional generalization via systematic data-centric experiments and mechanistic analyses, revealing that domain diversity and sufficiently shared intermediate features and circuitry are crucial for generalization.
Abstract: The remarkable generalization performance of contrastive vision-language models like CLIP is often attributed to the diversity of their training distributions. However, key questions remain unanswered: Can CLIP generalize to an entirely unseen domain when trained on a diverse mixture of domains (domain generalization)? Can it generalize to unseen classes within partially seen domains (compositional generalization)? What factors affect such generalization? To answer these questions, we trained CLIP models on systematically constructed training distributions with controlled domain diversity and object class exposure. Our experiments show that domain diversity is essential for both domain and compositional generalization, yet compositional generalization can be surprisingly weaker than domain generalization when the training distribution contains a suboptimal subset of the test domain. Through data-centric *and* mechanistic analyses, we find that successful generalization requires the learning of sufficiently shared representations in intermediate layers and circuits.
Lay Summary: CLIP is a widely used foundation model that often helps large language models understand images. In this work, we studied how well CLIP generalizes to unseen visual styles of images of objects and animals, called *domains*, such as photos, drawings, or sketches. We focused on two key questions: (1) Can CLIP generalize to sketches (or any other domain) it has not seen during training? (2) If CLIP has seen sketches of cats and photos, paintings, etc. of both cats and dogs, can it generalize to sketches of dogs? To address these questions, we created carefully curated datasets that control for seen and unseen domains and object/animal classes. Our experiments reaffirmed that including *diverse* domains in training improves generalization. Surprisingly, CLIP can sometimes perform worse when it has *partially* seen a domain than when it has *not seen it at all*. We also found that robust generalization requires learning *sufficiently shared internal representations and mechanisms* across domains. Our findings reveal previously overlooked limitations in CLIP's generalization and advance our understanding of the factors that affect generalization. These insights can guide the development of models that generalize more reliably across visual domains.
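To make the compositional-generalization setup from the lay summary concrete, below is a minimal sketch (not the authors' pipeline; see the linked code repository for the actual data construction) of how a domain/class split could be built so that one domain is only *partially* seen during training. The domain and class names here are illustrative assumptions.

```python
# Minimal sketch of a compositional-generalization split, assuming samples
# carry a domain label (e.g. "photo", "painting", "sketch") and a class
# label (e.g. "cat", "dog"). Combinations like (sketch, dog) are held out
# of training and appear only at test time.

from dataclasses import dataclass

@dataclass
class Sample:
    path: str
    domain: str   # visual style, e.g. "photo", "painting", "sketch"
    label: str    # object/animal class, e.g. "cat", "dog"

def compositional_split(samples, held_out_domain="sketch", held_out_classes=("dog",)):
    """Split samples so the held-out (domain, class) pairs are test-only."""
    train, test = [], []
    for s in samples:
        if s.domain == held_out_domain and s.label in held_out_classes:
            test.append(s)    # unseen combination: e.g. a sketch of a dog
        else:
            train.append(s)   # all other domain/class combinations
    return train, test

# Example usage with a toy dataset
data = [
    Sample("cat_photo.jpg", "photo", "cat"),
    Sample("dog_photo.jpg", "photo", "dog"),
    Sample("cat_sketch.jpg", "sketch", "cat"),
    Sample("dog_sketch.jpg", "sketch", "dog"),
]
train, test = compositional_split(data)
assert all(not (s.domain == "sketch" and s.label == "dog") for s in train)
```

In this setup, training still contains sketches (of cats) and dogs (in photos), so a model that generalizes compositionally should recognize sketches of dogs at test time.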
Link To Code: https://github.com/lmb-freiburg/understanding-clip-ood
Primary Area: Deep Learning->Robustness
Keywords: CLIP, Compositional Generalization, Domain Generalization, Out-of-Distribution Robustness, OOD generalization
Submission Number: 1549