Keywords: vision-language compositionality, alignment, theory of CLIP
Abstract: Contrastive Language-Image Pre-training (CLIP) leads the research thread of vision-language models with its remarkable generalizability across diverse tasks, achieved by learning a modality-invariant alignment between images and text. However, prevailing theoretical studies of CLIP, particularly those grounded in Causal Representation Learning (CRL), conventionally treat the text input as a fixed-length vector. This treatment falls short of explaining CLIP-based prompting mechanisms that hinge on word- and phrase-level details, and cannot interpret several key technical characteristics of CLIP, e.g., its failures in compositional understanding and its latent dependency on the prompt for out-of-distribution (OOD) generalization. To close this gap, this paper develops a more granular, token-aware CRL theory. By extending existing Structural Causal Model (SCM) assumptions with sequential language token generation, we propose a new CRL framework that yields two critical contributions. First, our framework provides the first principled explanation for CLIP's weakness in compositional reasoning, proving the non-identifiability of "pseudo-optimal" text encoders that satisfy the alignment objective yet fail to capture compositional semantics. Second, it establishes a novel connection between CLIP's out-of-distribution generalization and Invariant Risk Minimization (IRM) models. Our theoretical insights lead to improved techniques for CLIP, which we validate empirically.
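For context, the "alignment objective" the abstract refers to is the standard symmetric contrastive (InfoNCE) loss used to train CLIP. The following is a minimal PyTorch sketch of that objective, not the paper's own code; the function name, embedding dimensions, and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_alignment_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss aligning image and text embeddings.

    image_emb, text_emb: (batch, dim) tensors; the i-th image and the i-th text
    are treated as the only positive pair for each other within the batch.
    """
    # Project both modalities onto the unit sphere so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) matrix of scaled pairwise similarities.
    logits = image_emb @ text_emb.t() / temperature

    # Diagonal entries are the positive pairs.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)  # match each caption to its image
    return 0.5 * (loss_i2t + loss_t2i)

# Illustrative usage with random embeddings.
if __name__ == "__main__":
    imgs = torch.randn(8, 512)
    txts = torch.randn(8, 512)
    print(clip_alignment_loss(imgs, txts).item())
```

A text encoder that minimizes this batch-level alignment objective need not resolve token order or word bindings, which is the gap the paper's "pseudo-optimal" encoder argument formalizes.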
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 6916