Keywords: vision-language compositionality, alignment, theory of CLIP
Abstract: Contrastive Language-Image Pre-training (CLIP) leads the research thread of vision-language models with its remarkable generalizability across diverse tasks, achieved by learning a modality-invariant alignment between images and text. However, prevailing theoretical studies of CLIP, particularly those grounded in Causal Representation Learning (CRL), conventionally treat the text input as a fixed-length vector. This treatment falls short of explaining CLIP-based prompting mechanisms that hinge on word- and phrase-level details, and cannot interpret several key technical characteristics of CLIP, e.g., its failures in compositional understanding and its latent dependency on the prompt for out-of-distribution (OOD) generalization. To close this gap, this paper develops a more granular, token-aware CRL theory. By extending existing Structural Causal Model (SCM) assumptions with sequential language token generation, we propose a new CRL framework that yields two critical contributions. First, our framework provides the first principled explanation for CLIP's weakness in compositional reasoning, proving the non-identifiability of "pseudo-optimal" text encoders that satisfy the alignment objective yet fail to capture compositional semantics. Second, it establishes a novel connection between CLIP's out-of-distribution generalization and Invariant Risk Minimization (IRM) models. Our theoretical insights lead to improved techniques for CLIP, which we validate empirically.
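For context, the "alignment objective" the abstract refers to is the standard symmetric contrastive (InfoNCE) loss used to train CLIP. The following is a minimal PyTorch sketch of that objective, not the paper's own code; the function name, embedding dimensions, and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_alignment_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss aligning image and text embeddings.

    image_emb, text_emb: (batch, dim) tensors; the i-th image and the i-th text
    are treated as the only positive pair for each other within the batch.
    """
    # Project both modalities onto the unit sphere so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) matrix of scaled pairwise similarities.
    logits = image_emb @ text_emb.t() / temperature

    # Diagonal entries are the positive pairs.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)  # match each caption to its image
    return 0.5 * (loss_i2t + loss_t2i)

# Illustrative usage with random embeddings.
if __name__ == "__main__":
    imgs = torch.randn(8, 512)
    txts = torch.randn(8, 512)
    print(clip_alignment_loss(imgs, txts).item())
```

A text encoder that minimizes this batch-level alignment objective need not resolve token order or word bindings, which is the gap the paper's "pseudo-optimal" encoder argument formalizes.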
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 6916