Learning Invariant Causal Mechanism from Vision-Language Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Poster · CC BY 4.0
Abstract: Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success, but its performance can degrade when fine-tuned in out-of-distribution (OOD) scenarios. We model the prediction process using a Structural Causal Model (SCM) and show that the causal mechanism involving both invariant and variant factors in training environments differs from that in test environments. In contrast, the causal mechanism with solely invariant factors remains consistent across environments. We theoretically prove the existence of a linear mapping from CLIP embeddings to invariant factors, which can be estimated using interventional data. Additionally, we provide a condition to guarantee low OOD risk of the invariant predictor. Based on these insights, we propose the Invariant Causal Mechanism of CLIP (CLIP-ICM) framework. CLIP-ICM involves collecting interventional data, estimating a linear projection matrix, and making predictions within the invariant subspace. Experiments on several OOD datasets show that CLIP-ICM significantly improves the performance of CLIP. Our method offers a simple but powerful enhancement, boosting the reliability of CLIP in real-world applications.
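For readers skimming the abstract, the sketch below illustrates one plausible reading of the three steps (collect interventional data, estimate a linear projection matrix, predict within the invariant subspace). It is a minimal numpy illustration, not the paper's implementation (see the linked repository); the function names, the difference-based estimate of the variant subspace, and the choice of `k_variant` are all assumptions made for exposition.

```python
# Hypothetical sketch of a CLIP-ICM-style pipeline, assuming:
#  - emb_a, emb_b: (n, d) CLIP embeddings of interventional pairs that share the
#    same invariant content but differ in one intervened (variant) factor,
#  - the variant subspace can be approximated from the pairwise differences.
# All names here are illustrative, not taken from the paper's code.
import numpy as np

def estimate_invariant_projection(emb_a: np.ndarray, emb_b: np.ndarray, k_variant: int) -> np.ndarray:
    """Estimate a (d, d) projection onto an invariant subspace from interventional pairs."""
    diff = emb_a - emb_b                          # directions along which only variant factors change
    # Principal directions of the differences approximate the variant subspace.
    _, _, vt = np.linalg.svd(diff - diff.mean(axis=0), full_matrices=False)
    v = vt[:k_variant].T                          # (d, k_variant) basis of the variant subspace
    # Project onto its orthogonal complement, i.e. the invariant subspace.
    return np.eye(emb_a.shape[1]) - v @ v.T

def zero_shot_predict(image_embs: np.ndarray, text_embs: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Cosine-similarity zero-shot classification after projecting both modalities."""
    zi = image_embs @ proj
    zt = text_embs @ proj
    zi /= np.linalg.norm(zi, axis=1, keepdims=True)
    zt /= np.linalg.norm(zt, axis=1, keepdims=True)
    return (zi @ zt.T).argmax(axis=1)             # predicted class index per image
```

Because the projection is applied to frozen CLIP embeddings, this kind of step requires no retraining of the backbone, which matches the "light extra step" described in the lay summary.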
Lay Summary: Large-scale vision-language models like CLIP (which learns to match images and text) have proven remarkably good at identifying images without needing to be trained on specific tasks. However, when these models are fine-tuned for new, real-world tasks, they often struggle to generalize — especially when the test data looks different from what the model saw during training. Our research investigates why this happens and how to fix it. We take a causal perspective, modeling how different “hidden factors” influence CLIP’s predictions. Some factors are consistent across different environments (like an animal’s shape), while others vary (like lighting or background). We show that if a model relies too much on the variable factors, its predictions can break down in new situations. But if it uses only the consistent ones, it can make reliable predictions even when the environment changes. We then prove that these consistent factors can be recovered from CLIP’s internal features using a simple linear transformation — but only if we have access to carefully designed “intervention” data (like changing only one thing in an image at a time). Based on this, we introduce a new framework, CLIP-ICM, that projects CLIP’s features into an “invariant” space before making predictions. This process doesn’t require retraining CLIP itself, just a light extra step. Across several challenging benchmarks, our approach improves accuracy significantly. It helps CLIP maintain its zero-shot power (handling new categories it has never seen) while becoming more reliable when the environment shifts — a crucial step for deploying AI systems in the real world.
Link To Code: https://github.com/ZeenSong/CLIP-ICM
Primary Area: General Machine Learning->Representation Learning
Keywords: Vision-Language Models, Causal Representation Learning, Out-of-Distribution Generalization, Representation Learning
Submission Number: 593