Recent studies have shown that pre-trained vision-language models can effectively adapt to diverse downstream tasks through parameter-efficient prompt tuning. Unfortunately, the tuned models can exploit spurious correlations during prediction, resulting in a failure to generalize to out-of-distribution test data, especially when the tuning dataset exhibits bias. How to achieve cross-modal mitigation of spurious correlations during prompt tuning of vision-language models remains an open question. In this paper, we tackle this challenging problem by leveraging the stable relationship between necessary and sufficient causal features and the corresponding label. On the one hand, we constrain the prompt learning process by reinforcing the necessary and sufficient connection between the textual labels and textual features. On the other hand, we measure and maximize the probability of necessity and sufficiency between the textual features and the filtered visual features to enhance cross-modal feature alignment. By iteratively optimizing these two objectives, we achieve cross-modal mitigation of spurious correlations because the logical equivalence between textual labels and visual features is bolstered. Our theoretical analysis of generalization error indicates that the proposed method achieves a tighter generalization error bound than existing approaches. We evaluate the proposed method on several commonly adopted out-of-distribution datasets, and the empirical results demonstrate its superiority over state-of-the-art competitors.
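As a point of reference for the maximization objective described above, the standard counterfactual definition of the probability of necessity and sufficiency (PNS), following Pearl, is sketched below; reading the filtered visual feature as the cause $X$ and the textual feature as the outcome $Y$ is our interpretation of how the abstract's objective maps onto this notation, not notation taken from the paper itself.

$$
\mathrm{PNS} \;=\; P\!\left(Y_{X=x} = y,\; Y_{X=x'} = y'\right),
$$

that is, the probability that the outcome would be $y$ had $X$ been $x$ and would not be $y$ had $X$ taken the alternative value $x'$. Under the standard exogeneity and monotonicity assumptions, this quantity becomes identifiable from observational data as $\mathrm{PNS} = P(y \mid x) - P(y \mid x')$, which is the kind of estimable term a prompt-tuning objective could plausibly maximize.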