Keywords: Federated learning, Causal inference, CLIP
TL;DR: We propose CauFed-CLIP, a novel Causal-based Federated Contrastive Language-Image Pre-training model.
Abstract: Although vision-language models (VLMs) have achieved remarkable success, applying them directly in federated learning (FL) faces two key challenges: high communication and computation costs, and poor generalization caused by client data heterogeneity. To tackle these, we propose CauFed-CLIP, a novel Causal-based Federated Contrastive Language-Image Pre-training model. Our model reduces overhead by freezing the VLM backbone and training only a lightweight causal module on each client. To enhance generalization, it employs a progressive causal mechanism. It first disentangles the observed features (x) into domain-invariant (s) and domain-variant (z) representations, using global and local guidance to suppress spurious correlations between them. From this disentangled foundation, it then infers the underlying causal "concept" (c), a quasi-invariant latent variable that captures the essence of a category and holds only a weak causal link with the domain variable (z). Ultimately, relying solely on this pure concept c for prediction allows the model to move beyond superficial statistics and capture the core causal structure of the task.
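The abstract's pipeline (frozen backbone → disentangle x into s and z → infer concept c → predict from c alone) can be sketched as a minimal forward pass. This is an illustrative NumPy sketch, not the authors' implementation: all dimensions, weight names (`W_s`, `W_z`, `W_c`, `W_y`), and the use of simple linear-plus-tanh projections are assumptions; the paper's actual module, losses, and global/local guidance terms are not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_s, d_z, d_c, n_classes = 16, 8, 8, 4, 3

# Stand-in for frozen VLM backbone features x (in the paper, CLIP features;
# here just random vectors, since the backbone itself is frozen and not trained).
x = rng.normal(size=(5, d))

# Lightweight causal module: the only client-side trainable parameters.
W_s = rng.normal(size=(d, d_s)) * 0.1    # projects x -> domain-invariant s
W_z = rng.normal(size=(d, d_z)) * 0.1    # projects x -> domain-variant z
W_c = rng.normal(size=(d_s, d_c)) * 0.1  # infers concept c from s only
W_y = rng.normal(size=(d_c, n_classes)) * 0.1  # classifies from c only

def forward(x):
    s = np.tanh(x @ W_s)  # domain-invariant representation
    z = np.tanh(x @ W_z)  # domain-variant representation (kept out of prediction)
    c = np.tanh(s @ W_c)  # quasi-invariant causal concept
    logits = c @ W_y      # prediction relies solely on the pure concept c
    return s, z, c, logits

s, z, c, logits = forward(x)
```

Note the key structural choice the abstract describes: `z` is computed (so spurious correlations with `s` can be penalized during training) but never feeds the classifier, which sees only `c`.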
Supplementary Material: zip
Primary Area: causal reasoning
Submission Number: 6785