Compositional by Design: Background-Invariant Representations via Linear Additivity in VLMs
Keywords: Vision-Language Models, Spurious Correlations, Compositionality, Linear Additivity, Representation Learning, Robustness, Out-of-Domain Generalization
TL;DR: We exploit the inherent compositional linear additivity of VLM embedding spaces to develop a background-invariant pre-training method that systematically neutralizes spurious correlations and achieves state-of-the-art out-of-domain robustness.
Abstract: Vision-language models (VLMs), such as CLIP and SigLIP 2, are widely deployed, yet they often fail to generalize systematically out-of-domain, relying on statistical shortcuts rather than true compositional understanding. A prominent and practically critical failure mode is background-based spurious correlation, where models improperly entangle reusable components, namely foreground objects and their backgrounds. In this paper, we systematically quantify how VLMs internally represent compositionality, specifically examining the linear additivity of foreground and background concepts. Leveraging these internal representational structures, we develop a new mitigation method, Background-invariant Anchor Pre-training (BAP). Our method explicitly isolates compositional features to achieve state-of-the-art worst-group accuracy exceeding 90% on Waterbirds under perfect spurious correlation (no minority-group examples in the training data). BAP demonstrates how enforcing modular, compositional structures can drive robust, out-of-domain generalization across benchmarks. BAP is also highly practical: it relies exclusively on synthetic data and exhibits strong sim-to-real transfer, paving the way for safer deployment in real-world scenarios.
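For readers unfamiliar with the headline metric, worst-group accuracy is the minimum accuracy over all (label, background) groups, so a model that relies on the background shortcut scores poorly even when its average accuracy looks acceptable. The following is a minimal sketch of that computation, not the authors' evaluation code; the function name `worst_group_accuracy` and the toy labels and predictions are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def worst_group_accuracy(labels, backgrounds, predictions):
    """Return (worst accuracy, per-group accuracies) over (label, background) groups."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, b, p in zip(labels, backgrounds, predictions):
        group = (y, b)
        total[group] += 1
        correct[group] += int(p == y)
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group


# Toy evaluation set (illustrative only).
# label: 0 = landbird, 1 = waterbird; background: 0 = land, 1 = water.
labels      = [0, 0, 0, 0, 1, 1, 1, 1]
backgrounds = [0, 0, 1, 1, 0, 0, 1, 1]

# A hypothetical shortcut classifier that predicts the background, not the bird.
shortcut_preds = list(backgrounds)

wga, per_group = worst_group_accuracy(labels, backgrounds, shortcut_preds)
avg = sum(p == y for p, y in zip(shortcut_preds, labels)) / len(labels)
print(f"average accuracy: {avg:.2f}, worst-group accuracy: {wga:.2f}")
```

In this toy case the shortcut classifier is correct on the two majority groups and wrong on the two minority groups, which is why the worst-group figure, rather than the average, is the one reported for Waterbirds.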
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 121