CLIP Exhibits Improved Compositional Generalization Through Representation Disentanglement

24 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: visualization or interpretation of learned representations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Compositional generalization, Out-of-distribution generalization, Vision-language models, CLIP, Disentangled representations, Language supervision, Data-centric AI
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Vision-language models (VLMs), such as CLIP, have shown promising out-of-distribution (OoD) generalization under various types of distribution shift. Recent studies have attempted to identify the leading cause of this property. In this work, we pursue the same goal, but focus on a particular type of distribution shift in which test images contain unseen compositions of attribute-object pairs, while the objects and attributes are individually seen during training. The models are expected to classify these images into composition classes, i.e., attribute-object pairs, and also into object classes by ignoring attributes. We carefully designed an authentic image test dataset consisting of attributes for objects that are unlikely to be encountered in CLIP's training data. We found that the diversity of compositions in the training data, as measured by the normalized mutual information between objects and attributes, has a significant effect on the compositional generalization of CLIP models. We also found that the disentanglement of image/text representations with respect to the composition constituents plays a key role in the improved generalization of these models. We observe that larger training datasets could potentially trigger the emergence of such disentanglement, as compositions are typically more diverse in such datasets. We validate this hypothesis through different representation disentanglement metrics, including the Z-Diff and explicitness scores, for various CLIP models. Our findings reveal a correlation between better OoD performance and higher scores on these disentanglement metrics, suggesting that improved disentanglement potentially contributes to enhanced compositional OoD generalization in VLMs.
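For readers unfamiliar with the two quantities the abstract relies on, the following is a minimal, hypothetical sketch (not the authors' released code) of how they can be measured with scikit-learn on synthetic stand-in data: the normalized mutual information between object and attribute labels, used here as the compositional-diversity measure, and an explicitness score in its common form, i.e., the accuracy of a simple probe predicting a factor from frozen representations. All variable names and the synthetic embeddings are illustrative assumptions.

```python
# Hedged sketch: NMI-based compositional diversity and a simple explicitness probe.
# Labels and "CLIP embeddings" below are random placeholders, not real data.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy training annotations: one (attribute, object) pair per image.
objects = rng.integers(0, 10, size=5000)    # e.g. 10 object classes
attributes = rng.integers(0, 8, size=5000)  # e.g. 8 attribute classes

# Low NMI: attributes vary freely across objects, so compositions are diverse.
# High NMI: attributes are tied to specific objects, so few compositions are seen.
nmi = normalized_mutual_info_score(objects, attributes)
print(f"object/attribute NMI: {nmi:.3f}")

# Stand-in for frozen image embeddings (e.g. 512-d CLIP features).
z = rng.normal(size=(5000, 512))

# Explicitness: how decodable a factor is from the representation
# via a simple (here linear) probe; higher accuracy = more explicit.
z_tr, z_te, y_tr, y_te = train_test_split(z, objects, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(z_tr, y_tr)
print(f"explicitness (object probe accuracy): {probe.score(z_te, y_te):.3f}")
```

On the random embeddings above, the probe accuracy should sit near chance (0.1 for 10 classes); a disentangled representation of real images would score substantially higher. The Z-Diff score mentioned in the abstract is a related but more involved metric and is not sketched here.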
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9428