Keywords: Compositionality, Vision-Language Models, Causal Learning
Abstract: Recent work has empirically shown that Vision-Language Models (VLMs) struggle
to fully understand the compositional properties of human language, often
modeling an image caption as a “bag of words”. As a result, they perform
poorly on compositional tasks, which require a deeper understanding of the different
entities of a sentence (subject, verb, etc.) together with their mutual relationships.
In this paper, we model the dependency relations
among textual and visual tokens using a Causal Graphical Model (CGM), built with
a dependency parser, and we train a decoder conditioned on the VLM visual
encoder. Unlike standard autoregressive or parallel prediction, our decoder’s
generative process is partially ordered, following the CGM structure. This
structure encourages the decoder to learn only the main causal dependencies in
a sentence, discarding spurious correlations. Through extensive experiments on five
compositional benchmarks, we show that our method outperforms
all state-of-the-art compositional approaches by a large margin, and that it also improves
over methods trained on much larger datasets.
Our model weights and code are publicly available.
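To make the idea of a partially ordered generative process concrete, here is a minimal sketch (not the authors' code) of how a dependency parser can induce such an order over caption tokens: each token may only be generated after its head in the parse tree, rather than strictly left to right. It assumes spaCy with its small English model; the function name `dependency_levels` and the example caption are illustrative only.

```python
# Minimal sketch: derive a partial generation order over caption tokens
# from a dependency parse, grouping tokens by their depth in the tree.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def dependency_levels(caption: str):
    """Group tokens by depth in the dependency tree: a token can only be
    predicted after its head, giving a partial rather than a total order."""
    doc = nlp(caption)

    def depth(tok):
        d = 0
        while tok.head is not tok:  # in spaCy, the root is its own head
            tok = tok.head
            d += 1
        return d

    levels = {}
    for tok in doc:
        levels.setdefault(depth(tok), []).append(tok.text)
    return [levels[d] for d in sorted(levels)]

print(dependency_levels("A black dog chases a small white cat"))
# e.g. [['chases'], ['dog', 'cat'], ['A', 'black', 'a', 'small', 'white']]
```

Tokens within the same level are mutually unordered, which is what distinguishes this scheme from both fully autoregressive and fully parallel prediction.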
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11332