Keywords: Explainability, Vision-Language Models, Compositionality
TL;DR: Building a compositionality-robust CLIP model via region-aware training objectives, pushing it toward better reasoning.
Abstract: Vision-Language Models (VLMs) such as CLIP excel at broad multimodal tasks, yet struggle with compositional reasoning. Despite capturing coarse correlations, they often act like “bags-of-words,” missing fine-grained structure such as object–attribute bindings and inter-object relations. We attribute this to: (i) limited compositional diversity in large-scale image–text data, and (ii) contrastive objectives that emphasize global alignment over grounded structure. To address this, we propose a hierarchical fine-grained alignment framework that explicitly bridges visual and textual components at the object, attribute, and relation levels. Unlike prior work that relies on parsers, we leverage scene-graph-annotated datasets for structured supervision, requiring no extra labeling. We introduce a hierarchical fine-grained loss that complements standard contrastive learning by grounding entities and relations across modalities. Experiments on the compositional benchmarks SugarCrepe, What’sUp, and Cola show large gains in capturing nuanced structure, while preserving performance on standard vision-language tasks. Our RACA CLIP method improves compositional reasoning accuracy by +24.86% on SugarCrepe, +5.7% on What’sUp, and +4.76% on Cola, offering a simple yet effective path toward stronger, human-like compositional understanding.
Primary Area: interpretability and explainable AI
Submission Number: 24104