CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
TL;DR: CLoVe improves the language compositional skills of contrastive vision-language models while keeping their performance in other tasks.
Abstract: Recent years have witnessed a significant increase in the performance of Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as CLIP, have been leveraged in multiple settings and have demonstrated remarkable performance across several tasks. Such models excel at object-centric recognition yet learn text representations that seem invariant to word order, failing to compose known concepts in novel ways. However, no evidence exists that any VLM, including large-scale single-stream models such as GPT-4V, identifies compositions successfully. In this paper, we introduce a method that significantly improves the ability of existing CLIP-like pre-trained models to encode compositional language, with over 10% absolute improvement on compositionality benchmarks, while preserving their performance on other standard tasks. We will provide model weights that can be swapped in for existing CLIP-like weights to obtain a noticeable boost in compositional performance.
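For illustration, a minimal sketch of what such a drop-in weight swap could look like, assuming the OpenCLIP library and a standard ViT-B/32 backbone; the checkpoint filename clove_vit_b32.pt is hypothetical, since this page does not specify the release format.

```python
# Minimal sketch of a drop-in weight swap for a CLIP-like model.
# Assumes the open_clip library; "clove_vit_b32.pt" is a hypothetical
# checkpoint name -- the actual released format is not specified here.
import torch
import open_clip

# Load a standard CLIP backbone (the architecture must match the checkpoint).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)

# Swap in the compositionality-tuned weights in place of the originals.
state_dict = torch.load("clove_vit_b32.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```

Because only the weights change, any downstream pipeline built around the original CLIP encoders (tokenization, image preprocessing, similarity scoring) should work unmodified.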
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English