Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

ACL ARR 2024 June Submission 90 Authors

05 Jun 2024 (modified: 02 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: In this paper, we propose a new method to enhance compositional understanding in pre-trained vision and language models (VLMs) without sacrificing performance on the models' original zero-shot multi-modal tasks. Conventional fine-tuning methods often improve compositional reasoning at the expense of multi-modal capabilities. This drawback stems from the use of a global hard negative loss, which contrasts the global representations of images and texts: because hard negative texts differ only minimally from the original captions, their global representations are nearly indistinguishable, so pushing the negatives away also pushes away the original texts and distorts the multi-modal representation space. To address this, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which incorporates a local hard negative loss and selective calibrated regularization, designed to provide fine-grained negative supervision while preserving the integrity of the representations. Our extensive evaluation across benchmarks for both compositionality and multi-modal tasks shows that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also maintains multi-modal capabilities.
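To make the contrast between global and local hard negative supervision concrete, the sketch below compares a hinge-style loss over single global embeddings with a token-level variant that scores image patches against text tokens. This is only an illustrative approximation of the idea described in the abstract, not the paper's FSC-CLIP implementation; the tensor shapes, function names such as `local_hard_negative_loss`, the max-over-patches aggregation, and the margin value are all assumptions.

```python
# Minimal PyTorch sketch: global vs. local (token-level) hard negative losses.
# Illustrative only; shapes, names, and the hinge/margin formulation are assumed.
import torch
import torch.nn.functional as F


def global_hard_negative_loss(img_emb, txt_emb, neg_txt_emb, margin=0.2):
    """Contrast one global image embedding against the global embeddings of
    the original caption and a hard negative caption."""
    # img_emb, txt_emb, neg_txt_emb: (B, D), assumed L2-normalized
    pos_sim = (img_emb * txt_emb).sum(dim=-1)      # (B,)
    neg_sim = (img_emb * neg_txt_emb).sum(dim=-1)  # (B,)
    # Hinge: push the hard negative below the positive by a margin.
    return F.relu(margin - pos_sim + neg_sim).mean()


def local_hard_negative_loss(patch_emb, tok_emb, neg_tok_emb, margin=0.2):
    """Fine-grained variant: score image patches against text tokens and
    aggregate, so supervision acts on local alignments rather than on a
    single, potentially ambiguous global vector."""
    # patch_emb: (B, P, D); tok_emb, neg_tok_emb: (B, T, D), L2-normalized
    def score(patches, tokens):
        sim = torch.einsum("bpd,btd->bpt", patches, tokens)  # (B, P, T)
        # For each text token, take its best-matching patch, then average.
        return sim.max(dim=1).values.mean(dim=-1)            # (B,)

    pos = score(patch_emb, tok_emb)
    neg = score(patch_emb, neg_tok_emb)
    return F.relu(margin - pos + neg).mean()
```

Under this sketch, the local loss penalizes the specific tokens that a hard negative changes, rather than repelling an entire caption embedding that may nearly coincide with the original's.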
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: image text matching, cross-modal application, multimodality
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 90