Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

ACL ARR 2024 June Submission 90 Authors

05 Jun 2024 (modified: 02 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: In this paper, we propose a new method to enhance compositional understanding in pre-trained vision and language models (VLMs) without sacrificing performance on the models' original zero-shot multi-modal tasks. Conventional fine-tuning methods often improve compositional reasoning at the expense of multi-modal capabilities. This drawback stems from the use of a global hard negative loss, which contrasts the global representations of images and texts: because hard negative texts differ only minimally from the original captions, their global representations are nearly indistinguishable, so pushing the negatives away also pushes away the original texts and distorts the multi-modal representation space. To address this, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which incorporates a local hard negative loss and selective calibrated regularization, designed to provide fine-grained negative supervision while preserving the integrity of the representations. Our extensive evaluation across benchmarks for both compositionality and multi-modal tasks shows that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also maintains multi-modal capabilities.
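To make the contrast between global and local hard negative supervision concrete, the sketch below compares a hinge-style loss over single global embeddings with a token-level variant that scores image patches against text tokens. This is only an illustrative approximation of the idea described in the abstract, not the paper's FSC-CLIP implementation; the tensor shapes, function names such as `local_hard_negative_loss`, the max-over-patches aggregation, and the margin value are all assumptions.

```python
# Minimal PyTorch sketch: global vs. local (token-level) hard negative losses.
# Illustrative only; shapes, names, and the hinge/margin formulation are assumed.
import torch
import torch.nn.functional as F


def global_hard_negative_loss(img_emb, txt_emb, neg_txt_emb, margin=0.2):
    """Contrast one global image embedding against the global embeddings of
    the original caption and a hard negative caption."""
    # img_emb, txt_emb, neg_txt_emb: (B, D), assumed L2-normalized
    pos_sim = (img_emb * txt_emb).sum(dim=-1)      # (B,)
    neg_sim = (img_emb * neg_txt_emb).sum(dim=-1)  # (B,)
    # Hinge: push the hard negative below the positive by a margin.
    return F.relu(margin - pos_sim + neg_sim).mean()


def local_hard_negative_loss(patch_emb, tok_emb, neg_tok_emb, margin=0.2):
    """Fine-grained variant: score image patches against text tokens and
    aggregate, so supervision acts on local alignments rather than on a
    single, potentially ambiguous global vector."""
    # patch_emb: (B, P, D); tok_emb, neg_tok_emb: (B, T, D), L2-normalized
    def score(patches, tokens):
        sim = torch.einsum("bpd,btd->bpt", patches, tokens)  # (B, P, T)
        # For each text token, take its best-matching patch, then average.
        return sim.max(dim=1).values.mean(dim=-1)            # (B,)

    pos = score(patch_emb, tok_emb)
    neg = score(patch_emb, neg_tok_emb)
    return F.relu(margin - pos + neg).mean()
```

Under this sketch, the local loss penalizes the specific tokens that a hard negative changes, rather than repelling an entire caption embedding that may nearly coincide with the original's.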
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: image text matching, cross-modal application, multimodality
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 90