Unlocking Compositional Understanding of Vision-Language Models with Visualization Representation and Analysis

Tong Li; Guodao Sun; Qi Jiang; Xueqian Zheng; Wang Xia; Yunchao Wang; Jingwei Tang; Li Jiang; Haixia Wang; Ronghua Liang

27 Sept 2024 (modified: 19 Nov 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0

Keywords: Vision-Language Models, Compositional Understanding, Visualization Representation and Analysis

TL;DR: This paper introduces an interactive visualization representation and analysis approach from outside the computer vision community. To our knowledge, this is the first exploration of VLMs' compositional understanding through visualization representation.

Abstract: Although vision-language models (VLMs) have made significant advances, debates persist about their ability to understand the combined meaning of visual and linguistic information. Existing research primarily relies on computer vision knowledge and static images to deliver findings and insights into the compositional understanding of VLMs. There is still a limited understanding of how VLMs handle subtle differences between visual and linguistic information. This paper introduces an interactive visualization representation and analysis approach from outside the computer vision community. In this study, we found that CLIP's performance in compositional understanding only slightly exceeds the chance level of 50%. In particular, it relies primarily on entities in the visual and textual modalities, but is limited in recognizing spatial relationships, attribute ownership, and interaction relationships. Additionally, it behaves more like a bag-of-words model and relies on global feature alignment rather than fine-grained alignment, which makes it insensitive to subtle perturbations in text and images.

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 10011
