Abstract: The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling more sophisticated and accurate integration of visual and textual information across various tasks, including image and video captioning, visual question answering, and cross-modal retrieval. Despite VLMs' superior capabilities, researchers lack a comprehensive understanding of their compositionality – the ability to understand and produce novel combinations of known visual and textual components. Prior benchmarks provide only a relatively coarse evaluation of compositionality from the perspectives of objects, relations, and attributes, while neglecting deeper reasoning about object interactions, counting, and complex compositions. However, compositionality is a critical ability that facilitates coherent reasoning and understanding across modalities for VLMs. To address this limitation, we propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively and accurately evaluating VLMs' compositionality. With MMCOMPOSITION, we can quantify and explore the compositionality of mainstream VLMs. Surprisingly, we find GPT-4o's compositionality inferior to that of the best open-source model, and we analyze the underlying reasons. Our experimental analysis reveals the limitations of VLMs in fine-grained compositional perception and reasoning, and points to areas for improvement in VLM design and training.
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=3E4tatLM3T
Changes Since Last Submission: Fixed template
Assigned Action Editor: ~Candace_Ross1
Submission Number: 7152