TL;DR: In this paper, we examine both the evaluation metric (VisualGPTScore) and current benchmarks for evaluating the compositionality of generative vision-language models, and propose a strategy to correct for the morphological bias in current benchmarks.
Abstract: With the success of Large Language Models (LLMs), many Generative Vision-Language Models (GVLMs) have been constructed via multimodal instruction tuning. However, the performance of GVLMs in multimodal compositional reasoning remains largely unexplored, as existing evaluation metrics and benchmarks focus predominantly on assessing contrastive models like CLIP. In this paper, we examine both the evaluation metric (VisualGPTScore) and current benchmarks for evaluating the compositionality of GVLMs. We find that the VisualGPTScore is sensitive to sentence syntax rather than visual content, and that the curation methods of current benchmarks therefore introduce severe morphological bias when evaluating with VisualGPTScore. To combat this, we define a MorphoBias Score to quantify the morphological bias and propose a novel LLM-based strategy to calibrate the benchmarks. Moreover, a novel and challenging task is added to evaluate the robustness of GVLMs against their inherent inclination toward syntactic correctness. We combine the calibrated dataset and the task into a new benchmark, namely the MOrphologically De-biased Benchmark (MODE). Our study provides the first unbiased benchmark for the compositionality of GVLMs, facilitating future research in this direction. We will release our code and datasets.
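For readers unfamiliar with the setup, the sketch below illustrates the kind of probe the abstract alludes to: scoring captions with a generative likelihood (in the spirit of VisualGPTScore) and checking whether the positive/negative margin survives when the image is removed. The `GenerativeVLM` interface and the bias-gap computation are illustrative assumptions for this page, not the paper's actual MorphoBias Score definition.

```python
from dataclasses import dataclass
from typing import Protocol, Any


class GenerativeVLM(Protocol):
    # Hypothetical interface: any GVLM that can return log P(caption | image).
    def log_likelihood(self, image: Any, caption: str) -> float: ...


@dataclass
class CaptionPair:
    positive: str  # caption matching the image
    negative: str  # hard-negative caption (e.g., word-order or morphology perturbation)


def visual_gpt_score(model: GenerativeVLM, image: Any, caption: str) -> float:
    # VisualGPTScore-style metric: caption likelihood conditioned on the image.
    return model.log_likelihood(image, caption)


def morpho_bias_gap(model: GenerativeVLM, image: Any, blank_image: Any,
                    pair: CaptionPair) -> float:
    # One plausible probe of morphological bias (an assumption, not the paper's
    # definition): if the positive-vs-negative margin barely changes when the
    # real image is replaced by a blank input, the score is being driven by
    # sentence syntax rather than visual content.
    margin_with_image = (visual_gpt_score(model, image, pair.positive)
                         - visual_gpt_score(model, image, pair.negative))
    margin_without_image = (visual_gpt_score(model, blank_image, pair.positive)
                            - visual_gpt_score(model, blank_image, pair.negative))
    return margin_with_image - margin_without_image
```

Under this reading, a gap near zero would suggest the benchmark item is solvable from language priors alone, which is the bias the calibrated MODE benchmark aims to remove.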
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English