AMVICC: A Novel Benchmark for Cross-Modal Failure Mode Profiling of VLMs and IGMs

Published: 12 Nov 2025, Last Modified: 22 Nov 2025 · VLM4RWD 2025 · CC BY 4.0
Track: Regular papers (within 8 pages excluding appendix)
Keywords: machine learning, image generation models, MLLMs, failure modes, benchmark, cross-modal, classification
TL;DR: We create a novel benchmark to evaluate and profile the cross-modal failures of various image generation models and multimodal LLMs.
Abstract: We investigate the visual reasoning limitations of both multimodal large language models (MLLMs) and image generation models (IGMs) by creating a novel benchmark that systematically compares failure modes across image-to-text and text-to-image tasks, enabling cross-modal evaluation of visual understanding. Despite rapid progress in machine learning, vision-language models (VLMs) still fail to understand or generate basic visual concepts such as object orientation, quantity, and spatial relationships, highlighting gaps in elementary visual reasoning. By adapting MMVP benchmark questions into explicit and implicit prompts, we create AMVICC, a novel benchmark for profiling failure modes across modalities. After testing 11 MLLMs and 3 IGMs across nine visual reasoning categories, our results show that failure modes are often shared across models and modalities, while certain failures are model- or modality-specific and can potentially be attributed to a range of factors. IGMs consistently struggled to manipulate specific visual components in response to prompts, especially explicit ones, suggesting poor control over fine-grained visual attributes. Our findings apply most directly to the evaluation of existing state-of-the-art models on structured visual reasoning tasks. This work lays a foundation for future cross-modal alignment studies, offering a framework to probe whether generation and interpretation failures stem from shared limitations, thereby guiding future improvements in unified vision-language modeling.
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 8