Keywords: Multimodal causal reasoning, Infographic understanding, Benchmark and evaluation, Vision–language models
Abstract: Recent advances in Vision-Language Models (VLMs) have shown strong performance in perception and reasoning, yet their ability to perform causal inference—an essential aspect of human cognition—remains underexplored in multimodal settings. We introduce InfoCausalQA, a benchmark for evaluating causal reasoning grounded in infographics that integrate structured visual data with textual context. InfoCausalQA consists of two tasks: Task 1 evaluates quantitative causal reasoning based on inferred numerical trends, while Task 2 targets semantic causal reasoning across five relation types—cause, effect, intervention, counterfactual, and temporal. We collect 494 infographic–text pairs from four public sources and generate 1,482 multiple-choice QA pairs using GPT-4o, followed by systematic human revision to ensure that questions require genuine visual grounding rather than surface-level cues. Experimental results show that current VLMs struggle with both quantitative and semantic causal reasoning, with particularly pronounced limitations in the latter. A human evaluation on 100 Task 2 samples further reveals a substantial gap between humans and models, with humans achieving 77% accuracy. These findings highlight the need to advance causal reasoning capabilities in multimodal AI systems.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, cross-modal information extraction, image–text matching
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 8469