Keywords: scientific claim verification, multimodal benchmark, structured reasoning
Abstract: Verifying scientific claims is a cornerstone of research integrity, yet it poses a significant challenge for automated systems, especially when claims involve multimodal evidence (e.g., text, tables, and figures). While large-scale models have shown promise, their underlying reasoning capabilities remain poorly understood. To address this, we introduce SciVerify-Digits, a new diagnostic benchmark designed to probe the structured reasoning and visual grounding abilities of multimodal models in a controlled, scientific context. Our benchmark synthesizes claims about visual data from MNIST, Fashion-MNIST, and SVHN, requiring models to perform tasks such as counting, arithmetic, and logical inference. We evaluate a suite of models, from simple CNN-based architectures to attention-based fusion models and multimodal large language models (MLLMs). Our findings reveal systemic failures across all architectures, particularly in generalization, permutation invariance, and robustness to adversarial claims. By providing a detailed failure analysis, including claim-type breakdowns and attention visualizations, this work establishes a framework for diagnosing critical weaknesses in current models and guiding the development of more reliable systems for real-world scientific verification.
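The abstract describes synthesizing counting and arithmetic claims over digit images but does not spell out the generation procedure. The sketch below is a minimal, hypothetical illustration of how such a sample could be constructed from MNIST (a digit grid paired with a templated claim and a supported/refuted label); the function and parameter names are assumptions, not the authors' actual pipeline.

```python
"""Hedged sketch: one plausible way to synthesize a counting/arithmetic claim
from MNIST digits, in the spirit of the SciVerify-Digits description.
All names here (make_grid_sample, refute_prob) are hypothetical."""
import random
import numpy as np
from torchvision import datasets

def make_grid_sample(mnist, grid=2, refute_prob=0.5, rng=None):
    """Compose a grid x grid image of random MNIST digits and a templated claim.

    Returns (image, claim_text, label) where label is True iff the claim
    is supported by the image.
    """
    rng = rng or random.Random()
    idxs = [rng.randrange(len(mnist)) for _ in range(grid * grid)]
    digits = [int(mnist[i][1]) for i in idxs]
    tiles = [np.array(mnist[i][0]) for i in idxs]            # each tile is 28x28 uint8
    rows = [np.concatenate(tiles[r * grid:(r + 1) * grid], axis=1)
            for r in range(grid)]
    image = np.concatenate(rows, axis=0)                      # shape (28*grid, 28*grid)

    true_sum = sum(digits)
    claimed_sum = true_sum
    if rng.random() < refute_prob:                            # perturb to create an adversarial claim
        claimed_sum += rng.choice([-2, -1, 1, 2])
    claim = f"The digits in this image sum to {claimed_sum}."
    return image, claim, claimed_sum == true_sum

if __name__ == "__main__":
    mnist = datasets.MNIST(root="data", train=True, download=True)
    img, claim, label = make_grid_sample(mnist, grid=2)
    print(img.shape, claim, "SUPPORTED" if label else "REFUTED")
```

Under this assumed setup, permutation invariance and adversarial-claim robustness (mentioned in the abstract) could be probed by shuffling tile positions or by controlling the perturbation applied to the claimed sum.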
Supplementary Material: zip
Submission Number: 202