Keywords: scientific claim verification, multimodal benchmark, structured reasoning
Abstract: Verifying scientific claims is a cornerstone of research integrity, yet it poses a significant challenge for automated systems, especially when claims involve multimodal evidence (e.g., text, tables, and figures). While large-scale models have shown promise, their underlying reasoning capabilities remain poorly understood. To address this, we introduce SciVerify-Digits, a new diagnostic benchmark designed to probe the structured reasoning and visual grounding abilities of multimodal models in a controlled, scientific context. Our benchmark synthesizes claims about visual data from MNIST, Fashion-MNIST, and SVHN, requiring models to perform tasks such as counting, arithmetic, and logical inference. We evaluate a suite of models, from simple CNN-based architectures to attention-based fusion models and multimodal large language models (MLLMs). Our findings reveal systemic failures across all architectures, particularly in generalization, permutation invariance, and robustness to adversarial claims. By providing a detailed failure analysis, including claim-type breakdowns and attention visualizations, this work establishes a framework for diagnosing critical weaknesses in current models and guiding the development of more reliable systems for real-world scientific verification.
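The abstract describes synthesizing counting and arithmetic claims over digit images but does not spell out the generation procedure. The sketch below is a minimal, hypothetical illustration of how such a sample could be constructed from MNIST (a digit grid paired with a templated claim and a supported/refuted label); the function and parameter names are assumptions, not the authors' actual pipeline.

```python
"""Hedged sketch: one plausible way to synthesize a counting/arithmetic claim
from MNIST digits, in the spirit of the SciVerify-Digits description.
All names here (make_grid_sample, refute_prob) are hypothetical."""
import random
import numpy as np
from torchvision import datasets

def make_grid_sample(mnist, grid=2, refute_prob=0.5, rng=None):
    """Compose a grid x grid image of random MNIST digits and a templated claim.

    Returns (image, claim_text, label) where label is True iff the claim
    is supported by the image.
    """
    rng = rng or random.Random()
    idxs = [rng.randrange(len(mnist)) for _ in range(grid * grid)]
    digits = [int(mnist[i][1]) for i in idxs]
    tiles = [np.array(mnist[i][0]) for i in idxs]            # each tile is 28x28 uint8
    rows = [np.concatenate(tiles[r * grid:(r + 1) * grid], axis=1)
            for r in range(grid)]
    image = np.concatenate(rows, axis=0)                      # shape (28*grid, 28*grid)

    true_sum = sum(digits)
    claimed_sum = true_sum
    if rng.random() < refute_prob:                            # perturb to create an adversarial claim
        claimed_sum += rng.choice([-2, -1, 1, 2])
    claim = f"The digits in this image sum to {claimed_sum}."
    return image, claim, claimed_sum == true_sum

if __name__ == "__main__":
    mnist = datasets.MNIST(root="data", train=True, download=True)
    img, claim, label = make_grid_sample(mnist, grid=2)
    print(img.shape, claim, "SUPPORTED" if label else "REFUTED")
```

Under this assumed setup, permutation invariance and adversarial-claim robustness (mentioned in the abstract) could be probed by shuffling tile positions or by controlling the perturbation applied to the claimed sum.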
Supplementary Material: zip
Submission Number: 202