PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies

Published: 26 Jan 2026 · Last Modified: 27 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Large Multimodal Models, scientific document understanding, evaluation benchmark
TL;DR: PRISMM-Bench is the first benchmark of real reviewer-flagged multimodal inconsistencies in scientific papers, revealing that even state-of-the-art LMMs struggle to detect and resolve them.
Abstract: Large Multimodal Models (LMMs) are increasingly applied to scientific research, yet it remains unclear whether they can reliably understand and reason over the multimodal complexity of papers. A central challenge lies in detecting and resolving inconsistencies across text, figures, tables, and equations, issues that are often subtle and domain-specific and that ultimately undermine clarity, reproducibility, and trust. Existing benchmarks overlook this issue, either isolating single modalities or relying on synthetic errors that fail to capture real-world complexity. We introduce PRISMM-Bench (Peer-Review-sourced Inconsistency Set for Multimodal Models), the first benchmark grounded in real reviewer-flagged inconsistencies in scientific papers. Through a multi-stage pipeline of review mining, LLM-assisted filtering, and human verification, we curate 384 inconsistencies from 353 papers. Based on this set, we design three tasks, namely inconsistency identification, remedy, and pair matching, which assess a model's capacity to detect, correct, and reason over inconsistencies across different modalities. Furthermore, to address the notorious problem of choice-only shortcuts in multiple-choice evaluation, where models exploit answer patterns without truly understanding the question, we introduce structured JSON-based answer representations that minimize linguistic biases by reducing reliance on superficial stylistic cues. We benchmark 21 leading LMMs, including large open-weight models (GLM-4.5V 106B, InternVL3 78B) and proprietary models (Gemini 2.5 Pro, GPT-5 with high reasoning). Results reveal strikingly low performance (27.8–53.9%), underscoring the challenge of multimodal scientific reasoning and motivating progress towards trustworthy scientific assistants.
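To make the "structured JSON-based answer representations" concrete, here is a minimal Python sketch of the general idea. It is only an illustration under assumed field names (option, modalities, claim), not the schema actually used in PRISMM-Bench: each multiple-choice option is serialized with a fixed set of keys and a fixed key order, so the choices differ only in content rather than in length or phrasing that a model could exploit as a shortcut.

```python
import json

# Hypothetical contrast (field names are assumptions, not from the paper):
# free-form prose options carry stylistic cues such as length and wording.
prose_options = [
    "A. The learning rate in Table 2 (1e-3) contradicts Section 4.1, which reports 1e-4.",
    "B. Figure 3's legend swaps the baseline and the proposed method.",
]

# Structured options: every choice uses the same keys, so superficial
# linguistic differences between answers are reduced.
json_options = [
    {"option": "A", "modalities": ["table", "text"],
     "claim": "learning rate 1e-3 in Table 2 vs. 1e-4 in Section 4.1"},
    {"option": "B", "modalities": ["figure", "text"],
     "claim": "Figure 3 legend swaps baseline and proposed method"},
]

# Serialize with sorted keys so each choice is presented uniformly.
for opt in json_options:
    print(json.dumps(opt, sort_keys=True))
```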
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 280