Keywords: Research Paper Comprehension, Multimodal Benchmark, LLM-as-Judge
TL;DR: We present RPC-Bench, a fine-grained multimodal benchmark for academic papers with a taxonomy-driven annotation scheme and multi-dimensional evaluation, showing that current LLMs and VLMs struggle with thorough research paper comprehension.
Abstract: Leveraging large foundation models for document understanding has emerged as a rapidly advancing research area. Unlike general-purpose documents, research papers constitute a particularly challenging domain, characterized by complex figures, detailed tables, and highly specialized scientific knowledge. However, existing benchmarks pay limited attention to evaluating the fine-grained capabilities of current models in comprehending research papers at scale. To address this gap, we propose RPC-Bench, a large-scale, fine-grained question-answering benchmark constructed from review–rebuttal exchanges of high-quality academic papers, with each paper available in two input formats (pure text and rendered page images), enabling evaluation of both large language models (LLMs) and vision-language models (VLMs). We design a fine-grained taxonomy aligned with the research flow of academic papers to guide annotation, and we define an elaborate LLM–human interactive annotation framework to support large-scale labeling and quality control. Following the LLM-as-a-Judge paradigm, we develop a scalable evaluation framework that scores models on correctness-completeness and conciseness, with high agreement with human judgment. Experiments show that GPT-5 leads with a correctness-completeness score of 66.54%, which drops to 35.05% after conciseness adjustment. In addition, VLMs perform better on pure-text inputs than on combined visual–text inputs, highlighting the need for improved visual integration in scholarly document understanding.
Primary Area: datasets and benchmarks
Submission Number: 23397