Keywords: Visually Rich Documents, Multimodal Learning, Multi-Hop Question Answering
Abstract: Despite rapid progress in large language models (LLMs), current QA benchmarks still overlook the core challenge of real-world scientific information seeking: synthesizing multimodal evidence scattered across multiple documents and structural formats. Existing QA benchmarks remain narrow in scope, relying on unimodal text and short-span reasoning that fail to capture the complexity of real information seeking. We introduce DocHop-QA, a benchmark of 11,379 instances for evaluating multimodal, multi-document, multi-hop scientific QA. Built from publicly available PubMed articles, DocHop-QA incorporates textual passages, tables, and layout cues, enabling cross-document inference without explicit hyperlinks. To scale realistic QA construction, we develop an LLM-driven generation pipeline grounded in 11 scientific reasoning concepts, producing diverse and coherent question–answer pairs. To highlight the dataset's utility and versatility, we propose a task-driven evaluation framework spanning four settings, including structured index prediction, multimodal evidence integration, and generative answering. Experiments show that current models struggle with DocHop-QA's long-context, multi-evidence demands, establishing it as a rigorous testbed for advancing next-generation scientific QA systems.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: multimodal applications
Contribution Types: Data resources
Languages Studied: English
Submission Number: 5469