Keywords: Visually Rich Documents, Multimodal Learning, Multi-Hop Question Answering
Abstract: Despite rapid progress in large language models (LLMs), current QA benchmarks still overlook the core challenge of real-world scientific information seeking: synthesizing multimodal evidence scattered across multiple documents and structural formats. Existing QA benchmarks remain narrow in scope, relying on unimodal text and short-span reasoning that fail to capture the complexity of real information seeking. We introduce DocHop-QA, a benchmark of 11,379 instances for evaluating multimodal, multi-document, multi-hop scientific QA. Built from publicly available PubMed articles, DocHop-QA incorporates textual passages, tables, and layout cues, enabling cross-document inference without explicit hyperlinks. To scale realistic QA construction, we develop an LLM-driven generation pipeline grounded in 11 scientific reasoning concepts, producing diverse and coherent question–answer pairs. To highlight the dataset's utility and versatility, we propose a task-driven evaluation framework spanning four settings, including structured index prediction, multimodal evidence integration, and generative answering. Experiments show that current models struggle with DocHop-QA's long-context, multi-evidence demands, establishing it as a rigorous testbed for advancing next-generation scientific QA systems.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: multimodal applications
Contribution Types: Data resources
Languages Studied: English
Submission Number: 5469