DocGenome: A Large Benchmark for Multi-Modal Language Models in Real-World Academic Document Understanding

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Scientific document structuring, Document understanding, Chart Table and Equation Understanding
TL;DR: We construct DocGenome, a structured document dataset of 500K annotated scientific documents from 153 disciplines. We show that the performance of our model, trained on DocGenome, surpasses that of closed-source commercial tools.
Abstract: Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Leveraging the multi-modal data extracted from these documents and assessing large models' ability to handle scientific document-oriented tasks is therefore valuable. Despite promising advancements, large models still perform poorly on multi-page scientific document extraction and understanding tasks, and their capacity to process within-document data formats such as charts and equations remains under-explored. To address these issues, we present DocGenome, a structured document dataset constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community with our custom auto-labeling pipeline. DocGenome has four key characteristics: 1) Completeness: it is the first dataset to structure data from all modalities, covering 13 layout attributes along with their LaTeX source code. 2) Logicality: it provides 6 logical relationships between different entities within each scientific document. 3) Diversity: it covers a wide range of document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, and open-ended single-page and multi-page QA. 4) Correctness: it undergoes rigorous quality-control checks conducted by a specialized team. We conduct extensive experiments to demonstrate the advantages of DocGenome and to objectively evaluate the performance of current large models on our benchmark.
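To make the abstract's description concrete (layout entities with attributes and LaTeX source, plus logical relations between entities), the sketch below shows one possible in-memory representation of such a structured record. All field names, attribute labels, and relation types here are hypothetical placeholders for illustration only; they are not DocGenome's actual schema.

```python
# Illustrative sketch only: a hypothetical record for a structured scientific
# document with layout entities, aligned LaTeX source, and logical relations.
# Field and label names are assumptions, not DocGenome's released format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class LayoutEntity:
    entity_id: str
    attribute: str          # e.g. one of the 13 layout attributes ("Title", "Table", "Equation", ...)
    bbox: List[float]       # [x0, y0, x1, y1] on the rendered page (assumed convention)
    page: int
    latex: str = ""         # LaTeX source aligned with this entity, if any

@dataclass
class Relation:
    source_id: str
    target_id: str
    relation: str           # e.g. one of the 6 logical relations ("caption-of", "refer-to", ...)

@dataclass
class DocumentRecord:
    arxiv_id: str
    discipline: str
    entities: List[LayoutEntity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

# Minimal usage example with made-up values.
doc = DocumentRecord(
    arxiv_id="0000.00000",
    discipline="cs.CL",
    entities=[
        LayoutEntity("e1", "Equation", [0.1, 0.40, 0.9, 0.50], page=2, latex=r"E = mc^2"),
        LayoutEntity("e2", "Text", [0.1, 0.50, 0.9, 0.70], page=2),
    ],
    relations=[Relation("e2", "e1", "refer-to")],
)
print(len(doc.entities), "entities,", len(doc.relations), "relation")
```

Tasks such as visual grounding or layout detection would then operate over the `bbox` and `attribute` fields, while document transformation and QA tasks would draw on the aligned `latex` text; again, this mapping is an assumption based on the task list above.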
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6842