DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting

Published: 18 Apr 2026, Last Modified: 26 Apr 2026
Venue: ACL 2026 Industry Track Poster
License: CC BY 4.0
Keywords: Document Packet Splitting, Benchmark Dataset, Evaluation Metrics, Multimodal Document Comprehension
TL;DR: DocSplit presents the first comprehensive benchmark for document packet splitting, formalizing the DocPacSplit task and introducing novel evaluation metrics that reveal performance gaps in large multimodal models.
Abstract: Document understanding in real-world applications often requires processing heterogeneous, multi-page document packets containing multiple documents stitched together. Despite recent advances in visual document understanding, the fundamental task of document packet splitting, which involves separating a document packet into individual units, remains largely unaddressed. We present the first comprehensive benchmark dataset, $\textit{DocSplit}$, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models. $\textit{DocSplit}$ comprises five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings. We formalize the $\textit{DocSplit}$ task, which requires models to identify document boundaries, classify document types, and maintain correct page ordering within a document packet. The benchmark addresses real-world challenges, including out-of-order pages, interleaved documents, and documents lacking clear demarcations. We conduct extensive experiments evaluating multimodal LLMs on our datasets, revealing significant performance gaps in current models' ability to handle complex document splitting tasks. The $\textit{DocSplit}$ benchmark datasets and proposed novel evaluation metrics provide a systematic framework for advancing document understanding capabilities essential for legal, financial, healthcare, and other document-intensive domains. We release the datasets and evaluation code to facilitate future research in document packet processing.
Submission Type: Emerging
Copyright Form: pdf
Submission Number: 29