ArabiDoc: A Holistic Arabic-English Evaluation Suite for End-to-End Document Processing

ICLR 2026 Conference Submission 21348 Authors

19 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: end-to-end document parsing, bilingual benchmark
Abstract: Document intelligence sits at the intersection of computer vision and natural language processing, where the goal is to transform complex real-world documents into structured, machine-readable representations. Despite recent progress, benchmarks for low-resource languages such as Arabic remain limited, typically emphasizing individual components such as text, tables, or charts rather than providing a comprehensive evaluation of full-document parsing. To address this gap, we present a new bilingual (Arabic-English) benchmark that brings together diverse document elements within a single evaluation framework for end-to-end document parsing. Our benchmark makes three main contributions. First, it preserves reading-order information, allowing models to better capture the natural flow of documents. Second, it supports visual content parsing, encompassing not only text blocks but also tables, charts, and figures, thereby reflecting the full range of document structures. Third, it introduces relaxed evaluation metrics that assess model performance more fairly by tolerating minor deviations in reading order and localized errors in table and chart parsing, so that the evaluation reflects practical usability rather than strict exactness. Constructed through a two-step annotation process (layout segmentation followed by object-level labeling), our dataset comprises 137 pages that have been carefully segmented and verified by human annotators. By unifying previously separate evaluation tracks, this benchmark establishes the first comprehensive standard for structured document parsing in Arabic and provides a more realistic basis for bilingual evaluation with English. We expect this resource to foster progress in multimodal reasoning, enable stronger baselines, and support the development of vision-language models that generalize robustly across languages and document types.
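To make the relaxed reading-order idea concrete, the following minimal Python sketch scores a predicted block sequence against the gold sequence while forgiving small local displacements. The function name, the positional-window tolerance, and the example block labels are illustrative assumptions for this sketch, not the benchmark's actual metric.

def relaxed_reading_order_score(pred_ids, gold_ids, window=1):
    """Score a predicted reading order against the gold order.

    A predicted block counts as correctly placed if its position
    differs from its gold position by at most `window`; blocks
    absent from the gold order count as errors. Illustrative
    sketch only, not the paper's metric.
    """
    if not gold_ids:
        return 1.0 if not pred_ids else 0.0
    gold_pos = {block: i for i, block in enumerate(gold_ids)}
    correct = sum(
        1 for i, block in enumerate(pred_ids)
        if block in gold_pos and abs(gold_pos[block] - i) <= window
    )
    return correct / max(len(gold_ids), len(pred_ids))

# A local swap of two adjacent blocks stays within the tolerance:
gold = ["title", "para1", "fig1", "para2", "table1"]
pred = ["title", "fig1", "para1", "para2", "table1"]
print(relaxed_reading_order_score(pred, gold))  # 1.0 (strict positional match would give 0.6)

Under this windowed scheme, a swap of two adjacent blocks incurs no penalty, whereas a block moved far from its gold position still counts as an error, which captures the abstract's distinction between practical usability and strict exactness.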
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 21348