Scribes, Scripts, and Scarcity: Re-thinking Benchmarking for Arabic-Script Handwritten Text Recognition in Historical Manuscript Traditions

Published: 14 Dec 2025, Last Modified: 14 Dec 2025
Venue: LM4UC@AAAI2026
License: CC BY 4.0
Keywords: Arabic-script manuscripts, HTR, Benchmarking, Paleography, Low-resource NLP, Underrepresented writing systems
TL;DR: Modern benchmarks misrepresent Arabic-script writing traditions; we propose a historically grounded alternative.
Abstract: Arabic-script manuscript traditions represent vast historical textual worlds that remain difficult to access through contemporary NLP technologies. Although recent advances in handwritten text recognition (HTR) have improved transcription of some Arabic-script materials, widely used benchmarks still rely heavily on modern handwriting, printed text, or small and relatively homogeneous manuscript subsets. These evaluation regimes capture only a narrow slice of the visual, scribal, and linguistic diversity found in historical documents. As a result, models often perform well on benchmark tasks while generalizing poorly to real archival settings, particularly in under-resourced institutions and communities. This paper argues for historically informed, materially grounded approaches to evaluating Arabic-script HTR. Drawing on examples from Ottoman, Persian, and Arabic manuscript cultures, it diagnoses common abstraction patterns in current benchmarks and proposes a four-part taxonomy (scribal variation, material degradation, layout and paratext, and linguistic–morphological complexity) to guide future evaluation design. It then outlines guiding principles for benchmarks that more faithfully represent historical manuscript conditions. The goal of the paper is to support the development of HTR evaluation frameworks that are both culturally sensitive and better aligned with the needs of those working with underserved manuscript traditions.
Submission Number: 23