SAND: A Large-Scale Synthetic Arabic OCR Corpus for Vision-Language Models

ACL ARR 2025 February Submission3787 Authors

15 Feb 2025 (modified: 09 May 2025)
License: CC BY 4.0
Abstract: Arabic Optical Character Recognition (OCR) plays a vital role in digitizing Arabic text, yet existing datasets are often limited in scale, diversity, and structured formatting. Most available datasets focus on either printed or handwritten text and lack the scalability and controlled variation needed to train robust OCR models on book-style documents. To address this gap, we introduce SAND (Large-Scale Synthetic Arabic OCR Dataset), a synthetically generated Arabic OCR dataset designed to reflect real-world book formatting. SAND comprises 743,000 document images containing 662.15 million words, spanning five distinct Arabic fonts to enhance typographic diversity and generalization. Unlike traditional datasets, SAND offers a scalable and structured resource for training OCR and vision-language models. Its synthetic nature ensures controlled variation in fonts, formatting, and text structures while eliminating common artifacts such as noise, blur, and distortions. This enhances OCR model generalization across diverse Arabic text styles.
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: Corpus creation, Benchmarking, Language resources, Evaluation methodologies
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: Arabic
Submission Number: 3787