QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation

ACL ARR 2026 January Submission6443 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Arabic OCR; Vision–Language Models; Synthetic Data Augmentation; BLEU Score; Document Layout Analysis
Abstract: The inherent complexities of Arabic script—its cursive nature, diacritical marks (\emph{tashkīl}), and varied typography—pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, achieves the strongest performance, with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically rich texts. Qari-OCR demonstrates robust handling of \emph{tashkīl}, diverse fonts, and document layouts, alongside impressive performance on low-resolution images. Further explorations (QARI v0.3) showcase strong potential for structural document understanding and handwritten text. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.
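For context on the metrics quoted in the abstract, WER and CER are standard normalized edit-distance measures: the Levenshtein distance between hypothesis and reference, divided by the reference length, computed over words or characters respectively. A minimal sketch (function names are illustrative, not from the paper):

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences via single-row dynamic programming."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                              # deletion
                        dp[j - 1] + 1,                          # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))      # substitution
            prev = cur
    return dp[-1]

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    """Word Error Rate: edit distance over whitespace-split tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / max(len(ref), 1)
```

On this scale, the reported CER of 0.061 means roughly one character error per 16 reference characters. Note that different tokenizations (e.g. handling of diacritics as separate characters) can shift these numbers, so comparisons require a shared normalization scheme.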
Paper Type: Long
Research Area: Information Extraction and Retrieval
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, Information Extraction, NLP Applications
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Approaches to low-compute settings / efficiency, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Arabic
Submission Number: 6443