Towards Better Dissemination and Preservation: End-to-End Chinese Historical Document Digitization

ACL ARR 2026 January Submission241 Authors

22 Dec 2025 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: resources for less-resourced languages, less-resourced languages, endangered languages
Abstract: Historical documents carry a vast amount of Chinese history and culture. A growing body of work digitizes these documents by recognizing the content of books with Optical Character Recognition (OCR) for better preservation and dissemination. However, previous work is impractical for digitization because it focuses on isolated fundamental tasks, such as single-character recognition or line detection, whose outputs are low-level components such as isolated characters rather than readable text, and therefore cannot support practical digitization. To this end, we introduce the first end-to-end benchmark for digitizing Chinese historical documents, targeting well-formatted, human-readable outputs. The task is challenging because of visual variability, such as diverse page layouts, and because it requires deep textual understanding to maintain semantic coherence and consistency. To address these issues, we propose two complementary components: 1) Document Image Augmentation, tailored to simulate visual artifacts and layout diversity; and 2) Correction-Based Post-Editing, which corrects textual errors to enforce semantic coherence. Experiments demonstrate the advantage of our proposed model over cutting-edge baselines, underscoring the need for this new setting and laying a solid foundation for protecting and disseminating these already scarce resources.
Paper Type: Long
Research Area: Multilinguality and Language Diversity
Research Area Keywords: resources for less-resourced languages, less-resourced languages, endangered languages
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: Chinese
Submission Number: 241