HTR for Russian Empire Period Manuscripts: A Two-Stage Framework with New Annotated Resources

ICLR 2026 Conference Submission 25555 Authors

20 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Handwritten Text Recognition, Low-Resource Languages, Historical Documents
TL;DR: First general HTR for pre-reform Russian handwriting (pre-1918): a two-stage YOLOv8 line segmenter + TrOCR recognizer that outperforms general-purpose HTR baselines on Imperial-era manuscripts.
Abstract: Historical handwritten documents are a valuable source of information about the language, culture, and society of earlier periods. As scholarship becomes more global, automatic handwriting recognition tools for a wide range of languages are increasingly important for making the cultural heritage of different nations broadly accessible. Pre-revolutionary Russian is particularly challenging for such systems because its orthography differs significantly from that of the modern language. This work introduces a general-purpose tool for recognizing handwritten documents written in pre-revolutionary Russian orthography and dated from the $19^{\mathrm{th}}$ century to the early $20^{\mathrm{th}}$ century. We present a two-stage handwritten text recognition (HTR) system that combines YOLOv8-based line segmentation with TrOCR$_{pre}$, a transformer architecture pre-trained on Russian-language data. The system is trained and evaluated on a manually annotated corpus of $38,501$ lines across three document types: Gubernatorial Reports ($31,083$ lines), Statutory Charters ($5,868$ lines), and Personal Diaries ($1,550$ lines), split into training, validation, and test sets. Our approach achieves a character error rate (CER) of $8.5$% and a word error rate (WER) of $29.1$% overall, with performance varying by document type, from $4.8$% CER on formal administrative documents to $19.0$% CER on informal personal writings. The transformer-based architecture achieves a $53.8$% relative reduction in CER compared with traditional CNN-RNN baselines (from $18.4$% to $8.5$%), providing a practical tool for large-scale digitization of historical Russian archives.
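For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal sketch of how CER, WER, and the relative error reduction are conventionally computed. It is not the authors' evaluation code: the function names (`edit_distance`, `cer`, `wer`, `relative_reduction`) are hypothetical, the sample strings are illustrative and not taken from the corpus, and the sketch assumes the standard definitions (Levenshtein distance divided by reference length). Only the line counts and the two CER figures come from the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def relative_reduction(baseline, new):
    """Relative error reduction of `new` with respect to `baseline`."""
    return (baseline - new) / baseline


# Line counts per document type, as reported in the abstract.
line_counts = {
    "Gubernatorial Reports": 31083,
    "Statutory Charters": 5868,
    "Personal Diaries": 1550,
}
total_lines = sum(line_counts.values())

# Illustrative strings only (assumption): a pre-reform spelling as reference
# and a modernized spelling as a hypothetical recognizer output.
reference = "въ городѣ миръ"
hypothesis = "в городе миръ"

sample_cer = cer(reference, hypothesis)
sample_wer = wer(reference, hypothesis)

# CER figures from the abstract: CNN-RNN baseline 18.4%, proposed system 8.5%.
cer_reduction = relative_reduction(18.4, 8.5)
```

With the abstract's figures, `relative_reduction(18.4, 8.5)` is about 0.538, which matches the reported 53.8%, and the three per-type line counts sum to the stated 38,501. The sample pair also shows why WER is typically much higher than CER: two single-character errors affect two of the three words.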
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 25555