TokenOCR: An Attention Based Foundational Model for Intelligent Optical Character Recognition

Charith Gunasekara, Zachary Hamel, Feng Du, Connor Baillie

Published: 23 Feb 2025, Last Modified: 27 Jan 2026, ICPRAM, Readers: Everyone, License: CC BY 4.0

Abstract: Optical Character Recognition (OCR) plays a pivotal role in digitizing and analyzing text from physical documents. Despite advancements in OCR technologies, challenges persist in handling diverse text layouts, poor-quality images, and complex fonts. In this paper, we present TokenOCR, an attention-based foundational model designed to overcome these limitations by integrating convolutional neural networks and transformer-based architectures. Unlike traditional OCR models that predict individual characters, TokenOCR predicts tokens, significantly enhancing recognition accuracy and efficiency. The model employs a ResNet50 feature extractor, an encoder with adaptive 2D positional embeddings, and a decoder utilizing multi-headed attention mechanisms for robust text recognition. To train TokenOCR, we used a dataset of six million images incorporating various real-world applications. Our training strategy integrates curriculum learning and adaptive learning rate scheduling to ensure efficient model convergence and generalization. Comprehensive evaluations using Word Error Rate (WER) and Character Error Rate (CER) demonstrate that TokenOCR consistently outperforms state-of-the-art models, including Tesseract and TrOCR, in both clean and degraded image conditions. These findings underscore TokenOCR’s potential to set new standards in OCR technology, offering a scalable, efficient, and highly accurate solution for diverse text recognition applications.
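
The abstract reports results in terms of WER and CER but does not say how they are computed. As a minimal sketch, assuming the conventional definition (Levenshtein edit distance divided by the length of the reference, over characters for CER and over whitespace-separated words for WER), the two metrics could be computed as below. The function names `edit_distance`, `cer`, and `wer` are illustrative and not taken from the paper, and the authors' evaluation may apply additional text normalization (case folding, punctuation handling) that is not shown here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h (free if equal)
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    truth = "the quick brown fox"
    ocr_output = "the quick brown box"
    print(f"CER = {cer(truth, ocr_output):.4f}")
    print(f"WER = {wer(truth, ocr_output):.4f}")
```

Because both rates are normalized by the reference length, either value can exceed 1.0 when the recognized text contains many insertions. A single wrong character in a four-word line costs one full word in WER but only one character in CER, which is why the two metrics are usually reported together.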