Abstract: Optical Character Recognition (OCR) plays a pivotal role in digitizing and analyzing text from physical documents. Despite advancements in OCR technologies, challenges persist in handling diverse text layouts, poorquality images, and complex fonts. In this paper, we present TokenOCR, an attention-based foundational
model designed to overcome these limitations by integrating convolutional neural networks and transformerbased architectures. Unlike traditional OCR models that predict individual characters, TokenOCR predicts
tokens, significantly enhancing recognition accuracy and efficiency. The model employs a ResNet50 feature
extractor, an encoder with adaptive 2D positional embeddings, and a decoder utilizing multi-headed attention
mechanisms for robust text recognition. To train TokenOCR, we used a dataset of six million images incorporating various real-world applications. Our training strategy integrates curriculum learning and adaptive
learning rate scheduling to ensure efficient model convergence and generalization. Comprehensive evaluations using Word Error Rate (WER) and Character Error Rate (CER) demonstrate that TokenOCR consistently
outperforms state-of-the-art models, including Tesseract and TrOCR, in both clean and degraded image conditions. These findings underscore TokenOCR’s potential to set new standards in OCR technology, offering a
scalable, efficient, and highly accurate solution for diverse text recognition applications.
Loading