LOCR: Location-Guided Transformer for Optical Character Recognition

ACL ARR 2024 June Submission 3388 Authors

16 Jun 2024 (modified: 18 Jul 2024) · ACL ARR 2024 June Submission · License: CC BY 4.0
Abstract: Academic documents are packed with text, equations, tables, and figures, requiring comprehensive understanding for accurate Optical Character Recognition (OCR). While end-to-end OCR methods offer improved accuracy over layout-based approaches, they often grapple with significant repetition issues, especially with complex layouts in Out-Of-Domain (OOD) documents. To tackle this issue, we propose LOCR\footnote{Source code and datasets will be available under the MIT license upon publication}, a model that integrates location guiding into the transformer architecture during autoregression. We train the model on an original large-scale dataset comprising over 53M text-location pairs from 89K academic document pages, including bounding boxes for words, tables, and mathematical symbols. LOCR adeptly handles various formatting elements and generates content in Markdown language. It outperforms all existing methods on our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure. LOCR also eliminates repetition on the arXiv dataset and reduces repetition frequency in OOD documents, from 13.19\% to 0.04\% for natural science documents and from 8.10\% to 0.11\% for social science documents. Additionally, LOCR features an interactive OCR mode, facilitating the generation of complex documents with only a few location prompts from a human.
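To make the dataset description concrete, the sketch below shows one plausible shape for a text-location pair (token text paired with a page bounding box), as described in the abstract. The class name, field names, and normalization convention are illustrative assumptions, not the authors' actual schema.

```python
# Hypothetical sketch of a single text-location training pair: a token paired
# with its bounding box on the page. Field names and the [0, 1] coordinate
# normalization are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class TextLocationPair:
    text: str                                # token text, e.g. a word or math symbol
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]
    category: str                            # e.g. "word", "table", or "math_symbol"

# Example: a word-level pair, with coordinates normalized by page width/height.
pair = TextLocationPair(text="transformer", bbox=(0.12, 0.30, 0.21, 0.32), category="word")
```

In the interactive OCR mode described above, a human-supplied location prompt would presumably take the same bounding-box form, guiding the decoder toward the intended region of the page.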
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal information extraction, cross-modal content generation, multimodality
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 3388