Abstract: The extraction and analysis of large numbers of glyphs, and the associated opportunity to construct a corpus of glyphs from fifteenth-century types, offer significant research potential for scholars in book science. Such a corpus could be used in many ways, not least in assisting in the identification of fragments, charting the movements of type, and examining the impact of wear on type. Recognising this potential, we have developed FAnG (Code is available at https://github.com/Werck-der-buecher/FAnG.), a software tool that efficiently extracts and categorises glyphs from historical printed documents. Our approach involves several stages: (1) using Optical Character Recognition to extract glyphs in bulk, (2) employing a joint energy-based model for character classification and out-of-distribution pruning, and (3) providing a comprehensive toolset for manual review and editing, including deletions/reassignments and sorting by similarity. A significant strength of this design is its use of existing text transcriptions and the context-awareness of trained language models, eliminating the need for explicit glyph-location ground truth or glyph templates. By parallelising the extraction, we can quickly process entire digitised books with hundreds of pages, setting our system apart from existing glyph annotation tools. In experiments on digital reproductions of the Catholicon and the 36-line Bible, the method demonstrates good spatial coverage of the detected glyphs, high character classification accuracy, and a low number of outliers. Our system represents a significant advancement in historical document analysis, providing researchers with an efficient tool for glyph extraction and categorisation.
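The abstract does not detail how the joint energy-based model prunes out-of-distribution glyphs. As a minimal, hypothetical sketch of the general technique (not the authors' exact method): joint energy-based models derive an energy score from a classifier's logits via a negative log-sum-exp, and candidates whose energy exceeds a chosen threshold can be flagged as out-of-distribution. The function names and the threshold value below are illustrative assumptions.

```python
import math

def energy_score(logits):
    """Energy E(x) = -logsumexp(logits), computed with the max-shift
    trick for numerical stability. Lower (more negative) energy means
    the classifier is more confident the glyph is in-distribution."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(l - m) for l in logits)))

def keep_mask(batch_logits, threshold):
    """Illustrative pruning rule: keep glyphs whose energy falls below
    the (hypothetical) threshold; the rest are out-of-distribution
    candidates to be pruned or sent for manual review."""
    return [energy_score(lg) < threshold for lg in batch_logits]

# A confident prediction (peaked logits) has low energy and is kept;
# a flat, uncertain prediction has higher energy and is pruned.
confident = [10.0, 0.0, 0.0]
uncertain = [0.0, 0.0, 0.0]
print(keep_mask([confident, uncertain], threshold=-5.0))  # [True, False]
```

The threshold would in practice be tuned on validation data; this sketch only shows the scoring mechanism.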