From $\textit{Paginā}$ to Webpage: On Developing and Documenting a Digitized Latin Collection

Published: 28 Apr 2026, Last Modified: 28 Apr 2026
Venue: MSLD 2026 Poster
Readers: Everyone
License: CC BY 4.0
Keywords: Latin, digitization, TEI, OCR, post-correction
TL;DR: This paper describes the creation of three distinct Latin language resources as well as the steps taken to ensure their computational accessibility and high quality.
Abstract: In this work, we present three Zenodo repositories related to the creation of digital editions for Latin texts. The first is the Notre Dame Digitized Latin Collection (ND-DLC), which contains over 550,000 words of Latin in TEI-XML. The second is the Corpus Correctum ($\text{Cor}^{2}$), a dataset offering 3.4 million characters’ worth of data in TSV, PNG, and TXT formats for training optical character recognition (OCR) and post-OCR correction systems. The third is ND-DLC-Tools: a set of Python scripts for reproducing our digitization workflow. Together, these repositories make many Latin texts computationally accessible and provide resources to bolster digitization efforts.
Submission Number: 94