From $\textit{Pagin\={a}}$ to Webpage: On Developing and Documenting a Digitized Latin Collection

Stephen Bothwell; Kaitlin Stephan; Hildegund Muller; David Chiang

Published: 28 Apr 2026, Last Modified: 28 Apr 2026, Venue: MSLD 2026 Poster, Readers: Everyone, License: CC BY 4.0

Keywords: Latin, digitization, TEI, OCR, post-correction

TL;DR: This paper describes the creation of three distinct Latin language resources as well as the steps taken to ensure their computational accessibility and high quality.

Abstract: In this work, we present three Zenodo repositories related to the creation of digital editions for Latin texts. The first is the Notre Dame Digitized Latin Collection (ND-DLC), which contains over 550,000 words of Latin in TEI-XML. The second is the Corpus Correctum ($\text{Cor}^{2}$), a dataset offering 3.4 million characters’ worth of data in TSV, PNG, and TXT formats for training optical character recognition (OCR) and post-OCR correction systems. The third is ND-DLC-Tools: a set of Python scripts for reproducing our digitization workflow. Together, these repositories make many Latin texts computationally accessible and provide resources to bolster digitization efforts.
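To illustrate what "computationally accessible" can mean for a TEI-XML edition, the following is a minimal sketch of counting the words in a TEI document using only the Python standard library. It is not taken from ND-DLC-Tools: the function name `count_words`, the inline `SAMPLE` document, and the choice to count whitespace-delimited tokens inside the `<text>` element are all assumptions made for illustration. Only the TEI namespace URI is standard.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; TEI documents declare it on the root element.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def count_words(tei_xml: str) -> int:
    """Count whitespace-delimited tokens inside the TEI <text> element.

    Hypothetical helper: the header (<teiHeader>) is skipped so that
    metadata such as titles does not inflate the count.
    """
    root = ET.fromstring(tei_xml)
    text_el = root.find(".//tei:text", TEI_NS)
    if text_el is None:
        return 0
    # itertext() yields the text of the element and all of its descendants.
    return len("".join(text_el.itertext()).split())


# Made-up sample document, not drawn from the ND-DLC.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>Gallia est omnis divisa in partes tres.</p></body></text>
</TEI>"""

print(count_words(SAMPLE))
```

A real edition would likely need more careful tokenization (for example, handling editorial notes, abbreviations, or line-break hyphenation), so the count from a sketch like this should be treated as approximate.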

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 94
