Data Sources for Automatic Classification and Analysis of Texts from Egyptian Antiquity

University of Eastern Finland DRDHum 2024 Conference Submission 61 Authors

Published: 03 Jun 2024, Last Modified: 16 Aug 2024, DRDHum 2024, CC BY 4.0
Keywords: corpus, egyptian, greek
TL;DR: In this poster, we present the aims and the current state of the research project "Automatic Classification and Analysis of Texts from Egyptian Antiquity", funded by the Kone Foundation.
Abstract: In this poster, we present the aims and the current state of the research project "Automatic Classification and Analysis of Texts from Egyptian Antiquity", funded by the Kone Foundation. In short, the project aims to develop new state-of-the-art language-technology methods for automatically processing textual documents from Egypt dating from the 8th century BCE to the Arab conquest in the 7th century CE. The project investigates the extensive textual evidence from the region as a whole, including texts in both the Greek and the Egyptian languages. A large part of the project is dedicated to collaboration with the various entities that hold the copyright to existing machine-readable texts within the focus of the research. We will identify sources of machine-readable texts pertinent to our study and, if they are not openly available, negotiate with the rights holders for suitable access to the texts for use in the project. We will create a database of all sources where relevant machine-readable text collections are available. The listing will be openly available on the project's website and updated throughout the project's lifespan. We will contact the entities and persons behind the text collections and aim to obtain the data as exports from their systems instead of resorting to methods such as web scraping. We have already identified several sources of texts that are usable by the project. For the texts primarily written in Greek, we use all the transcribed texts available through the Papyri.info project as our data. Currently, the Papyri.info collection contains metadata for over 100,000 texts, of which more than 50,000 are transcribed. In addition to the document data from Papyri.info, we already have access to several thousand inscriptional texts from the Packard Humanities Institute's collection.
The Thesaurus Linguae Aegyptiae (TLA), a digital publication platform, includes machine-readable texts written in Egyptian in the Hieroglyphic, Hieratic, or Demotic scripts. The TLA is the largest ongoing project collecting and publishing machine-readable ancient Egyptian texts, and its collection is continuously growing. We expect Demotic, the latest form of the logographic Egyptian writing, to be the most interesting with regard to language contact, as it was in use while the Greeks ruled Egypt.
Submission Number: 61