Abstract: This paper presents a novel task: extracting Latin fragments from mixed-language historical documents with varied layouts. We benchmark large foundation models on a multimodal dataset of 753 annotated pages. The results demonstrate that reliable Latin detection with contemporary LLMs is achievable. Our study provides the first comprehensive analysis of these models' capabilities and limitations for this task.
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: historical NLP, benchmarking, datasets for low resource languages, mixed language, multilingual extraction, cross-modal application
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Latin
Submission Number: 4368