Abstract: This paper presents a novel task: extracting Latin fragments from mixed-language historical documents with varied layouts. We benchmark large foundation models on a multimodal dataset of 753 annotated pages. The results demonstrate that reliable Latin detection with contemporary LLMs is achievable. Our study provides the first comprehensive analysis of these models' capabilities and limitations for this task.
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: historical NLP, benchmarking, datasets for low resource languages, mixed language, multilingual extraction, cross-modal application
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Latin
Submission Number: 4368