Keywords: representation learning, clustering, supervised classification, diachronic analysis, handwritten character recognition, paleography, cultural heritage data
TL;DR: Domain-aware contrastive learning yields robust embeddings for evolving, low-resource handwriting data.
Abstract: Learning representations that remain robust across centuries of handwriting variation is a key challenge in the diachronic study of ancient Greek manuscripts. We introduce three datasets of ancient Greek handwriting for diachronic representation learning: Hell-Char, a curated training set spanning the 3rd–1st centuries BCE, and two evaluation sets, PaLit-Char (1st–5th c. CE) and Med-Char (9th–14th c. CE). To address the challenges of symbolic variation, scarce data, and systematic degradation, we propose two methodological innovations: a similarity-weighted supervised contrastive loss that biases embeddings according to human-perceived confusability between characters, and a lacuna-driven augmentation scheme that simulates realistic manuscript corruptions. Trained with these strategies, both a lightweight CNN and a pretrained ResNet achieve strong recognition performance and produce embeddings that separate character classes more coherently than PCA or generic pretrained models. These embeddings enable clustering, identification of stylistic subgroups, and construction of prototype images that visualize diachronic evolution and transitional letterforms. Our results demonstrate that incorporating expert priors and domain-specific corruptions yields robust, interpretable representations, offering a transferable paradigm for representation learning under scarce, temporally evolving, and noisy conditions.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 18890
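The abstract names two techniques without giving their form, so the following is only an illustrative sketch of how they might look, not the authors' formulation. It assumes the confusability prior is a class-by-class matrix with values in [0, 1], that it enters the supervised contrastive loss by up-weighting negatives from confusable classes in the denominator, and that a lacuna can be approximated by blanking a random rectangle. The names `weighted_supcon_loss`, `add_lacuna`, `alpha`, `max_frac`, and `fill` are all hypothetical. Pure Python is used to keep the sketch dependency-free; a real implementation would use a tensor library.

```python
import math
import random


def _dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weighted_supcon_loss(embeddings, labels, confusability,
                         temperature=0.1, alpha=1.0):
    """Supervised contrastive loss with confusability-weighted negatives.

    ASSUMPTION: negatives of class b are weighted by
    1 + alpha * confusability[a][b] for an anchor of class a, so that
    easily confused classes are pushed apart harder. With alpha = 0 this
    reduces to the standard supervised contrastive loss.
    `embeddings` are expected to be L2-normalised vectors.
    """
    n = len(embeddings)
    total, n_anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        denom = 0.0
        for k in range(n):
            if k == i:
                continue
            if labels[k] == labels[i]:
                weight = 1.0
            else:
                weight = 1.0 + alpha * confusability[labels[i]][labels[k]]
            logit = _dot(embeddings[i], embeddings[k]) / temperature
            denom += weight * math.exp(logit)
        log_denom = math.log(denom)
        # Average negative log-probability over this anchor's positives.
        anchor_loss = -sum(
            _dot(embeddings[i], embeddings[j]) / temperature - log_denom
            for j in positives
        ) / len(positives)
        total += anchor_loss
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


def add_lacuna(image, rng, max_frac=0.3, fill=1.0):
    """Return a copy of `image` (list of rows) with one blanked rectangle.

    ASSUMPTION: a rectangular patch stands in for a lacuna; real
    manuscript damage is irregular and would need a richer mask model.
    """
    height, width = len(image), len(image[0])
    patch_h = rng.randint(1, max(1, int(height * max_frac)))
    patch_w = rng.randint(1, max(1, int(width * max_frac)))
    top = rng.randint(0, height - patch_h)
    left = rng.randint(0, width - patch_w)
    out = [row[:] for row in image]  # leave the input untouched
    for r in range(top, top + patch_h):
        for c in range(left, left + patch_w):
            out[r][c] = fill
    return out
```

Under these assumptions, calling `weighted_supcon_loss` with a non-zero confusability matrix gives a larger penalty than the unweighted loss whenever confusable classes are present in the batch, and `add_lacuna(image, random.Random(0))` can be applied per sample before the image is passed to the encoder.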