Data Augmentation for Historical NER: A Systematic Comparison of Lexical and LLM-based Approaches

19 Mar 2026 (modified: 19 May 2026) | SwissText 2026 Conference Submission | Readers: Everyone | CC BY 4.0
Track: Scientific Track
Keywords: Named Entity Recognition, Data Augmentation, Historical Documents, Mention Replacement, Silver Data, LLM Annotation
Abstract: Named Entity Recognition (NER) on historical materials suffers significant performance degradation compared with modern text, owing to optical character recognition (OCR) errors, language evolution, and scarce annotated training data. Although various remedies have been explored to increase robustness and generalization, data augmentation techniques, despite their proven effectiveness on modern NER benchmarks, remain largely unexplored in the historical setting. This article investigates data augmentation strategies for historical NER through a systematic comparison of two complementary approaches: intrinsic augmentation via mention replacement and extrinsic augmentation through large language model (LLM)-based corpus annotation. We experiment with different augmentation variants and corpus sizes on Swiss historical newspapers in French and German. Our results show contrasting patterns: mention replacement yields stable improvements across settings, whereas LLM-based silver data is most useful at moderate scale and when quality-filtered, but its effectiveness degrades as additional pseudo-labeled data is introduced. Overall, simple lexical augmentation emerges as the more robust strategy for historical NER, while LLM-based approaches remain sensitive to annotation noise and data shift.
Submission Number: 39