Keywords: HTR, TTA, OOD
Abstract: Handwritten text recognition (HTR) converts images of handwritten text,
from lines to full pages, into accurate, machine-readable transcriptions. However,
it often operates under distribution shift (new writers, historical substrates,
scanning artifacts, layouts, and even cross-language use), precisely when target
labels and source data are unavailable. Although recent foundation models perform
well on their training distributions, their generalization across domains is
fragile. Limited capacity, inadequate pretraining scale, or corpus–domain
mismatch frequently leads to pronounced errors, underscoring the need for
efficient adaptation even with state-of-the-art pretrained models. We fill this
gap by adapting a foundation model at test time without labels or source data. To
the best of our knowledge, this is the first HTR test-time adaptation approach that
jointly optimizes a lightweight stroke-structure loss with a document-conditioned
language prior, rather than treating linguistic (LM decoding/reranking) and
visual (self-training/normalization) cues separately. Evaluated on four benchmarks
(George Washington, IAM, RIMES, Bentham), our approach achieves an average
absolute reduction of 0.0341 in CER and 0.0427 in WER, corresponding to
mean relative improvements of 20.8% and 12.8%, respectively. These findings
demonstrate that integrating lightweight visual and linguistic priors provides an
effective strategy for test-time adaptation in HTR.
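
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how character error rate (CER), word error rate (WER), and the two kinds of aggregate improvement are conventionally computed. It is not the authors' evaluation code: the function names are hypothetical, and it assumes that the relative improvements are averaged per benchmark, which the abstract does not state.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref, hyp):
    """Word error rate: word edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


def mean_absolute_reduction(before, after):
    """Average of per-benchmark differences (before - after)."""
    return sum(b - a for b, a in zip(before, after)) / len(before)


def mean_relative_improvement(before, after):
    """Average of per-benchmark relative reductions (assumed aggregation)."""
    return sum((b - a) / b for b, a in zip(before, after)) / len(before)


# Hypothetical error rates for two benchmarks, for illustration only.
baseline = [0.10, 0.20]
adapted = [0.05, 0.18]
abs_gain = mean_absolute_reduction(baseline, adapted)
rel_gain = mean_relative_improvement(baseline, adapted)
```

Under this assumed aggregation, the mean relative improvement is not the mean absolute reduction divided by the mean baseline error, because benchmarks with low baseline error contribute large relative gains for small absolute ones. That may be why a smaller absolute CER reduction can pair with a larger relative improvement than WER.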
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 3193