Judge a Book by its Cover: Investigating Multi-Modal LLMs for Multi-Page Handwritten Document Transcription
Keywords: Large language models, document processing, handwriting transcription
TL;DR: An investigation into the use of multi-modal large language models alongside OCR engines for transcribing multi-page handwritten documents in a zero-shot setting
Abstract: Handwriting text recognition (HTR) remains a challenging task. Existing approaches either require fine-tuning on labeled data, which is impractical to obtain for real-world problems, or rely on zero-shot tools such as OCR engines and multi-modal LLMs (MLLMs). MLLMs have shown promise both as end-to-end transcribers and as OCR post-processors, but to date there is little empirical research evaluating different MLLM prompting strategies for HTR, particularly for *multi-page documents*. Most handwritten documents are multi-page and share context, such as semantic content and handwriting style, across pages, yet MLLMs are typically applied to transcription at the page level, discarding this shared context. They are also typically used either as text-only post-processors or as image-only OCR alternatives, rather than leveraging both modalities.
This paper investigates a suite of methods combining OCR, LLM post-processing, and MLLM end-to-end transcription for the task of zero-shot multi-page handwritten document transcription.
We introduce a benchmark for this task, built from existing single-page datasets together with a new dataset, `Malvern-Hills`. Finally, we introduce **OCR+PAGE-1** and **OCR+PAGE-N**, prompting strategies for multi-page transcription that outperform existing methods by sharing content across pages while minimizing prompt complexity.
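To make the idea of sharing content across pages concrete, below is a minimal, hypothetical sketch of a sequential prompting loop in the spirit of the OCR+PAGE-N strategy: each page's prompt combines the noisy OCR output for that page with the transcripts already produced for earlier pages. All names (`build_prompt`, `transcribe_document`, `call_model`) and the prompt wording are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of page-context sharing for multi-page HTR.
# The function names and prompt template are illustrative only; the
# paper's OCR+PAGE-N strategy may differ in its details.

def build_prompt(ocr_text: str, prior_transcripts: list[str]) -> str:
    """Combine noisy OCR for the current page with transcripts of
    earlier pages, so the model can reuse shared vocabulary and style."""
    context = "\n\n".join(
        f"Page {i + 1} transcript:\n{t}"
        for i, t in enumerate(prior_transcripts)
    )
    return (
        "You are correcting noisy OCR of a handwritten document.\n"
        f"Context from earlier pages:\n{context}\n\n"
        f"Noisy OCR of the current page:\n{ocr_text}\n"
        "Return only the corrected transcription."
    )


def transcribe_document(ocr_pages: list[str], call_model) -> list[str]:
    """Transcribe pages in order, feeding each result into the next
    page's prompt. `call_model` is any text-in, text-out LLM wrapper."""
    transcripts: list[str] = []
    for ocr_text in ocr_pages:
        prompt = build_prompt(ocr_text, transcripts)
        transcripts.append(call_model(prompt))
    return transcripts
```

The design choice this illustrates is that context grows with each page while each individual prompt stays simple: one OCR payload plus plain-text transcripts, with no multi-image input required.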
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24499