Out of the Box: Zero-Shot Vision-Language Models for Redaction Detection and Page-Stream Segmentation
Keywords: Vision Language Models, OCR, Document Understanding, FOIA, Redaction Detection, Open Government, Page-Stream Segmentation
TL;DR: Zero-shot VLMs can detect redactions and segment page streams into documents in Dutch government archives without task-specific training, and prompt design is the primary performance lever.
Abstract: Large collections of Dutch government documents are processed by OCR systems that extract plain text while discarding layout structure. Traditional OCR also struggles with visual redaction bars and cannot reliably split long page streams into separate documents. This paper investigates whether publicly available, modern Vision Language Models (VLMs) can address these limitations out of the box and offer a unified approach to layout-preserving OCR, automated redaction detection, and page-stream segmentation (PSS).
We evaluate Nanonets OCR-S in a zero-shot setting and introduce a dual-pass inference framework: a \emph{text pass} that transcribes the page and tags redacted spans, and a \emph{visual pass} that counts redaction bars directly from the page image. We compare three different redaction-detection inference pipelines, with cross-modal gating methods that aim to combine strengths from both modalities. Our baseline achieves the most balanced behavior (text F1 = 0.384, visual F1 = 0.542).
Our first refined pipeline (V1) achieves a visual F1 of 0.574, a modest but real improvement over a count-level mean baseline (F1 = 0.479). The comparison to task-specific models is limited by evaluation methodology, however: our VLM pipelines are evaluated on count accuracy only, not on redaction bar localization. For PSS, a refined prompt reaches F1 = 0.513 on the OpenPSS-LONG split, substantially outperforming a stratified random baseline (F1~$\approx$~0.238) and illustrating the strong influence of prompt design.
Compared to task-specific models (Mask R-CNN for redaction detection, which achieves F1~$\approx$~0.95, and a multimodal ensemble for PSS, which achieves F1~$\approx$~0.85), our zero-shot VLM approach is less accurate but generalizes well across tasks within a single general-purpose model. Our results indicate that VLMs can effectively reason over textual and visual features in ways that traditional OCR cannot, although performance is constrained by prompt sensitivity and operational feasibility is limited by computational cost.
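To make the phrase "evaluated on count accuracy only" concrete, the following is a minimal sketch of one way a count-level F1 could be computed from per-page redaction counts. The matching rule is an assumption rather than the paper's stated protocol: on each page, the smaller of the predicted and gold counts is treated as true positives, surplus predictions as false positives, and missed bars as false negatives. The function name `count_level_prf` and the example counts are hypothetical.

```python
from typing import Sequence, Tuple


def count_level_prf(
    pred_counts: Sequence[int], gold_counts: Sequence[int]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from per-page redaction counts.

    Assumed matching rule (not taken from the paper): per page,
    min(pred, gold) bars count as true positives, surplus predictions
    as false positives, and missed bars as false negatives.
    No localization is checked.
    """
    if len(pred_counts) != len(gold_counts):
        raise ValueError("pred_counts and gold_counts must have equal length")

    tp = fp = fn = 0
    for pred, gold in zip(pred_counts, gold_counts):
        matched = min(pred, gold)
        tp += matched
        fp += pred - matched   # over-counted bars
        fn += gold - matched   # missed bars

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical counts for three pages: model output vs. annotation.
precision, recall, f1 = count_level_prf([2, 0, 3], [1, 1, 3])
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Under this rule a page can score perfectly even when the counted bars are in the wrong places, which is why count-level scores are not directly comparable to the localization-based F1 reported for detection models such as Mask R-CNN.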
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 15