A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows
Keywords: Multistage Document Understanding, Vision–Language Models, Financial Document Extraction, KYC Automation, Long-Document Processing
TL;DR: We propose a multistage OCR–retrieval–VLM pipeline for long, multilingual scanned financial documents that improves extraction accuracy by up to 31.9 points over direct PDF-to-VLM baselines while maintaining comparable service latency.
Abstract: Structured information extraction from long, multilingual scanned financial documents is a core requirement in industrial KYC and compliance workflows. These documents are typically non-machine-readable, noisy, and visually heterogeneous. They usually span dozens of pages while containing only sparse task-relevant information. Although recent vision–language models (VLMs) achieve strong benchmark performance, directly applying them end-to-end to full financial reports often leads to unreliable extraction under real-world conditions.
We present a multistage extraction framework that integrates image preprocessing, multilingual OCR, hybrid page-level retrieval, and compact VLM-based structured extraction. The design separates page localization from multimodal reasoning, enabling more accurate extraction from complex multi-page documents.
We evaluate the framework on 120 production KYC documents comprising roughly 3,000 multilingual scanned pages. Across multiple OCR–VLM combinations, the proposed pipeline consistently outperforms direct PDF-to-VLM baselines, improving field-level accuracy by up to 31.9 percentage points. The best configuration, PaddleOCR with MiniCPM-o-2.6, achieves 87.27% accuracy. Ablation studies show that page-level retrieval is the dominant contributor to these gains, particularly for complex financial statements and non-English documents.
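The staged design described above — OCR the pages, retrieve the few task-relevant ones, then run structured extraction only on those — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Page` container, the lexical scoring stand-in for the paper's hybrid retrieval, and the pattern-matching stand-in for the VLM extraction stage are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Page:
    number: int
    text: str  # OCR output for this page (stage 1–2 of the pipeline)


def lexical_score(query_terms: list[str], text: str) -> float:
    # Stand-in for hybrid page-level retrieval: fraction of query
    # terms that appear on the page. A real system would combine
    # lexical (e.g. BM25) and dense embedding scores.
    words = set(text.lower().replace(":", " ").split())
    return sum(t in words for t in query_terms) / len(query_terms)


def retrieve_pages(pages: list[Page], query_terms: list[str], top_k: int = 3) -> list[Page]:
    # Page localization: rank all pages by relevance and keep only
    # the top_k, so the expensive VLM stage never sees the full document.
    ranked = sorted(pages, key=lambda p: lexical_score(query_terms, p.text), reverse=True)
    return ranked[:top_k]


def extract_fields(pages: list[Page], fields: list[str]) -> dict[str, str]:
    # Placeholder for compact VLM-based structured extraction:
    # here we just scan OCR text for "field: value" lines.
    out: dict[str, str] = {}
    for page in pages:
        for line in page.text.splitlines():
            for field in fields:
                prefix = field + ":"
                if line.lower().startswith(prefix):
                    out[field] = line[len(prefix):].strip()
    return out


# Toy document: only one of the three pages carries the target fields.
doc = [
    Page(1, "annual report cover"),
    Page(2, "registered name: Acme Holdings Ltd\njurisdiction: Singapore"),
    Page(3, "auditor notes and disclosures"),
]
relevant = retrieve_pages(doc, ["registered", "name", "jurisdiction"], top_k=1)
print(extract_fields(relevant, ["registered name", "jurisdiction"]))
```

The separation matters because retrieval errors and extraction errors can then be measured and improved independently, which is what the ablation above exploits: swapping retrieval on or off isolates its contribution to the end-to-end accuracy.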
Submission Type: Deployed
Copyright Form: pdf
Submission Number: 347