Breaking the annotation barrier with DocuLite: A scalable and privacy-preserving framework for financial document understanding
Keywords: financial/business NLP, multimodal applications
Abstract: In this paper, we introduce DocuLite, a scalable and privacy-preserving framework for adapting large language models (LLMs) and vision language models (VLMs) to information extraction from invoice documents with diverse layouts, without relying on human-annotated data. DocuLite includes (a) InvoicePy, an LLM-driven synthetic invoice generator in the text domain for training LLMs to extract information from invoices processed by optical character recognition (OCR) models, and (b) TemplatePy, an HTML-based synthetic invoice template generator in the image domain for training VLMs to extract information from invoice images. We also curate the "Challenging Invoice Extraction dataset", containing 184 real-world invoices. The research is conducted in collaboration with a Fintech startup that identifies itself as an "Agentic AI Platform for Finance and Accounting." Domain experts at the startup annotate the "Challenging Invoice Extraction dataset" and continuously evaluate the performance of LLMs and VLMs trained using DocuLite. Experiments on the "Challenging Invoice Extraction dataset" demonstrate that the openchat-3.5-1210-7B LLM trained on the InvoicePy-generated dataset improves the F1 score by 0.525 over the same model trained on the publicly available UCSF dataset. Likewise, the InternVL-2-8B VLM trained on the TemplatePy-generated dataset improves the F1 score by 0.513 over the same model trained on the publicly available UCSF dataset. To the best of our knowledge, DocuLite is the first scalable and privacy-preserving framework for adapting LLMs and VLMs to information extraction from invoice documents with diverse layouts.
Submission Number: 45