ViG-LLM: Enhancing Visual Grounding Capabilities in Closed-Box LLMs for Document Information Extraction without OCR Dependencies
Keywords: Visual Grounding, Large Language Models (LLMs), Document Information Extraction, OCR-independence, Multi-agent System, Human-in-the-loop Learning (HITL)
Abstract: Large Language Models (LLMs) have shown remarkable capabilities in document processing, but their inability to provide visual grounding without OCR dependencies poses significant challenges in business-critical applications. Current solutions either require model fine-tuning or rely on external OCR services, which introduces additional cost and latency and limits their ability to handle derived information. This paper presents ViG-LLM, a novel framework that enables closed-box LLMs to generate localization information through a multi-agent system combining U-Net-based layout deconstruction with viewport identification tasks. Evaluated on the FATURA and CORD datasets, our framework achieves perfect accuracy, surpassing spatial-reasoning-tuned LLMs such as Amazon Nova Pro, while demonstrating superior template-specific consistency. The framework maintains robust performance across LLM architectures while reducing operational costs by 60% compared to Textract-based solutions. In real-world document processing applications, the framework retains the system's strong reasoning capabilities in document information extraction tasks while improving explainability, reliability, and human interaction for information verification. Through human-in-the-loop learning and closed-box prompt alignment techniques, ViG-LLM provides a robust, adaptable solution for visual grounding tasks in document processing workflows.
Submission Number: 3