MIMIC-VQA: COMPILING AGENTIC REASONERS INTO EFFICIENT DOCUMENT VQA MODELS

ICLR 2026 Conference Submission13577 Authors

Published: 18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Document Visual Question Answering, VLM, Document Understanding
Abstract: Document Visual Question Answering systems face a fundamental architectural dichotomy: modular agentic frameworks decompose problems into interpretable sub-tasks but incur prohibitive inference latency through sequential tool orchestration, while monolithic end-to-end models achieve computational efficiency at the cost of reasoning transparency and spatial grounding capabilities. We present MIMIC-VQA, a knowledge distillation framework that transcends this trade-off by compiling the procedural reasoning of expert agents into efficient neural architectures. Our approach operates through a two-phase paradigm: first, a teacher pipeline orchestrated by Llama 4 Scout generates 102,447 Chain-of-Thought reasoning traces that explicitly encode multi-step problem decomposition, contextual retrieval, and deterministic spatial grounding; second, these traces train a pruned 9B-parameter student model derived from Gemma 3-27B to replicate the complete reasoning process—including intermediate steps and bounding box coordinates—within a single autoregressive generation. This procedural distillation enables the student to internalize the teacher's tool-based reasoning methodology while eliminating runtime dependencies on external components. Empirically, MIMIC-VQA achieves state-of-the-art performance across DocVQA (89.7 ANLS), VisualMRC, FUNSD, and CORD benchmarks, demonstrating 20-30 point improvements in spatial grounding (mAP@IoU) over existing methods while operating 5.3× faster than the teacher system. The framework maintains 98.3% of teacher accuracy despite 66% parameter reduction, validating that complex multi-agent reasoning can be successfully compiled into compact neural representations. By treating sophisticated agentic systems as data generators rather than deployment models, MIMIC-VQA establishes a practical paradigm for scaling document understanding capabilities without prohibitive infrastructure costs. 
The dataset of reasoning traces and the official implementation are publicly available at: https://anonymous.4open.science/r/MIMIC-B5DF.
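The abstract describes training the student to replicate the full agentic reasoning process, including intermediate steps and bounding-box coordinates, within a single autoregressive generation. A minimal sketch of how such a teacher trace might be serialized into one training target is shown below; the tag format, field names, and `serialize_trace` helper are illustrative assumptions, not the paper's actual schema.

```python
# Hedged sketch: flattening an agentic reasoning trace into a single
# autoregressive target string, so the student learns to emit reasoning
# steps, spatial grounding, and the final answer in one generation.
# The tag vocabulary below is a hypothetical example, not the paper's schema.

def serialize_trace(question, steps, bbox, answer):
    """Flatten a teacher trace (question, CoT steps, bbox, answer)
    into one newline-joined target string."""
    lines = [f"<question> {question}"]
    for i, step in enumerate(steps, 1):
        lines.append(f"<step {i}> {step}")
    x0, y0, x1, y1 = bbox
    lines.append(f"<grounding> [{x0}, {y0}, {x1}, {y1}]")
    lines.append(f"<answer> {answer}")
    return "\n".join(lines)

target = serialize_trace(
    question="What is the invoice total?",
    steps=["Locate the totals table.", "Read the 'Total' row."],
    bbox=(412, 1088, 560, 1112),
    answer="$1,284.00",
)
print(target)
```

Because the grounding coordinates appear inside the generated text itself, no external detection or retrieval tool is needed at inference time, which is the runtime-dependency elimination the abstract claims.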
Primary Area: interpretability and explainable AI
Submission Number: 13577