Keywords: Multimodal Algorithmic Reasoning, Supervised Fine-Tuning, Reinforcement Learning, Chain-of-Thought, Document Understanding, Cross-Modal Reasoning, Citation Attribution, Document Reranking, Explainable AI, Foundation Models, Few-Shot Learning, Structured Reasoning
TL;DR: TRACE enables VLMs to maintain robust reasoning and accurate source attribution across 150-page documents through efficient two-stage training with balanced rewards.
Abstract: Current Vision-Language Models (VLMs) exhibit severe performance degradation when processing extended multimodal document contexts, declining from $\sim$87\% accuracy on short contexts (1-10 pages) to $\sim$18\% on long contexts (150 pages). This fundamental limitation severely restricts their applicability to real-world document intelligence tasks that require multi-page reasoning. We introduce \textbf{TRACE} (Transparent Reasoning and Attribution Chains for Extended Multimodal Contexts), a novel training framework that enables VLMs to maintain robust reasoning performance across documents of 10-150 pages through structured chain-of-thought generation with accurate source attribution. Our approach combines three components: (1) a synthetic data generation pipeline producing 500K high-quality long-context document instances with reasoning traces and page-level citations, (2) a two-stage training methodology integrating Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), and (3) specialized reward functions that jointly optimize answer accuracy, citation precision, and reasoning coherence. Extensive experiments on Document Visual Question Answering and document reranking tasks demonstrate that TRACE achieves a 91-203\% improvement over baseline VLMs at 150-page contexts, with SFT providing a 40-50\% gain and reinforcement learning contributing an additional 10-20\%. Our work directly addresses multimodal algorithmic reasoning challenges by enabling models to automatically derive structured reasoning procedures for complex visual-textual document analysis tasks.
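The abstract names a balanced reward over answer accuracy, citation precision, and reasoning coherence, optimized with GRPO. As a minimal sketch of how such a setup typically fits together, the Python below combines three component scores with illustrative weights and computes GRPO's standard group-relative advantages. The component scorers and the weights `w_ans`, `w_cite`, and `w_coh` are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def trace_reward(answer_score: float,
                 citation_precision: float,
                 coherence_score: float,
                 w_ans: float = 0.5,
                 w_cite: float = 0.3,
                 w_coh: float = 0.2) -> float:
    """Hypothetical balanced reward: a weighted sum of the three
    components the abstract names, each assumed to lie in [0, 1].
    The weights here are illustrative, not from the paper."""
    return (w_ans * answer_score
            + w_cite * citation_precision
            + w_coh * coherence_score)

def grpo_advantages(group_rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    """GRPO normalizes rewards within a group of responses sampled for
    the same prompt, yielding per-sample advantages without a learned
    value function: A_i = (r_i - mean(r)) / (std(r) + eps)."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four sampled responses to one long-document question.
rewards = [
    trace_reward(1.0, 0.9, 0.8),  # correct answer, well-cited pages
    trace_reward(1.0, 0.4, 0.7),  # correct answer, weak citations
    trace_reward(0.0, 0.8, 0.9),  # wrong answer, coherent trace
    trace_reward(0.0, 0.1, 0.3),  # wrong answer, sloppy trace
]
print(grpo_advantages(rewards))
```

One consequence of this group-relative normalization is that no separate value model is needed during the reinforcement-learning stage, which is consistent with the TL;DR's framing of the two-stage training as efficient.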
Submission Number: 284