Dynamic, Coherent, and Lossless: The Iterative Structured Content Filling Framework for Infinite Document Parsing
Keywords: Iterative Structured Content Filling; Dynamic Boundary Re-alignment; Unstructured document parsing; LLM context limitation
Abstract: Parsing long, unstructured documents into hierarchical machine-readable formats is critically challenged by the tension between document length and limited LLM context. Prevailing one-shot or stateless chunk-based methods suffer from the "lost-in-the-middle" phenomenon and fragmented structures. We argue for reframing the task as an iterative, stateful editing process. This paper introduces the Iterative Structured Content Filling (ISCF) framework, which incrementally builds a global tree through LLM-driven steps under partial observability. ISCF integrates three pillars: (1) Dynamic Boundary Re-alignment (DBR) for unsupervised chunk optimization via semantic coherence and uncertainty; (2) Adaptive Structural Masking & Hierarchical-Aware Path Embedding (HAPE) for efficient global awareness within finite context; (3) Bijective Path-Content Mapping ensuring lossless reconstruction. Extensive experiments on financial and legal documents demonstrate ISCF's superior accuracy, robustness, and efficiency over strong baselines, offering a principled solution for unbounded document structuring.
Paper Type: Long
Research Area: Hierarchical Structure Prediction, Syntax, and Parsing
Research Area Keywords: Syntax: Tagging, Chunking and Parsing
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 3065
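The third pillar named in the abstract, Bijective Path-Content Mapping, can be illustrated with a minimal sketch. The abstract does not specify the paper's actual data structures, so everything here is an assumption made for illustration: the dict-based node layout, the `flatten` and `rebuild` helpers, and the use of child-index tuples as paths. The sketch only shows the general idea that a one-to-one mapping between tree paths and text spans lets both the tree and the source text be recovered exactly.

```python
from typing import Dict, Tuple

# Hypothetical path type: the sequence of child indices from the root.
Path = Tuple[int, ...]


def flatten(tree: dict, path: Path = ()) -> Dict[Path, str]:
    """Map every node's path to the text span stored at that node."""
    mapping = {path: tree["text"]}
    for i, child in enumerate(tree.get("children", [])):
        mapping.update(flatten(child, path + (i,)))
    return mapping


def rebuild(mapping: Dict[Path, str]) -> dict:
    """Invert flatten(): recover the tree from the path-to-text mapping."""
    nodes = {p: {"text": t, "children": []} for p, t in mapping.items()}
    # Sorted tuples visit parents before children and siblings in index order.
    for p in sorted(mapping):
        if p:  # the root has the empty path and no parent
            nodes[p[:-1]]["children"].append(nodes[p])
    return nodes[()]


# Toy document (illustrative data, not taken from the paper).
doc = {
    "text": "1 Report\n",
    "children": [
        {"text": "1.1 Revenue\n", "children": []},
        {
            "text": "1.2 Risk\n",
            "children": [{"text": "Details.\n", "children": []}],
        },
    ],
}

mapping = flatten(doc)
restored = rebuild(mapping)
# Concatenating spans in sorted-path (pre-order) order yields the source text.
source_text = "".join(mapping[p] for p in sorted(mapping))
```

In this toy setting, `restored == doc` holds and `source_text` equals the concatenation of the node spans in reading order, which is the sense in which such a mapping is lossless. How ISCF constructs and maintains the mapping during iterative filling is not described in the abstract.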