Keywords: Document Parsing, Reinforcement Learning, Reward Engineering, Multimodal
Abstract: Despite its success in Large Language Models, Reinforcement Learning (RL) remains underutilized in document parsing because generic rewards fail to effectively evaluate diverse document elements such as formulas and tables. Existing metrics (e.g., Edit Distance) often miss the semantic validity of structures like LaTeX formulas or nested tables, while training dedicated Reward Models requires expensive human annotation. To bridge this gap, we introduce $\textbf{DocPO}$, a novel policy optimization framework featuring $\textbf{Tailored Step-Aware Rewards}$. Unlike generic approaches, DocPO constructs domain-specific reward functions without preference data: it integrates LLM-based semantic verification with syntactic constraints for formulas, utilizes structure-weighted TEDS for tables, and employs continuous distance metrics for text to mitigate reward sparsity. Additionally, we propose Step-Aware Annealing, which dynamically modulates reward discriminability to better distinguish hard samples. Experiments show DocPO boosts parsing precision across element types, establishing a scalable, annotation-free paradigm for document understanding.
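To make the reward design concrete, the sketch below illustrates the two ingredients named in the abstract for the text modality: a continuous distance-based reward (mitigating sparsity) and a step-aware annealing schedule that modulates reward discriminability over training. All function names, the similarity measure (difflib's ratio as a stand-in for a normalized edit-distance metric), and the annealing exponent schedule are hypothetical illustrations, not the paper's actual implementation.

```python
from difflib import SequenceMatcher

def text_reward(pred: str, ref: str) -> float:
    """Continuous text reward in [0, 1] (hypothetical stand-in for a
    normalized edit-distance metric); dense signal mitigates sparsity."""
    return SequenceMatcher(None, pred, ref).ratio()

def step_aware_anneal(reward: float, step: int, total_steps: int,
                      gamma_start: float = 0.5, gamma_end: float = 2.0) -> float:
    """Illustrative annealing: early in training a flat exponent (gamma < 1)
    lets partially correct parses earn credit; later, gamma > 1 sharpens the
    reward so only near-perfect (hard) samples score highly."""
    t = step / max(total_steps, 1)          # training progress in [0, 1]
    gamma = gamma_start + (gamma_end - gamma_start) * t
    return reward ** gamma
```

Under this sketch, the same raw similarity of 0.5 yields a reward of about 0.71 at step 0 but only 0.25 at the final step, so the effective difficulty of earning high reward increases as the policy improves.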
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Generation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 4045