Keywords: Unique Semantic Mapping, Flow-Based Model, Document Visual Question Answering, Multi-Page Document Understanding
Abstract: Document Visual Question Answering (DocVQA) aims to generate answers by jointly understanding the textual, layout, and visual elements within document images.
Although end-to-end vision-based generative methods have reduced the dependency on OCR, they still struggle to localize evidence precisely when page semantics are complex and highly similar.
Moreover, existing research lacks an in-depth theoretical analysis of the question-driven semantic representation space, and thus fails to fundamentally address the distinguishability problem among semantically similar pages.
To fill this theoretical gap, we propose and prove that, given a specific question, each page possesses a unique semantic representation, and there exists a bijective mapping between the page and its unique semantics.
Based on this theoretical foundation, we introduce the \textbf{F}low-Based Page \textbf{U}nique Semantic \textbf{M}apping \textbf{A}rchitecture (\textbf{FUMA}), which reconstructs evidence localization from similarity-based retrieval into precise selection on unique semantics.
FUMA employs fine-grained cross-modal attention to extract discriminative cues and utilizes flow-based reversible transformations with likelihood regularization to learn bijective mappings, ensuring that each page obtains a unique semantic representation.
Moreover, a multi-expert collaboration mechanism complementarily models fine-grained multimodal information within each page, achieving robust answer generation.
Experimental results demonstrate that FUMA significantly outperforms existing methods in both evidence localization and answer generation.
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: multimodal QA; generalization
Contribution Types: Model analysis & interpretability, Theory
Languages Studied: English
Submission Number: 7240