Keywords: Unique Semantic Mapping, Flow-Based Model, Document Visual Question Answering, Multi-Page Document Understanding
Abstract: Document Visual Question Answering (DocVQA) aims to generate answers by jointly understanding the textual, layout, and visual elements within document images.
Although end-to-end vision-based generative methods have reduced the dependency on OCR, they still struggle to localize evidence precisely when page semantics are complex and highly similar.
Moreover, existing research lacks an in-depth theoretical analysis of the question-driven semantic representation space, and thus fails to fundamentally address the distinguishability problem among semantically similar pages.
To fill this theoretical gap, we propose and prove that, given a specific question, each page possesses a unique semantic representation, and there exists a bijective mapping between the page and its unique semantics.
Based on this theoretical foundation, we introduce the \textbf{F}low-Based Page \textbf{U}nique Semantic \textbf{M}apping \textbf{A}rchitecture (\textbf{FUMA}), which reconstructs evidence localization from similarity-based retrieval into precise selection on unique semantics.
FUMA employs fine-grained cross-modal attention to extract discriminative cues and utilizes flow-based reversible transformations with likelihood regularization to learn bijective mappings, ensuring that each page obtains a unique semantic representation.
Moreover, a multi-expert collaboration mechanism complementarily models fine-grained multimodal information within each page, achieving robust answer generation.
Experimental results demonstrate that FUMA significantly outperforms existing methods in both evidence localization and answer generation.
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: multimodal QA; generalization
Contribution Types: Model analysis & interpretability, Theory
Languages Studied: English
Submission Number: 7240