Abstract: Most existing document visual question answering (DocVQA) methods are restricted to single-page documents, which limits their applicability to the more common multi-page setting. We introduce MP-FIRE, a multi-page DocVQA framework that integrates graph pruning-based reinforcement with a cross-modal agent ensemble. MP-FIRE overcomes the Transformer’s inherent input length limitation, enabling it to process an unlimited number of document pages in a single pass. Specifically, MP-FIRE employs a topology graph to extract document features and applies a two-stage pruning process to eliminate irrelevant document elements. It then uses the Performance Characterization Spectrum (PCS) to form a cross-modal agent ensemble that exploits the complementary strengths of its members and improves overall performance. Experimental results on DUDE and MP-DocVQA show that MP-FIRE achieves state-of-the-art performance while maintaining strong generalization and fault tolerance.
External IDs: dblp:conf/icmcs/YuZZ25