Abstract: Multimodal document retrieval aims to retrieve query-relevant components from documents composed of textual, tabular, and visual elements.
An effective multimodal retriever must address two main challenges:
(1) mitigating the effect of irrelevant content introduced by fixed, single-granularity retrieval units, and
(2) supporting multihop reasoning by effectively capturing semantic relationships among components within and across documents.
To address these challenges, we propose LILaC, a multimodal retrieval framework featuring two core innovations.
First, we introduce a layered component graph that explicitly represents multimodal information at two layers, one coarse-grained and one fine-grained, facilitating efficient yet precise reasoning.
Second, we develop a late-interaction-based subgraph retrieval method, an edge-based approach that first identifies coarse-grained nodes for efficient candidate generation and then performs fine-grained reasoning via late interaction.
Extensive experiments demonstrate that LILaC achieves state-of-the-art retrieval performance on four out of five benchmarks, notably without additional fine-tuning.
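As an illustration of the coarse-then-fine idea described above, the following is a minimal sketch, not the LILaC implementation. It assumes each component carries one coarse embedding and a list of fine-grained subcomponent embeddings, and it uses a ColBERT-style MaxSim sum as the late-interaction score; the abstract does not specify the scoring function, so that choice is an assumption. The sketch also omits the graph edges that the actual method traverses. All names (`cosine`, `late_interaction_score`, `retrieve`, the toy `components`) are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def late_interaction_score(query_vecs, sub_vecs):
    """MaxSim-style score (assumed): match each query vector to its most
    similar subcomponent vector, then sum those maxima."""
    return sum(max(cosine(q, s) for s in sub_vecs) for q in query_vecs)


def retrieve(query_coarse, query_vecs, components, k=2):
    """Two-stage retrieval: coarse candidate generation, fine re-ranking."""
    # Stage 1: keep the k components whose coarse embedding is closest
    # to the single coarse query embedding.
    candidates = sorted(
        components,
        key=lambda c: cosine(query_coarse, c["coarse"]),
        reverse=True,
    )[:k]
    # Stage 2: re-rank only those candidates with the late-interaction score
    # computed over their fine-grained subcomponent embeddings.
    return sorted(
        candidates,
        key=lambda c: late_interaction_score(query_vecs, c["fine"]),
        reverse=True,
    )


# Toy data with hand-picked 2-d embeddings (purely illustrative).
components = [
    {"id": "table", "coarse": [1.0, 0.0], "fine": [[1.0, 0.0], [0.0, 1.0]]},
    {"id": "paragraph", "coarse": [0.9, 0.1], "fine": [[1.0, 0.0]]},
    {"id": "image", "coarse": [0.0, 1.0], "fine": [[0.0, 1.0]]},
]
query_coarse = [1.0, 0.0]
query_vecs = [[1.0, 0.0], [0.0, 1.0]]

ranked = retrieve(query_coarse, query_vecs, components, k=2)
print([c["id"] for c in ranked])
```

The design point the sketch tries to convey is that the expensive per-subcomponent comparison runs only on the small candidate set produced by the cheap coarse pass.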
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: Multimodal Document Retrieval, Multihop Reasoning, Late Interaction
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7723