Attention as Selector: Unlocking VLM Attention for Long Document Page Retrieval

ACL ARR 2026 January Submission5005 Authors

05 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Visual Document Retrieval, Visual Language Models, Cross-modal Attention
Abstract: Visual Language Models (VLMs) have become a robust foundation for document question answering. Processing long documents remains challenging due to limited context windows and computational budgets. Existing page-level retrieval methods offer a practical solution, typically encoding pages and queries into vectors and ranking them via cosine similarity. However, such embedding‑based methods (i) lack query–page interaction before similarity scoring and (ii) usually require large-scale datasets to align visual and textual embeddings. In this paper, we observe that the cross‑modal attention maps of well‑trained VLMs are able to highlight semantically relevant regions. Building on this insight, we present CAPS (Cross-modal Attention as Page Selector), a retrieval framework that utilizes attention mechanisms inside VLMs for page selection. Specifically, CAPS first enhances attention-based retrieval capability with a small amount of contrastive data, then identifies the most effective attention head through expert head selection, and finally employs an adaptive filtering mechanism to obtain an appropriate number of relevant page candidates. Extensive experiments on four long-document benchmarks demonstrate that CAPS outperforms state-of-the-art embedding‑based methods in both retrieval precision and downstream DocQA accuracy. Notably, CAPS achieves these gains using less than 10% of the training data required by competing baselines, highlighting the data efficiency of attention-based page retrieval.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: image text matching, vision question answering, cross-modal application
Languages Studied: English
Submission Number: 5005