DocPruner: A Storage-Efficient Framework for Multi-Vector Visual Document Retrieval via Adaptive Patch-Level Embedding Pruning

ICLR 2026 Conference Submission 15588 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Visual Document Retrieval, Multi-Vector Retrieval, Multimodal LLMs
TL;DR: DocPruner is a framework that adaptively prunes redundant patch-level embeddings, slashing storage overhead by 50-60% for large-scale visual document retrieval systems while maintaining accuracy.
Abstract: Visual Document Retrieval (VDR), the task of retrieving visually rich document pages using queries that combine visual and textual cues, is crucial for numerous real-world applications. Recent state-of-the-art methods leverage Large Vision-Language Models (LVLMs) in a multi-vector paradigm, representing each document page as a set of patch-level embeddings to capture fine-grained details. While highly effective, this approach introduces a critical challenge: prohibitive storage overhead, as storing hundreds of vectors per page makes large-scale deployment costly and impractical. To address this, we introduce **DocPruner, the first framework to apply adaptive patch-level embedding pruning to VDR to effectively reduce storage overhead**. DocPruner leverages the intra-document patch attention distribution to dynamically identify and discard redundant embeddings for each document. This adaptive mechanism enables a 50-60% reduction in storage for leading multi-vector VDR models with negligible degradation in retrieval performance. Extensive experiments across more than ten benchmark datasets validate that DocPruner offers a robust, flexible, and effective solution for building storage-efficient, large-scale VDR systems.
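To make the pruning idea concrete, below is a minimal sketch of attention-guided patch pruning as described in the abstract. The abstract does not specify the exact decision rule, so the function name `prune_patch_embeddings`, the mean-plus-standard-deviation threshold, and the parameter `alpha` are illustrative assumptions, not the paper's actual method:

```python
import numpy as np

def prune_patch_embeddings(patch_embs: np.ndarray,
                           attn_scores: np.ndarray,
                           alpha: float = 1.0) -> np.ndarray:
    """Keep only patches whose attention mass exceeds a per-document
    adaptive threshold. This is a hypothetical rule for illustration;
    the paper's criterion may differ."""
    # Normalize attention over patches into a distribution.
    probs = attn_scores / attn_scores.sum()
    # Adaptive threshold: a peaked attention distribution (high std)
    # raises the bar and prunes more; a flat one keeps more patches.
    threshold = probs.mean() + alpha * probs.std()
    keep = probs >= threshold
    return patch_embs[keep]

# Example: 768 patch vectors of dim 128 with random attention scores.
rng = np.random.default_rng(0)
embs = rng.normal(size=(768, 128))
attn = rng.random(768)
pruned = prune_patch_embeddings(embs, attn)
print(f"kept {pruned.shape[0]} / 768 patches")
```

Because the threshold is computed per document rather than fixed globally, the retained-patch count adapts to each page's attention profile, which is consistent with the "adaptive" behavior the abstract claims.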
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15588