DocQIR-Emb: Document Image Retrieval with Multi-lingual Question Query

Published: 23 Sept 2025, Last Modified: 17 Nov 2025
Venue: UniReps 2025
License: CC BY-NC 4.0
Supplementary Material: pdf
Track: Proceedings Track
Keywords: Document image retrieval, Multi-lingual text embedder, Visual language model
Abstract: Document image retrieval is a fundamental task for document understanding: given a user question, the goal is to retrieve the relevant images from a document so that the question can be answered. Unlike other text-to-image tasks, which mainly focus on aligning image captions with natural images, document image retrieval requires the model to understand a user's question and return the related table or scientific image. The significant domain gap between image captions and user questions, and between natural images and scientific images, prevents off-the-shelf retrieval models from being directly applicable. To systematically study this degradation, we curate a novel multi-lingual Document Question-Image Retrieval benchmark, DocQIR, that covers questions in 5 different languages. Our preliminary study shows that off-the-shelf retrieval models fail to retrieve document images when questions are posed in various languages. To address this issue, we propose a novel architecture, DocQIR-Emb, that leverages a multi-lingual text embedder and a VLM to encode a question and an image into a shared feature space. Since the multi-lingual embedder is already trained to align text across languages, it is kept frozen and only the VLM is optimized. Experiments show that DocQIR-Emb outperforms the baseline by at least 40% on the proposed DocQIR dataset, with consistent gains across table images and scientific images. We also ablate different architecture designs to demonstrate the effectiveness of DocQIR-Emb.
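The abstract describes a two-tower setup: a frozen multi-lingual text embedder and a trainable VLM image encoder mapped into a shared feature space. A minimal PyTorch sketch of that idea is given below; the contrastive objective, the projection head, and all names (DocQIREmbSketch, contrastive_loss, and the encoder interfaces) are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DocQIREmbSketch(nn.Module):
    """Hypothetical sketch of the DocQIR-Emb idea: a frozen multi-lingual
    text embedder aligned with a trainable VLM image encoder in a shared
    feature space. The exact encoders and objective are assumptions."""

    def __init__(self, text_embedder: nn.Module, vlm_image_encoder: nn.Module,
                 vlm_dim: int, shared_dim: int):
        super().__init__()
        # Frozen, per the abstract: the embedder already aligns languages.
        self.text_embedder = text_embedder
        for p in self.text_embedder.parameters():
            p.requires_grad = False
        # Only the VLM side is optimized; a linear head (an assumption)
        # projects its features into the text embedder's space.
        self.vlm_image_encoder = vlm_image_encoder
        self.proj = nn.Linear(vlm_dim, shared_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07)

    def forward(self, question_tokens, document_images):
        # Assumes text_embedder returns (B, shared_dim) question embeddings.
        with torch.no_grad():
            q = self.text_embedder(question_tokens)
        v = self.proj(self.vlm_image_encoder(document_images))
        q = F.normalize(q, dim=-1)
        v = F.normalize(v, dim=-1)
        # (B, B) similarity logits; matched pairs lie on the diagonal.
        return self.logit_scale.exp() * q @ v.t()

def contrastive_loss(logits: torch.Tensor) -> torch.Tensor:
    # Symmetric InfoNCE-style loss over in-batch question-image pairs.
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

Under this reading, freezing the text tower preserves its cross-lingual alignment, so only the image side has to move toward the shared space, which matches the abstract's claim that the embedder stays fixed while the VLM is optimized.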
Submission Number: 8