MarkedDPR: Enhancing Dense Passage Retrieval with Exact Match Signals and Synthetic Data Augmentation

Published: 01 Jan 2024, Last Modified: 16 May 2025. Venue: WISE (1) 2024. License: CC BY-SA 4.0
Abstract: Recent neural IR models, such as the Dense Passage Retriever (DPR), often overlook exact-match signals and focus primarily on semantic matching between queries and documents. As a result, they may miss relevant documents that contain key query terms. In this paper, we first introduce MarkedDPR, an extension of DPR that explicitly integrates exact-match signals into relevance estimation. MarkedDPR uses a marking strategy to highlight exact matches in each query-document pair, guiding the model during training. Second, to address the transferability issue, we use a multi-phase fine-tuning process: initial fine-tuning on a general-domain dataset, followed by domain-specific fine-tuning on synthetic queries generated by Large Language Models (LLMs). Our empirical evaluation shows significant improvements: MarkedDPR achieves a 42.30% improvement over the baseline DPR model on in-domain data and an average improvement of 17.16% on out-of-domain data. Further fine-tuning with synthetic data yields an additional 26.13% improvement.
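
The marking strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the marker tokens, the tokenization, or how matches are numbered, so everything below is an assumption for illustration only: the function name `mark_exact_matches`, the `[eN] ... [/eN]` marker format, and the simple case-insensitive word matching are hypothetical and may differ from the authors' implementation.

```python
import re
from typing import Tuple

# Hypothetical word tokenizer; the paper's actual tokenization is not given.
TOKEN_RE = re.compile(r"\w+")


def mark_exact_matches(query: str, passage: str) -> Tuple[str, str]:
    """Wrap terms shared by the query and the passage in numbered markers.

    The [eN] ... [/eN] format is an assumption, not the paper's scheme.
    """
    # Unique query terms, lowercased, in order of first appearance.
    query_terms = []
    for tok in TOKEN_RE.findall(query.lower()):
        if tok not in query_terms:
            query_terms.append(tok)

    passage_terms = set(TOKEN_RE.findall(passage.lower()))

    # Give each shared term a stable index so the same term carries
    # the same marker on both the query side and the passage side.
    shared = [t for t in query_terms if t in passage_terms]
    marker_id = {term: i for i, term in enumerate(shared)}

    def wrap(match: "re.Match[str]") -> str:
        word = match.group(0)
        idx = marker_id.get(word.lower())
        if idx is None:
            return word  # not an exact match: leave untouched
        return f"[e{idx}] {word} [/e{idx}]"

    return TOKEN_RE.sub(wrap, query), TOKEN_RE.sub(wrap, passage)


marked_query, marked_passage = mark_exact_matches(
    "dense retrieval models",
    "Dense models ignore exact terms.",
)
print(marked_query)    # [e0] dense [/e0] retrieval [e1] models [/e1]
print(marked_passage)  # [e0] Dense [/e0] [e1] models [/e1] ignore exact terms.
```

In a dual-encoder setting, marked text like this would presumably be fed to the query and passage encoders in place of the raw text, with the marker strings registered as extra vocabulary tokens. That step is also an assumption and is not shown here.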