Abstract: The current paradigm in dense retrieval is to represent queries and passages as low-dimensional real-valued vectors using neural language models, and then compute query-passage similarity as the dot product of these vector representations. A limitation of this approach is that these learned representations cannot capture or express uncertainty. At the same time, information retrieval over large corpora contains several sources of uncertainty, such as misspelled or ambiguous text. Consequently, retrieval methods that incorporate uncertainty estimation are more likely to generalize well under such data distribution shifts. The multivariate representation learning (MRL) framework proposed by Zamani & Bendersky (2023) is the first method to model uncertainty in dense retrieval. This framework represents queries and passages as multivariate normal distributions, and computes query-passage similarity as the negative Kullback-Leibler (KL) divergence between these distributions. Furthermore, MRL formulates KL divergence as a dot product, allowing for efficient first-stage retrieval using standard maximum inner product search.
In this paper, we attempt to reproduce the MRL framework for dense retrieval by Zamani & Bendersky (2023).
We find that the original work (i) introduces a typographical/mathematical error early in the formulation of the method that propagates to the rest of the original paper's mathematical formulations, (ii) does not provide all of the information necessary to facilitate reproducibility, and (iii) proposes a training setup for MRL that, if followed, does not yield the reported performance in a fair comparison. In light of these findings, we address the mathematical error, make reasonable design choices where details are missing, and propose an improved training setup that complements the original paper by filling in important unspecified details. We further contribute a thorough ablation study, absent from the original paper, to gain more insight into the impact of the framework's different components. Despite our efforts, we were neither able to reproduce the exact results reported in the original paper, nor to uncover the reported trends against the baselines. Our analysis offers insights as to why that is the case. Most importantly, our empirical results suggest that the definition of variance in MRL does not consistently capture uncertainty. The source code for our reproducibility study is available at: https://anonymous.4open.science/r/multivariate_ir_code_release-AB26.
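As a rough illustration of the dot-product reduction mentioned in the abstract (a sketch of the general idea, not the authors' implementation): for diagonal-covariance Gaussians, the negative KL divergence between a query and a passage distribution equals an inner product of augmented query and passage vectors, up to an additive query-only constant that does not affect the passage ranking. The function names and the exact vector layout below are our own assumptions.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_d, var_d):
    """Exact KL( N(mu_q, diag(var_q)) || N(mu_d, diag(var_d)) )."""
    k = mu_q.shape[0]
    return 0.5 * (np.sum(np.log(var_d) - np.log(var_q)) - k
                  + np.sum(var_q / var_d)
                  + np.sum((mu_d - mu_q) ** 2 / var_d))

def query_vec(mu_q, var_q):
    # Augmented query vector: [1, var_q + mu_q^2, mu_q]
    return np.concatenate(([1.0], var_q + mu_q ** 2, mu_q))

def doc_vec(mu_d, var_d):
    # Augmented passage vector:
    # [-0.5 * sum(log var_d + mu_d^2 / var_d), -0.5 / var_d, mu_d / var_d]
    const = -0.5 * np.sum(np.log(var_d) + mu_d ** 2 / var_d)
    return np.concatenate(([const], -0.5 / var_d, mu_d / var_d))

# query_vec(...) @ doc_vec(...) equals -KL(query || passage) plus a
# query-only constant, so ranking passages by this inner product is
# equivalent to ranking by negative KL divergence -- and the inner
# product can be served by standard maximum inner product search.
```

Because the score is a plain inner product over fixed-size vectors, the augmented passage vectors can be indexed with any off-the-shelf MIPS engine, which is the efficiency argument the abstract refers to.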
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Aditya_Menon1
Submission Number: 2548