Toward Best Practices for Training Multilingual Dense Retrieval Models

Published: 01 Jan 2024, Last Modified: 25 Apr 2024ACM Trans. Inf. Syst. 2024EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Dense retrieval models using a transformer-based bi-encoder architecture have emerged as an active area of research. In this article, we focus on the task of monolingual retrieval in a variety of typologically diverse languages using such an architecture. Although recent work with multilingual transformers demonstrates that they exhibit strong cross-lingual generalization capabilities, there remain many open research questions, which we tackle here. Our study is organized as a “best practices” guide for training multilingual dense retrieval models, broken down into three main scenarios: when a multilingual transformer is available, but training data in the form of relevance judgments are not available in the language and domain of interest (“have model, no data”); when both models and training data are available (“have model and data”); and when training data are available but not models (“have data, no model”). In considering these scenarios, we gain a better understanding of the role of multi-stage fine-tuning, the strength of cross-lingual transfer under various conditions, the usefulness of out-of-language data, and the advantages of multilingual vs. monolingual transformers. Our recommendations offer a guide for practitioners building search applications, particularly for low-resource languages, and while our work leaves open a number of research questions, we provide a solid foundation for future work.
Loading

OpenReview is a long-term project to advance science through improved peer review with legal nonprofit status. We gratefully acknowledge the support of the OpenReview Sponsors. © 2025 OpenReview