Retrieval-based Language Models Using a Multi-domain Datastore

Published: 28 Oct 2023, Last Modified: 02 Apr 2024
Venue: DistShift 2023 Poster
Keywords: Retrieval-based language model, domain generalization, out-of-distribution
TL;DR: This work studies kNN-LM with a multi-domain datastore and shows that it is robust to out-of-distribution data and can outperform a single-domain oracle datastore.
Abstract: Retrieval-based language models (LMs) can generalize well to unseen test domains, but typically assume access to a datastore of examples from the target domain. It remains an open question whether these models are robust with more general datastores, which may include other out-of-domain data or cover multiple different test domains. In this paper, we study this question by constructing a multi-domain datastore and using a kNN-LM approach. We first show that, on domains that are part of the multi-domain datastore, the model is comparable to or even better than the model with an oracle test-domain datastore. We also find that, on domains that are unseen during training and not part of the datastore, using a multi-domain datastore consistently outperforms an oracle single-domain datastore. Together, our results show that kNN-LM is highly robust in out-of-distribution generalization and can effectively target many domains at once, without the oracle domain knowledge assumptions included in all previous work.
Submission Number: 34
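
As a rough illustration of the setup the abstract describes, the sketch below pools two toy per-domain datastores into one multi-domain datastore and applies the standard kNN-LM interpolation, p(y|x) = lam * p_kNN(y|x) + (1 - lam) * p_LM(y|x). Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the two-dimensional keys, the squared-L2 distance, the temperature, the value of lam, and the toy data are all hypothetical.

```python
import math


def knn_distribution(query, datastore, k, vocab_size, temperature=1.0):
    """Build p_kNN over the vocabulary from the k nearest datastore entries.

    datastore is a non-empty list of (key_vector, next_token_id) pairs.
    """
    scored = []
    for key, token in datastore:
        # Squared L2 distance between the query and the stored key.
        dist = sum((q - x) ** 2 for q, x in zip(query, key))
        scored.append((dist, token))
    scored.sort(key=lambda pair: pair[0])
    neighbors = scored[:k]

    # Softmax over negative distances, shifted by the smallest distance
    # for numerical stability.
    min_dist = neighbors[0][0]
    weights = [math.exp(-(d - min_dist) / temperature) for d, _ in neighbors]
    total = sum(weights)

    # Neighbors that share a next token pool their probability mass.
    probs = [0.0] * vocab_size
    for weight, (_, token) in zip(weights, neighbors):
        probs[token] += weight / total
    return probs


def interpolate(p_lm, p_knn, lam):
    """Mix the parametric LM distribution with the retrieval distribution."""
    return [lam * pk + (1.0 - lam) * pl for pl, pk in zip(p_lm, p_knn)]


# Hypothetical per-domain datastores (toy keys and token ids).
domain_a = [([0.0, 0.0], 1), ([0.1, 0.0], 1)]
domain_b = [([5.0, 5.0], 2), ([5.0, 5.1], 0)]

# The multi-domain datastore is simply the union; no domain label is
# needed at query time.
multi_domain = domain_a + domain_b

p_lm = [0.5, 0.25, 0.25]  # toy parametric LM distribution over 3 tokens
p_knn = knn_distribution([0.0, 0.1], multi_domain, k=2, vocab_size=3)
p_final = interpolate(p_lm, p_knn, lam=0.25)
```

In this toy case the query lies near the entries from domain_a, so retrieval over the pooled datastore selects them without being told which domain the query belongs to. A real system would use LM hidden states as keys and an approximate nearest-neighbor index rather than a linear scan.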