Keywords: retrieval, embeddings, large language models
Abstract: Modern retrieval models are typically pretrained on masked language modeling (MLM) or causal language modeling (CLM) tasks, where retrieval capabilities emerge only incidentally. Achieving state-of-the-art performance requires finetuning these models on supervised query-document pairs, which are expensive to curate and scarce for specialized domains. We propose a scalable pretraining approach that directly targets retrieval tasks using a fully self-supervised pipeline. By leveraging web-scale text corpora, we construct retrieval pairs without manual annotation. Our method employs a novel contrastive objective that aligns prefix embeddings generated by a causal transformer with suffix embeddings from an anti-causal transformer. This enables the model to learn fine-grained associations between queries and their completions, analogous to the next-token prediction paradigm in generative models. Our results demonstrate the viability of this method, suggesting that it can bridge the data scarcity gap in information retrieval and imbue retrieval models with the zero-shot reasoning capabilities typically reserved for generative language models.
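The contrastive objective described in the abstract can be sketched as an in-batch InfoNCE loss, where the matching prefix-suffix pairs along the diagonal of a similarity matrix serve as positives and all other pairings in the batch as negatives. The function below is a minimal, hypothetical illustration (the function name, `temperature` value, and NumPy formulation are assumptions, not the paper's implementation):

```python
import numpy as np

def prefix_suffix_contrastive_loss(prefix_emb, suffix_emb, temperature=0.07):
    """Hypothetical sketch of an in-batch contrastive (InfoNCE) loss.

    prefix_emb: (batch, dim) embeddings from the causal transformer.
    suffix_emb: (batch, dim) embeddings from the anti-causal transformer.
    Row i of each matrix is assumed to come from the same document,
    so the diagonal of the similarity matrix holds the positive pairs.
    """
    # L2-normalize so the dot product is cosine similarity.
    p = prefix_emb / np.linalg.norm(prefix_emb, axis=1, keepdims=True)
    s = suffix_emb / np.linalg.norm(suffix_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix, scaled by temperature.
    logits = (p @ s.T) / temperature

    # Numerically stable log-softmax over each row.
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))

    # Cross-entropy against the diagonal (matching) targets.
    return -np.mean(np.diag(log_probs))
```

Intuitively, this rewards the causal encoder for mapping a prefix close to the anti-causal encoding of its own suffix and far from the suffixes of other documents in the batch, which is how the loss mirrors next-token prediction at the representation level.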
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 9980