PRADA: Pre-Train Ranking Models With Diverse Relevance Signals Mined From Search Logs

Published: 01 Jan 2025, Last Modified: 20 May 2025, IEEE Trans. Knowl. Data Eng. 2025, CC BY-SA 4.0
Abstract: Existing studies have shown that pre-trained ranking models outperform pre-trained language models on ranking tasks. To pre-train such models, researchers have utilized large-scale search logs and clicks as weakly supervised signals of query-document relevance. However, search logs are incomplete and sparse: different users with the same intent tend to phrase their queries in various forms, so recorded clicks can hardly cover the diverse relevance patterns between queries and documents. Moreover, the diverse intentions of a large user base lead to a long-tail distribution of search intents, and deriving sufficient relevance signals from the sparse clicks of these long-tail intents poses another challenge. There is therefore significant potential in exploring richer relevance signals beyond direct clicks to pre-train high-quality ranking models. To tackle this problem, we develop two exploratory data augmentation strategies that consider the diversity of query forms from local and global perspectives, thereby mining potential and diverse relevance signals from search logs. A generative augmentation strategy is further devised to create supplementary positive samples, enhancing ranking ability for long-tail query intents. We leverage a multi-level pairwise ranking objective and a contrastive learning approach to enable our model to capture fine-grained relevance patterns and remain robust to noisy training samples. Experimental results on a large-scale public dataset and a commercial dataset confirm that our model, PRADA, yields better ranking effectiveness than existing pre-trained ranking models.
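To make the training objectives concrete, the sketch below illustrates what a multi-level pairwise ranking loss combined with an InfoNCE-style contrastive term could look like. This is a minimal illustration, not PRADA's actual implementation: the function names, the assignment of relevance levels (e.g., click = 2, augmented positive = 1, negative = 0), the hinge margin, and the temperature are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def multilevel_pairwise_loss(scores, levels, margin=1.0):
    """Hinge-style pairwise loss over all document pairs whose relevance levels differ.

    scores: (N,) predicted relevance scores for the candidates of one query.
    levels: (N,) integer relevance levels, e.g. 2 = click, 1 = augmented positive, 0 = negative
            (illustrative assumption, not the paper's exact level scheme).
    """
    diff_scores = scores.unsqueeze(1) - scores.unsqueeze(0)   # [i, j] = s_i - s_j
    diff_levels = levels.unsqueeze(1) - levels.unsqueeze(0)   # [i, j] = l_i - l_j
    mask = (diff_levels > 0).float()                          # keep pairs where doc i should outrank doc j
    losses = F.relu(margin - diff_scores) * mask              # penalize pairs ranked in the wrong order
    return losses.sum() / mask.sum().clamp(min=1.0)


def contrastive_loss(query_emb, pos_emb, neg_embs, temperature=0.1):
    """InfoNCE-style contrastive term: pull the positive document toward its query.

    query_emb: (B, d), pos_emb: (B, d), neg_embs: (B, K, d).
    """
    pos_sim = F.cosine_similarity(query_emb, pos_emb, dim=-1) / temperature                    # (B,)
    neg_sim = F.cosine_similarity(query_emb.unsqueeze(1), neg_embs, dim=-1) / temperature      # (B, K)
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1)                                 # positive is class 0
    labels = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    # Toy usage with random tensors, purely to show the expected shapes.
    scores = torch.randn(5)
    levels = torch.tensor([2, 1, 1, 0, 0])
    q, p, n = torch.randn(4, 64), torch.randn(4, 64), torch.randn(4, 8, 64)
    total = multilevel_pairwise_loss(scores, levels) + contrastive_loss(q, p, n)
    print(total.item())
```

In such a setup, the pairwise term exploits the graded relevance levels produced by the augmentation strategies, while the contrastive term provides robustness to noisy positives; how the two terms are weighted and combined is left unspecified here.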