A Large Scale Search Dataset for Unbiased Learning to RankDownload PDF

Published: 17 Sept 2022, Last Modified: 12 Mar 2024NeurIPS 2022 Datasets and Benchmarks Readers: Everyone
Keywords: unbiased learning to rank
Abstract: The unbiased learning to rank (ULTR) problem has been greatly advanced by recent deep learning techniques and well-designed debias algorithms. However, promising results on the existing benchmark datasets may not be extended to the practical scenario due to some limitations of existing datasets. First, their semantic feature extractions are outdated while state-of-the-art large-scale pre-trained language models like BERT cannot be utilized due to the lack of original text. Second, display features are incomplete; thus in-depth study on ULTR is impossible such as the displayed abstract for analyzing the click necessary bias. Third, synthetic user feedback has been adopted by most existing datasets and real-world user feedback is greatly missing. To overcome these disadvantages, we introduce the Baidu-ULTR dataset. It involves randomly sampled 1.2 billion searching sessions and 7,008 expert annotated queries(397,572 query document pairs). Baidu-ULTR is the first billion-level dataset for ULTR. Particularly, it offers: (1)the original semantic features and pre-trained language models of different sizes; (2)sufficient display information such as position, displayed height, and displayed abstract, enabling the comprehensive study of multiple displayed biases; and (3)rich user feedback on search result pages (SERPs) like dwelling time, allowing for user engagement optimization and promoting the exploration of multi-task learning in ULTR. Furthermore, we present the design principle of Baidu-ULTR and the performance of representative ULTR algorithms on Baidu-ULTR. The Baidu-ULTR dataset and corresponding baseline implementations are available at https://github.com/ChuXiaokai/baidu_ultr_dataset. The dataset homepage is available at https://searchscience.baidu.com/dataset.html.
Author Statement: Yes
License: The dataset can be freely downloaded and noncommercially used with a custom license CC BY-NC 4.0.
TL;DR: we introduce a new large-scale unbiased learning to rank dataset with rich real-world user feedback and sufficient display information.
URL: https://github.com/ChuXiaokai/baidu_ultr_dataset
Dataset Url: https://searchscience.baidu.com/dataset.html
Supplementary Material: pdf
Contribution Process Agreement: Yes
In Person Attendance: Yes
Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 3 code implementations](https://www.catalyzex.com/paper/arxiv:2207.03051/code)
16 Replies