Pretrained deep models outperform GBDTs in Learning-To-Rank under label scarcity

21 Sept 2023 (modified: 11 Feb 2024)Submitted to ICLR 2024EveryoneRevisionsBibTeX
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Learning-To-Rank, Gradient Boosted Decision Trees, Deep learning, Self-supervised learning, Unsupervised pretraining
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: Unsupervised pretraining can enable deep rankers to outperform GBDT-based rankers when unlabeled data outnumbers labeled data.
Abstract: While deep learning (DL) models are state-of-the-art in text and image domains, they have not yet consistently outperformed Gradient Boosted Decision Trees (GBDTs) on tabular Learning-To-Rank (LTR) problems. Most of the recent performance gains attained by DL models in text and image tasks have used unsupervised pretraining, which exploits orders of magnitude more unlabeled data than labeled data. To the best of our knowledge, unsupervised pretraining has not been applied to the LTR problem, which often produces vast amounts of unlabeled data. In this work, we study whether unsupervised pretraining of deep models can improve LTR performance over GBDTs and other non-pretrained models. By incorporating simple design choices--including SimCLR-Rank, an LTR-specific pretraining loss--we produce pretrained deep learning models that consistently (across datasets) outperform GBDTs (and other non-pretrained rankers) in the case where there is more unlabeled data than labeled data. This performance improvement occurs not only on average but also on outlier queries. We base our empirical conclusions off of experiments on (1) public benchmark tabular LTR datasets, and (2) a large industry-scale proprietary ranking dataset. Code is provided at https://anonymous.4open.science/r/ltr-pretrain-0DAD/README.md.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2970
Loading