Consistent Data Distribution Sampling for Large-scale Retrieval

Published: 01 Feb 2023, Last Modified: 13 Feb 2023, Submitted to ICLR 2023, Readers: Everyone
Keywords: Retrieval, Neural Networks, Deep Learning, Recommender Systems, Information Systems
TL;DR: A novel negative sampling strategy to tackle the training-inference inconsistency of data distributions in large-scale retrieval.
Abstract: Retrieving candidate items with low latency and computational cost is important for large-scale advertising systems. Negative sampling is a general approach to modeling million-scale items with rich features in retrieval. A key challenge is the training-inference inconsistency of data distribution introduced by sampling negatives. In this work, we propose a novel negative sampling strategy, Consistent Data Distribution Sampling (CDDS), to address this issue. Specifically, we employ a relatively large set of uniform training negatives and batch negatives to adequately train long-tail and hot items respectively, and employ high-divergence negatives to improve learning convergence. To make the distribution of these training samples approximate the item distribution at serving time, we introduce an auxiliary loss based on an asynchronously updated item embedding matrix over the entire item pool. Offline experiments on real-world datasets show that our method achieves state-of-the-art performance. Online experiments in multiple advertising scenarios show that our method achieves significant increases in GMV. The source code will be released in the future.
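
A minimal PyTorch sketch of the loss structure suggested by the abstract: in-batch negatives, uniformly sampled negatives, high-divergence (hard) negatives, and an auxiliary softmax over an asynchronously refreshed embedding matrix of the full item pool. The tower models, the hard-negative selection rule, the temperature, and the auxiliary loss weight below are illustrative assumptions, not the authors' exact design.

import torch
import torch.nn.functional as F

def cdds_style_loss(user_emb, pos_item_emb, pos_item_idx, uniform_neg_emb,
                    full_item_table, num_hard=16, temperature=0.07, aux_weight=0.1):
    """user_emb:        [B, D] query-tower outputs
    pos_item_emb:    [B, D] item-tower outputs of the positive items
    pos_item_idx:    [B]    indices of the positives in the full item pool
    uniform_neg_emb: [B, N, D] uniformly sampled negatives (long-tail coverage)
    full_item_table: [I, D] asynchronously refreshed embeddings of all items
    """
    B = user_emb.size(0)

    # In-batch negatives: other examples' positives, which skew toward hot items.
    batch_logits = user_emb @ pos_item_emb.t() / temperature                  # [B, B]

    # Uniformly sampled negatives, covering long-tail items.
    uniform_logits = torch.einsum('bd,bnd->bn', user_emb, uniform_neg_emb) / temperature  # [B, N]

    # High-divergence negatives: assumed here to be the hardest-scoring items from
    # the full pool (in practice the positives would need to be masked out).
    hard_logits = (user_emb @ full_item_table.detach().t()).topk(num_hard, dim=1).values / temperature

    logits = torch.cat([batch_logits, uniform_logits, hard_logits], dim=1)
    labels = torch.arange(B, device=user_emb.device)  # each positive sits on the diagonal of batch_logits
    main_loss = F.cross_entropy(logits, labels)

    # Auxiliary loss over the entire item pool, using the asynchronous (detached)
    # embedding matrix so training approximates the serving item distribution.
    aux_logits = user_emb @ full_item_table.detach().t() / temperature        # [B, I]
    aux_loss = F.cross_entropy(aux_logits, pos_item_idx)

    return main_loss + aux_weight * aux_loss

# Toy usage with random tensors.
B, D, N, I = 4, 32, 64, 1000
loss = cdds_style_loss(torch.randn(B, D), torch.randn(B, D),
                       torch.randint(0, I, (B,)), torch.randn(B, N, D),
                       torch.randn(I, D))

Treating the full-pool embedding matrix as a detached, periodically refreshed table keeps the auxiliary softmax cheap and avoids backpropagating through millions of item embeddings each step.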
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning