Examining the Bias of In-Batch Sampling in Similarity Learning with Two-Tower Models
Abstract: Two-tower models are widely used for applications involving learning similarities between pairs of entities, such as user-item pairs in recommender systems. These models are commonly trained using stochastic gradient methods. However, uniformly sampling data often leads to problematic batches that lack positive pairs, especially when positives are a minority of the dataset—a situation particularly common in similarity learning. Instead, a strategy known as in-batch sampling is widely adopted to guarantee that every batch contains positive pairs and to improve training efficiency. Nevertheless, in-batch sampling introduces its own issues, such as mistaking positives for negatives and oversampling popular pairs, resulting in significant performance degradation. In this work, we provide the first systematic analysis of these issues, showing that they all arise from the inconsistency between the expected objective under in-batch sampling and the full-data objective. We refer to this inconsistency as the bias of in-batch sampling. To validate our analysis, we design an unbiased batch loss and conduct rigorous experiments directly comparing unbiased and biased losses. The results provide strong empirical confirmation of our theoretical findings.
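To make the in-batch sampling setup concrete, the following is a minimal sketch (not the paper's method) of the standard in-batch softmax loss for a two-tower model: each row pairs a query embedding with its positive item embedding, and all other items in the batch serve as negatives. Because those negatives are drawn from the batch's positives rather than uniformly from the item catalog, popular items are oversampled and a batch may contain false negatives, which is exactly the kind of bias the abstract describes. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def in_batch_softmax_loss(query_emb: np.ndarray, item_emb: np.ndarray,
                          temperature: float = 1.0) -> float:
    """In-batch sampled softmax loss (sketch).

    query_emb, item_emb: (B, d) arrays; row i of each forms a positive pair.
    Every other item in the batch is implicitly treated as a negative for
    query i, so the negative distribution follows the batch's positives.
    """
    logits = query_emb @ item_emb.T / temperature        # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy on the diagonal: each query's positive vs. in-batch negatives.
    return float(-np.mean(np.diag(log_probs)))
```

With well-separated embeddings the loss approaches zero; with random embeddings it is positive, reflecting the softmax over in-batch negatives.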
Submission Number: 461