Unveiling Extraneous Sampling Bias with Data Missing-Not-At-Random

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Missing-Not-At-Random, Sampling Bias
Abstract: Selection bias poses a widely recognized challenge for unbiased evaluation and learning in many industrial scenarios. In recommender systems, for example, it arises from users' selective interactions with items. Recently, doubly robust (DR) learning and its variants have been widely studied to achieve debiased training of prediction models. However, all of these methods assume a simple exact-matching scenario, i.e., the units (such as user-item pairs in a recommender system) are the same in the training and test sets. In practice, there may be limited or even no overlap in units between training and test. In this paper, we consider a more practical scenario in which the joint distribution of features and ratings is the same in the training and test sets. Theoretical analysis shows that the existing DR estimator is biased in this scenario even when the imputed errors and learned propensities are correct. We therefore propose a novel super-population doubly robust estimator (SuperDR), which achieves more accurate estimation and a more desirable generalization error bound than existing DR estimators, and we extend the joint learning algorithm for training the prediction and imputation models. We conduct extensive experiments on three real-world datasets, including a large-scale industrial dataset, to show the effectiveness of our method. The code is available at https://github.com/ChunyuanZheng/neurips-25-SuperDR.
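For context, below is a minimal sketch of the standard doubly robust (DR) estimator that the abstract builds on: imputed errors over all user-item pairs are combined with an inverse-propensity-weighted correction on the observed pairs. This is not taken from the paper's released code, and it does not reproduce the proposed SuperDR estimator; all names and the toy data are illustrative assumptions.

```python
import numpy as np

def dr_estimate(errors, imputed_errors, propensities, observed):
    """Standard DR estimate of the average prediction error (sketch).

    errors:          true prediction errors e_{u,i}; used only where observed
    imputed_errors:  imputed errors e_hat_{u,i} for every user-item pair
    propensities:    learned observation probabilities p_hat_{u,i}
    observed:        1 if rating (u, i) was observed, else 0
    """
    # Imputed error everywhere, plus a propensity-weighted residual
    # correction on the pairs whose ratings were actually observed.
    correction = observed * (errors - imputed_errors) / propensities
    return float(np.mean(imputed_errors + correction))

# Toy usage: 4 users x 5 items, ratings observed with varying propensity.
rng = np.random.default_rng(0)
errors = rng.random((4, 5))
imputed = errors + rng.normal(scale=0.1, size=(4, 5))
props = np.clip(rng.random((4, 5)), 0.1, 1.0)
obs = (rng.random((4, 5)) < props).astype(float)
print(dr_estimate(errors, imputed, props, obs))
```

Under the exact-matching assumption, this estimator is unbiased whenever either the imputed errors or the learned propensities are correct; the abstract's point is that this guarantee breaks down once training and test units no longer coincide, which is the setting SuperDR addresses.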
Primary Area: Probabilistic methods (e.g., variational inference, causal inference, Gaussian processes)
Submission Number: 19687