Keywords: learning from biased data, strategic learning, missing data, MNAR, self-selection, econometrics
Abstract: When applying for jobs across several similar firms, applicants often have the option to exclude certain features from a CV, e.g., a photo, GPA, or standardized test scores. If applicants want the best possible income offer and can submit multiple applications to similar positions, they may include or exclude different optional features on different applications to see which yields the best result, eventually accepting the highest offer. But if an analyst then wants to estimate what makes a good worker using the applications (features) and incomes (outcomes) of the finally accepted offers, she faces an endogeneity problem. The excluded features, which we term "obscured", are missing not at random (MNAR), so simple imputation methods such as the conditional expectation yield biased estimates. We formalize this problem and present a preliminary result in which we reduce our obscured setting to a high-dimensional instantiation of the setting from Cherapanamjeri et al. Unfortunately, this reduction increases the number of variables by an amount combinatorial in the dimension of the problem, meaning the algorithmic tool for this setting will not be efficient in the original parameters. We present possible next steps, such as approximate SGD on the MLE and kernelization, to get around the increase in variables.
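The self-selection mechanism described in the abstract can be illustrated with a small simulation. This is only a minimal sketch under assumptions that are not taken from the paper: a single optional feature `x` drawn from a standard normal, a linear offer rule with coefficient `beta`, a firm that prices an obscured feature at its prior mean of zero, and exactly two applications per applicant (one revealing `x`, one obscuring it). All names and parameter values are hypothetical.

```python
import numpy as np


def simulate(n=100_000, beta=1.0, noise_sd=0.5, seed=0):
    """Simulate applicants who accept the better of two offers.

    Assumptions (illustrative, not from the paper):
      - x ~ N(0, 1) is the single optional feature.
      - A revealing application receives beta * x + noise.
      - An obscuring application receives noise only, i.e. the firm
        prices the hidden feature at its prior mean of 0.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    offer_revealed = beta * x + rng.normal(scale=noise_sd, size=n)
    offer_obscured = rng.normal(scale=noise_sd, size=n)

    # The applicant accepts the higher offer; the analyst only sees that one.
    obscured = offer_obscured > offer_revealed
    accepted = np.where(obscured, offer_obscured, offer_revealed)
    return x, obscured, accepted


x, obscured, accepted = simulate()

# True mean of the feature among applicants whose accepted offer hid it,
# versus the mean among applicants whose accepted offer showed it.
mean_hidden = x[obscured].mean()
mean_shown = x[~obscured].mean()

print(f"share obscured:        {obscured.mean():.3f}")
print(f"mean x when obscured:  {mean_hidden:.3f}")
print(f"mean x when revealed:  {mean_shown:.3f}")
```

Under these assumptions, applicants with low `x` disproportionately end up accepting the offer that hid it, so the hidden values have a lower mean than the revealed ones. Filling the missing entries with the mean of the observed values would therefore overstate them, which is the kind of bias the abstract attributes to simple imputation.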
Submission Number: 152