Towards Backwards-Compatible Data with Confounded Domain Adaptation

TMLR Paper2911 Authors

23 Jun 2024 (modified: 28 Jun 2024) · Under review for TMLR · License: CC BY-SA 4.0
Abstract: Most current domain adaptation methods address either covariate shift or label shift, but are not applicable when the two occur simultaneously and are confounded with each other. Domain adaptation approaches that do account for such confounding are designed to adapt covariates to optimally predict a particular label whose shift is confounded with covariate shift. In this paper, we instead seek to achieve general-purpose data backwards compatibility, which would allow the adapted covariates to be used for a variety of downstream problems, including pre-existing prediction models and data analytics tasks. To do this, we consider a modification of generalized label shift (GLS), which we call confounded shift. We present a novel framework for this problem based on minimizing the expected divergence between the source and target conditional distributions, conditioning on possible confounders. Within this framework, we provide concrete implementations using the Gaussian reverse Kullback-Leibler divergence and the maximum mean discrepancy. Finally, we demonstrate our approach on synthetic and real datasets.
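A minimal illustrative formulation of the objective sketched in the abstract, assuming an adaptation map $g$ that transports target covariates into the source domain; the notation ($p_S$, $p_T$, $g$, $D$) and the exact arrangement of the divergence arguments are our assumptions for illustration, not quoted from the paper:

$$\hat{g} \;=\; \arg\min_{g} \; \mathbb{E}_{z \sim p(Z)} \Big[ D\big( p_S(X \mid Z = z),\; p_T(g(X) \mid Z = z) \big) \Big],$$

where the divergence $D$ is instantiated as either the Gaussian reverse Kullback-Leibler divergence or the maximum mean discrepancy.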
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=ObBRkCcKRM
Changes Since Last Submission: We changed the method to make it simpler, faster, and more flexible. Our updated software has a scikit-learn API and a modular implementation with PyTorch dataloader and optimizer components. Instead of a variety of provided conditional estimators (linear Gaussian, GMM, and GP), we now use MICE-Forest with LightGBM for conditional generative modeling. These changes produced better experimental results, nearly eliminated the need for hyperparameter tuning, simplified the presentation of our Method section, and opened the door to future work with deep learning-based adaptations. We expanded the experiments to compare performance on California Housing data, ANSUR II anthropometric data, and SNAREseq single-cell multi-omics data. We also made both the new and the previous experiments more thorough, notably showing superior performance when applying adapted data to a fixed pre-existing classifier. Finally, we have rewritten the entire manuscript to improve clarity and address reviewer concerns from the previous submission.
(1) We added a motivating example section that concretely describes the problem setting. Notably, we motivate the "fixed pre-existing classifier" setting, which rules out training on the new target dataset even when the confounder happens to be the label (i.e., Z = Y).
(2) We switched to standard nomenclature for source and target domains, and changed notation to use Z and Y to distinguish between confounders and downstream target variables.
(3) We updated the Experiments section to improve the clarity of the figures.
(4) We rewrote the Introduction and Related Work sections to compare more fully to prior work, and the Discussion section to clarify the limitations of our approach.
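As a rough illustration of the software design described above, the sketch below shows what a scikit-learn-style adapter that conditions on a confounder Z could look like. The class name ConditionalShiftAdapter, its fit/transform signatures, and the simple conditional-mean shift using LightGBM regressors are hypothetical stand-ins for illustration only; they are not the authors' actual API, nor their divergence-minimization procedure.

```python
# Hypothetical sketch only: names and the conditional-mean shift below are
# illustrative stand-ins, not the authors' actual API or method.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from lightgbm import LGBMRegressor


class ConditionalShiftAdapter(BaseEstimator, TransformerMixin):
    """Aligns target covariates with the source domain, conditional on Z."""

    def __init__(self, n_estimators=100):
        self.n_estimators = n_estimators

    def fit(self, X_source, Z_source, X_target, Z_target):
        # Fit one conditional-mean model per covariate in each domain
        # (Z arrays are 2-D: one row per sample, one column per confounder).
        self.source_models_, self.target_models_ = [], []
        for j in range(X_source.shape[1]):
            src = LGBMRegressor(n_estimators=self.n_estimators)
            tgt = LGBMRegressor(n_estimators=self.n_estimators)
            src.fit(Z_source, X_source[:, j])
            tgt.fit(Z_target, X_target[:, j])
            self.source_models_.append(src)
            self.target_models_.append(tgt)
        return self

    def transform(self, X_target, Z_target):
        # Shift each target covariate by the gap between the source and target
        # conditional means at the observed confounder values; this is a crude
        # location-only stand-in for matching the full conditional distributions.
        X_adapted = np.array(X_target, dtype=float)
        for j, (src, tgt) in enumerate(zip(self.source_models_, self.target_models_)):
            X_adapted[:, j] += src.predict(Z_target) - tgt.predict(Z_target)
        return X_adapted
```

In this sketch, the adapted target data could then be fed directly to a classifier or analysis pipeline that was fit on the source domain, which is the "fixed pre-existing classifier" use case motivated above.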
Assigned Action Editor: ~Rémi_Flamary1
Submission Number: 2911