Adaptive robust integration of internal data with external summaries under distributional shift

18 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: Data Integration; Distributionally Robust Optimization; Distributional Shift; Empirical Bayes; Ensemble Methods.
TL;DR: We present a robust method for integrating internal individual-level data with external summary-level data when covariate spaces differ and distributions shift.
Abstract: Integrating evidence from heterogeneous datasets is challenging when predictor spaces differ and data distributions shift. Large datasets such as biobanks (referred to here as external data) offer substantial sample sizes but often lack in-depth information due to cost constraints. In contrast, internal datasets from smaller analytic studies provide richer, individual-level detail. We propose a general Distributionally Robust Optimization (DRO) framework for integrating internal individual-level data with external summary-level data under distributional shift. Our method minimizes a Cressie-Read divergence between a full model (fit to internal data with many predictors) and a reduced model (estimated from external data with fewer predictors), using a specialized nested-iteration algorithm. While effective under moderate shift, standard DRO can degrade when the distributional shift is severe. To mitigate this, we introduce Empirical Bayes DRO (EB-DRO), which stabilizes estimates by adaptively shrinking them toward the internal-only solution. We further develop an ensemble EB-DRO method that aggregates across multiple divergence families to improve robustness without having to select a single best family. Our proposed methods preserve privacy by operating only on external summary statistics, support robust integration under shift, and enable valid inference when no shift is present. Simulations show that DRO improves over internal-only estimates under light shift, EB-DRO adds stability under greater shift, and ensemble EB-DRO achieves the most consistent robustness overall.
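For readers unfamiliar with the divergence family named in the abstract, the following is a minimal sketch of the standard Cressie-Read power divergence between two discrete distributions. It is not the authors' implementation: the abstract does not state the paper's exact parameterization, so the function name `cressie_read`, the index `lam`, and the example vectors are all illustrative assumptions. The formula used is the textbook one, I_lam(p, q) = sum_i p_i [(p_i / q_i)^lam - 1] / (lam (lam + 1)), with the limits lam -> 0 and lam -> -1 giving the two Kullback-Leibler divergences.

```python
import numpy as np

def cressie_read(p, q, lam):
    """Textbook Cressie-Read power divergence between discrete distributions.

    Assumes p and q are strictly positive and each sums to 1.
    `lam` is the family index (hypothetical name, not taken from the paper).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if np.isclose(lam, 0.0):
        # Limit lam -> 0: Kullback-Leibler divergence KL(p || q)
        return float(np.sum(p * np.log(p / q)))
    if np.isclose(lam, -1.0):
        # Limit lam -> -1: reverse Kullback-Leibler divergence KL(q || p)
        return float(np.sum(q * np.log(q / p)))
    return float(np.sum(p * ((p / q) ** lam - 1.0)) / (lam * (lam + 1.0)))

# Illustrative inputs only: uniform weights versus a shifted distribution.
p = np.full(4, 0.25)
q = np.array([0.1, 0.2, 0.3, 0.4])
values = {lam: cressie_read(p, q, lam) for lam in (-1.0, 0.0, 1.0, 2.0)}
```

Varying `lam` sweeps through several familiar divergences (for example, `lam = 1` is half the Pearson chi-square statistic), which is one way to picture what aggregating "across multiple divergence families" could range over.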
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 10204