Keywords: causal inference, measurement calibration, causal representation learning
Abstract: Aggregate outcome variables collected through surveys and administrative records are often subject
to systematic measurement error. For instance, in disaster loss databases, reported county-level losses
may differ from the true damages due to variations in on-the-ground data collection capacity,
reporting practices, and event characteristics. Such miscalibration complicates downstream analysis
and decision-making. We study the problem of outcome miscalibration and propose a framework
guided by proxy variables for estimating and correcting the systematic errors. We model the
data-generating process using a causal graph that separates latent content variables driving the true
outcome from latent bias variables that induce systematic errors. The key insight is that proxy
variables that depend on the true outcome but are independent of the bias mechanism provide identifying
information for quantifying the bias. Leveraging this structure, we introduce a two-stage
approach that utilizes variational autoencoders to disentangle content and bias latents, enabling us
to estimate the effect of bias on the outcome of interest. We analyze the assumptions underlying
our approach and evaluate it on synthetic data, semi-synthetic datasets derived from randomized
trials, and a real-world case study of disaster loss reporting. Our code will be publicly available.
Supplementary Material: zip
Pmlr Agreement: pdf
Submission Number: 95