Keywords: Causal Representation Learning Benchmarks, Domain Generalization Benchmarks
TL;DR: We study when domain generalization datasets can serve as benchmarks for causal representation learning; we give evidence that datasets where ERM solutions do not exhibit "accuracy on the line" are necessary.
Abstract: Benchmarking causal representation learning in real-world high-dimensional settings, where most relevant causal variables are not directly observed, remains a challenge. Notably, one promise of causal representations is their robustness to interventions, enabling models to generalize effectively under distribution shifts---domain generalization. Given this connection, we ask to what extent domain generalization performance can serve as a reliable proxy task/benchmark for causal representation learning on such complex datasets. In this work, we provide theoretical evidence that a domain generalization task is a reliable proxy when non-causal correlations with labels/outcomes in-distribution are reversed, or have a sufficiently reduced signal-to-noise ratio, out-of-distribution. Additionally, we demonstrate that benchmarks with this reversal do not exhibit strong positive correlations between in-distribution (ID) and out-of-distribution (OOD) accuracy, commonly called "accuracy on the line." Finally, we evaluate our derived conditions on state-of-the-art domain generalization benchmarks to identify effective proxy tasks for causal representation learning.
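For concreteness, here is a minimal sketch of how the "accuracy on the line" property can be checked empirically, assuming a pool of trained models with measured ID and OOD accuracies (the accuracy values below are hypothetical, not results from the paper):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-model accuracies on a benchmark: each index is one
# trained model's (ID accuracy, OOD accuracy) pair.
id_acc = np.array([0.91, 0.88, 0.85, 0.93, 0.80])
ood_acc = np.array([0.62, 0.71, 0.58, 0.55, 0.66])

# "Accuracy on the line" refers to a strong positive linear correlation
# between ID and OOD accuracy across models; a weak or negative
# correlation is consistent with the reversed / low-SNR non-causal
# correlations described in the abstract.
r, p = pearsonr(id_acc, ood_acc)
print(f"ID-OOD Pearson r = {r:.2f} (p = {p:.3f})")
```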
Submission Number: 33