# Modeling the Data-Generating Process is Necessary for Out-of-Distribution Generalization
This is the code repository for the paper *Modeling the Data-Generating Process is Necessary for Out-of-Distribution Generalization*.
This repository is adapted from the PyTorch suite
[DomainBed](https://github.com/facebookresearch/DomainBed) and [OoD-Bench](https://github.com/m-Just/OoD-Bench).

## Abstract
Recent empirical studies on domain generalization (DG) have shown that DG algorithms that perform well on some distribution shifts fail on others, and no state-of-the-art DG algorithm performs consistently well on all shifts. Moreover, real-world data often has multiple distribution shifts over different attributes; hence we introduce multi-attribute distribution shift datasets and find that the accuracy of existing DG algorithms falls even further. To explain these results, we provide a formal characterization of generalization under multi-attribute shifts using a canonical causal graph. Based on the relationship between spurious attributes and the classification label, we obtain realizations of the canonical causal graph that characterize common distribution shifts and show that each shift entails different independence constraints over observed variables. As a result, we prove that any algorithm based on a single, fixed constraint cannot work well across all shifts, providing theoretical evidence for mixed empirical results on DG algorithms. Based on this insight, we develop Causally Adaptive Constraint Minimization (CACM), an algorithm that uses knowledge about the data-generating process to adaptively identify and apply the correct independence constraints for regularization. Results on fully synthetic, MNIST, small NORB, and Waterbirds datasets, covering binary and multi-valued attributes and labels, show that adaptive dataset-dependent constraints lead to the highest accuracy on unseen domains whereas incorrect constraints fail to do so. Our results demonstrate the importance of modeling the causal relationships inherent in the data-generating process.

## Setting up the environment

```
pip install -r requirements.txt
```
## Datasets 
The experiments use the following datasets:
- MNIST
- [small NORB](https://cs.nyu.edu/~ylclab/data/norb-v1.0-small/)
- Waterbirds, formed from [Caltech-UCSD Birds 200](http://www.vision.caltech.edu/visipedia/CUB-200.html) + [Places](http://places2.csail.mit.edu/)


To run the code, enter the `DomainBed` folder and run the training scripts from there. You also need to provide the `data_dir` path where the datasets are stored.

### MNIST
The MNIST dataset in `domainbed/datasets` is downloaded and initialized automatically; no external download is required.

### small NORB
The small NORB files need to be downloaded from [here](https://cs.nyu.edu/~ylclab/data/norb-v1.0-small/). The processing scripts are included in the code and will initialize the dataset.

### Waterbirds
The Waterbirds dataset is generated using the script in the [group_DRO](https://github.com/kohpangwei/group_DRO) repo.

## Training and Evaluation

To launch a sweep over all the hyperparameters:
```
python -m domainbed.scripts.sweep launch \
       --data_dir=/my/datasets/path \
       --output_dir=/my/sweep/output/path \
       --command_launcher MyLauncher \
       --algorithms {algorithm} \
       --datasets {dataset}
```

Here, `MyLauncher` is your cluster's command launcher, as implemented in `command_launchers.py`. The entire sweep trains 3 independent trials x 20 random hyperparameter choices.
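
As a rough illustration of what a command launcher can look like, here is a minimal sketch. It assumes that, as in upstream DomainBed, a launcher is a function that receives a list of shell command strings and is responsible for executing them. The name `sequential_launcher` and the returned list of exit codes are hypothetical and not part of this repository; check `command_launchers.py` for the actual interface and for how launchers are registered.

```python
import subprocess


def sequential_launcher(commands):
    """Hypothetical launcher: run each shell command in order on this machine.

    Assumes a launcher receives a list of command strings, as in upstream
    DomainBed. Returns the exit code of each command for inspection.
    """
    exit_codes = []
    for cmd in commands:
        # shell=True lets each entry be a full command-line string
        exit_codes.append(subprocess.call(cmd, shell=True))
    return exit_codes


if __name__ == "__main__":
    # Two placeholder commands standing in for the sweep's training jobs
    print(sequential_launcher(["exit 0", "exit 3"]))
```

A cluster launcher would follow the same shape but submit each command to the scheduler instead of running it locally.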

To evaluate the results of the sweep:
```
python -m domainbed.scripts.collect_results\
       --input_dir=/my/sweep/output/path
```



