
# Data Preparation

Our datasets are generated with the following procedures.

## SPMotif Datasets

We adapt the code of DIR to generate the SPMotif datasets.
`SPMotif-Struc` is essentially the same as the SPMotif datasets in DIR and
can be generated by running `dataset_gen/gen_struc.py`,
with the bias configured through the value of `global_b`:

```bash
cd dataset_gen
python gen_struc.py
```

The generated data are stored in `./data/SPMotif-{global_b}` under the root directory of this repo.
To use the dataset in `main.py`, set the `--dataset` option to `mSPMotif` and the `--bias` option to the corresponding bias.

To generate the `SPMotif-Mixed` datasets, run the analogous script,
again with the bias configured through the value of `global_b`:

```bash
cd dataset_gen
python gen_mixed.py
```

The generated data are stored in `./data/mSPMotif-{global_b}` under the root directory of this repo.
`gen_mixed.py` adds the graph size shifts and structure-level shifts, while `./datasets/spmotif_dataset.py`
automatically adds the node feature-level shifts during data preparation.
To use the dataset in `main.py`, set the `--dataset` option to `mSPMotif` and the `--bias` option to the corresponding bias.
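
A quick pre-flight check before calling `main.py` can catch a missing dataset directory early. The sketch below is not part of this repo: the helper names `spmotif_data_dir` and `missing_spmotif_dirs` are hypothetical, and it assumes that `{global_b}` appears in the directory name exactly as Python prints the number.

```python
from pathlib import Path


def spmotif_data_dir(global_b, mixed=False, root="./data"):
    """Build the directory name described above for a given bias.

    Assumption: the generators format `global_b` the same way Python's
    default string conversion does (e.g. 0.33 -> "0.33").
    """
    prefix = "mSPMotif" if mixed else "SPMotif"
    return Path(root) / f"{prefix}-{global_b}"


def missing_spmotif_dirs(biases, mixed=False, root="./data"):
    """Return the expected dataset directories that do not exist yet."""
    expected = [spmotif_data_dir(b, mixed=mixed, root=root) for b in biases]
    return [str(path) for path in expected if not path.is_dir()]
```

Calling `missing_spmotif_dirs([0.33], mixed=True)` from the repo root, for example, returns an empty list once `gen_mixed.py` has been run for that bias; the value `0.33` is only an illustrative bias.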

## DrugOOD Datasets

To obtain the DrugOOD datasets tested in our paper, i.e., `drugood_lbap_core_ic50_assay`, `drugood_lbap_core_ic50_scaffold`
and `drugood_lbap_core_ic50_size`,
we use the DrugOOD curation code at commit `eeb00b8da7646e1947ca7aec93041052a48bd45e` with the `chembl_29` database.
After curating the datasets, put the corresponding JSON files under `./data/DrugOOD`
and set the `--dataset` option to the corresponding dataset name, e.g., `drugood_lbap_core_ic50_assay`.
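
To confirm that the curated files landed in the right place, a small check along the following lines may help. It is a sketch rather than part of this repo: the helper name `find_drugood_files` is hypothetical, and it assumes that each curated JSON file name contains the dataset name without the `drugood_` prefix, which may not match your curation output.

```python
from pathlib import Path

# Dataset names as listed above.
DRUGOOD_DATASETS = (
    "drugood_lbap_core_ic50_assay",
    "drugood_lbap_core_ic50_scaffold",
    "drugood_lbap_core_ic50_size",
)


def find_drugood_files(root="./data/DrugOOD"):
    """Map each dataset name to the JSON files that appear to belong to it.

    Assumption: a curated file is recognised by its stem containing the
    dataset name with the leading "drugood_" stripped.
    """
    json_files = sorted(Path(root).glob("*.json"))
    found = {}
    for name in DRUGOOD_DATASETS:
        key = name[len("drugood_"):]
        found[name] = [p.name for p in json_files if key in p.stem]
    return found
```

A dataset that maps to an empty list has no matching file under `./data/DrugOOD` yet.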

## CMNIST-sp

The CMNIST dataset is generated following Invariant Risk Minimization (IRM)
and then converted into graphs using the SLIC superpixel algorithm.
To generate the dataset, run the following commands:

```bash
cd dataset_gen
python prepare_mnist.py  --dataset 'cmnist'  -t 8 -s 'train'
python prepare_mnist.py  --dataset 'cmnist'  -t 8 -s 'test'
```

The generated data are put into `./data/CMNISTSP` under the root directory of this repo.
Note that two auxiliary datasets, `./data/MNIST` and `./data/ColoredMNIST`, are also created as the basis for generating `./data/CMNISTSP`.
To use the dataset, set the `--dataset` option to `CMNIST`.

## Graph-SST5 and Twitter

Both `Graph-SST5` and `Twitter` are based on the datasets provided by [DIG](https://github.com/divelab/DIG).
To get the datasets, download them via this [link](https://drive.google.com/drive/folders/1dt0aGMBvCEUYzaG00TYu1D03GPO7305z)
provided by DIG and the GNN explainability survey authors.
Then unzip the data into `./data/Graph-SST2/raw` and `./data/Graph-Twitter/raw`.
When `--dataset` is set to the dataset name in `main.py`, the data loading process
adds the degree biases automatically.

## NCI1, NCI109, PROTEINS and DD

We use the datasets provided by the [size-invariant-GNNs](https://github.com/PurdueMINDS/size-invariant-GNNs) authors,
who have already sampled the datasets with graph size distribution shifts injected.
The datasets can be downloaded via this [link](https://www.dropbox.com/s/38eg3twe4dd1hbt/data.zip).
After downloading, unzip the datasets into `./data/TU`.
To use the datasets, set `--dataset` to the dataset name in `main.py`.
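
If a command-line `unzip` is not available, the extraction step can be sketched with Python's standard `zipfile` module. The helper name `extract_archive` is hypothetical and not part of this repo. The sketch extracts the archive as-is, so if the archive wraps its contents in a top-level folder you may need to move the files up one level afterwards.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir="./data/TU"):
    """Extract a downloaded archive into the target directory.

    The internal layout of the archive is kept unchanged; the sorted
    list of member names is returned so the result can be inspected.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())
```

For example, `extract_archive("data.zip")` run from the repo root unpacks the downloaded file into `./data/TU` and returns the member names for a quick look at what was extracted.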
