## Installation

This codebase is built on top of
[GraphGym](https://pytorch-geometric.readthedocs.io/en/2.0.0/notes/graphgym.html)
and [GraphGPS](https://github.com/rampasek/GraphGPS). Follow the steps below to
set up dependencies, such as [PyTorch](https://pytorch.org/) and
[PyG](https://pytorch-geometric.readthedocs.io/en/latest/):

```bash
# Create a conda environment for this project
conda create -n gpse python=3.10 -y && conda activate gpse

# Install main dependencies PyTorch and PyG
conda install pytorch=1.13 torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia -y
conda install pyg=2.2 -c pyg -c conda-forge -y
pip install pyg-lib -f https://data.pyg.org/whl/torch-1.13.0+cu117.html

# RDKit is required for OGB-LSC PCQM4Mv2 and datasets derived from it
conda install openbabel fsspec rdkit -c conda-forge -y

# Install the rest of the pinned dependencies
pip install -r requirements.txt

# Clean up cache
conda clean --all -y
```
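
To confirm that the environment picked up the versions pinned above (PyTorch 1.13, PyG 2.2), a quick check along the following lines may help. This is only a sketch: `module_version` is a hypothetical helper written for this example and is not part of the repository.

```python
import importlib


def module_version(name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    # Versions pinned by the install commands above: PyTorch 1.13, PyG 2.2.
    for name in ("torch", "torch_geometric"):
        print(f"{name}: {module_version(name) or 'NOT INSTALLED'}")

    if module_version("torch") is not None:
        import torch

        # Should print True on a machine with a working CUDA setup.
        print("CUDA available:", torch.cuda.is_available())
```

Run it inside the activated `gpse` environment. A `NOT INSTALLED` line means the corresponding install step did not succeed.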


## Quick start

### Download the pre-trained GPSE model or pre-train it from scratch

The pre-trained GPSE encoder can be downloaded from Google Drive using `gdown`:

```bash
# URLs are quoted so that the shell does not try to expand the `?` character

# Pre-trained on MolPCBA (default)
gdown "https://drive.google.com/file/d/1ztPwnRIAKyuJr0q6hMzehfR8HNwuBf-E/view?usp=share_link" -O pretrained_models/ --fuzzy

# Pre-trained on ZINC
gdown "https://drive.google.com/file/d/1dtS35aRAnXjxBl3eKFKy_E_j7obeTwle/view?usp=share_link" -O pretrained_models/ --fuzzy

# Pre-trained on PCQM4Mv2
gdown "https://drive.google.com/file/d/1QMPPixkodMCg9nnpIgw-WQfyecf5Gjv0/view?usp=share_link" -O pretrained_models/ --fuzzy

# Pre-trained on GEOM
gdown "https://drive.google.com/file/d/1ZAPKuVb40zhyi0op4a-fwKw_JnvDTPtg/view?usp=share_link" -O pretrained_models/ --fuzzy

# Pre-trained on ChEMBL
gdown "https://drive.google.com/file/d/12L-wi-Ak3EAaY4dvX2IWaB90SBoSJO_a/view?usp=share_link" -O pretrained_models/ --fuzzy
```

You can also pre-train the GPSE model from scratch using the provided configs, for example:

```bash
python main.py --cfg configs/pretrain/gpse_molpcba.yaml
```

After pre-training finishes, you need to manually move the checkpointed model to the `pretrained_models/` directory.
The checkpoint is saved at `results/gpse_molpcba/<seed>/ckpt/<best_epoch>.pt`, where `<seed>` is the random seed
of the run (0 by default) and `<best_epoch>` is the best epoch number. Only one checkpoint file is kept, and it
corresponds to the best epoch.
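
The manual move can also be scripted. Below is a minimal sketch, assuming the run used the default seed 0 and that the target file should be named `gpse_molpcba.pt`. `promote_checkpoint` is a hypothetical helper for illustration, not a function provided by this codebase.

```python
import shutil
from pathlib import Path


def promote_checkpoint(run_dir, dest):
    """Copy the single checkpoint found under <run_dir>/ckpt/ to dest."""
    ckpts = sorted(Path(run_dir, "ckpt").glob("*.pt"))
    if len(ckpts) != 1:
        # The run is expected to leave exactly one (best-epoch) checkpoint.
        raise RuntimeError(f"expected exactly one checkpoint, found {len(ckpts)}")
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(ckpts[0], dest)
    return dest


if __name__ == "__main__":
    # Seed 0 is the default; the destination file name is an assumption.
    promote_checkpoint("results/gpse_molpcba/0", "pretrained_models/gpse_molpcba.pt")
```

The helper copies rather than moves, so the original file under `results/` stays in place.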

### Run downstream evaluations

Once the pre-trained model `gpse_molpcba.pt` is in place, you can run downstream evaluations for models that
use `GPSE`-encoded features. For example, to run the `ZINC` benchmark:

```bash
python main.py --cfg configs/mol_bench/zinc-GPS+GPSE.yaml
```

You can also execute a batch of runs using the run scripts provided under `run/`. For example, to run all molecular
property prediction benchmarks (`ZINC-subset`, `PCQM4Mv2-subset`, `ogbg-molhiv`, and `ogbg-molpcba`):

```bash
sh run/mol_bench.sh
```
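
If you prefer to drive the runs from Python, for example to select only some configs, a loop along these lines is one option. This sketch does not reproduce what `run/mol_bench.sh` does. It assumes that every YAML file under `configs/mol_bench/` is a standalone config for `main.py`, and `build_commands` is a hypothetical helper.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(config_dir, pattern="*.yaml"):
    """Return one `python main.py --cfg <file>` command per matching config."""
    configs = sorted(Path(config_dir).glob(pattern))
    return [[sys.executable, "main.py", "--cfg", str(cfg)] for cfg in configs]


if __name__ == "__main__":
    # Assumed layout: one config file per benchmark under configs/mol_bench/.
    for cmd in build_commands("configs/mol_bench"):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # stop at the first failing run
```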

## Generating embedding visualizations

This section describes how to generate the embedding PCA plots in Appendix E, Fig. E2.
The plots show how random initial node features break symmetries in otherwise 1-WL indistinguishable graphs.
By default, the input features used to generate the embeddings are drawn from a random normal distribution in `gnn_encoder.py` (see line 32).
To compare against identical input features (e.g. all ones), change that line to return an `np.ones` array of size `(n, dim_in)` instead of calling `np.random.normal`.
Running the command below once with and once without this change produces two `.pt` files containing the embeddings.
The code and further instructions for generating the visualizations are in `viz/wl_viz.ipynb`.

```bash
python viz/wl_test.py --cfg experiments/wl_bench/toywl-GPS+GNNPE_v9.yaml
```
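
The symmetry-breaking effect can be illustrated without the GPSE code. The sketch below is a toy stand-in, not the actual encoder: it assumes scalar node features and a plain sum-aggregation update, and compares a hexagon with two disjoint triangles, a standard pair of 1-WL indistinguishable graphs. All names in it are illustrative.

```python
import random


def cycle(nodes):
    """Adjacency list of a simple cycle over the given node ids."""
    n = len(nodes)
    return {nodes[i]: [nodes[i - 1], nodes[(i + 1) % n]] for i in range(n)}


# Two 2-regular graphs on six nodes that 1-WL cannot tell apart.
HEXAGON = cycle([0, 1, 2, 3, 4, 5])
TWO_TRIANGLES = {**cycle([0, 1, 2]), **cycle([3, 4, 5])}


def embed(adj, features, rounds=2):
    """Toy message passing: each round adds the neighbours' values to a node's own."""
    h = dict(features)
    for _ in range(rounds):
        h = {v: h[v] + sum(h[u] for u in adj[v]) for v in adj}
    # Sorted node values act as a permutation-invariant graph signature.
    return sorted(round(x, 6) for x in h.values())


# Identical inputs (analogous to np.ones) versus random normal inputs.
ones = {v: 1.0 for v in range(6)}
rng = random.Random(0)
noise = {v: rng.gauss(0.0, 1.0) for v in range(6)}

same_with_ones = embed(HEXAGON, ones) == embed(TWO_TRIANGLES, ones)
same_with_noise = embed(HEXAGON, noise) == embed(TWO_TRIANGLES, noise)
print("identical inputs give the same signature:", same_with_ones)
print("random inputs give the same signature:", same_with_noise)
```

With identical inputs, every node in both graphs receives the same value in each round, so the two signatures coincide. With random inputs, the values depend on which nodes are actually connected, which is what lets the two graphs be told apart.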
