# What's in this repository
* configs: 
    * dataset
        * qm9_datset
    * model
        * mlp.yaml - not functional due to sizing for batch, do not use
        * molnet.yaml
        * transformer.yaml
    * training
        * default_training.yaml
    * config.yaml - specifies the default dataset, model, and training configs (yaml files) to use. These can be overridden from the command line, e.g. "model=molnet"
* datasets.py - code related to dataset classes
* models.py - contains model class definitions
* utils.py - contains utilities of all kinds
* training.py - contains train and eval scripts
* main.py - driver for processing configs, setting up datasets/model/optimizer, calling training code, and saving plots

# Setup notes
Set the environment variable DATA_DIR to the directory where your datasets are stored. Each dataset is assumed to live in a subfolder named after the dataset (see the code; this is easy to tweak as needed, e.g. by adding a new field to the dataset configs). wandb logging is on by default and will prompt you to log in, but you can turn it off via +training.use_wandb=False
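
As a rough sketch of the folder convention above, the dataset directory could be resolved as follows. The helper name `resolve_dataset_dir` is hypothetical and not taken from the repository; only `DATA_DIR` and the "folder named after the dataset" rule come from the notes above.

```python
import os
from pathlib import Path


def resolve_dataset_dir(dataset_name: str) -> Path:
    """Return $DATA_DIR/<dataset_name> (hypothetical helper, not the repository's code)."""
    data_dir = os.environ.get("DATA_DIR")
    if data_dir is None:
        raise RuntimeError("Set the DATA_DIR environment variable first.")
    # The dataset is assumed to live in a subfolder named after the dataset.
    return Path(data_dir) / dataset_name
```

Failing early when `DATA_DIR` is unset gives a clearer message than a missing-file error deep inside the dataset code.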

# Config notes
We use config files to set up runs (see the Hydra documentation [here](https://hydra.cc/docs/tutorials/basic/your_first_app/simple_cli/)).
The most important concept is that there is a hierarchical system of configs, which all get combined via Hydra. For example, we choose separate configs for training, for the dataset, and for the model, which are all referenced by filename in config.yaml. config.yaml describes the defaults, but they can be overridden from the command line. For example, to change the number of training epochs, a field that already exists in the defaults, use training.epochs=10. For something like save_dir that doesn't exist yet, use a plus sign: +save_dir='run_name'. Two pluses ("++") make a new entry if it doesn't exist, or override it if it does: "++training.epochs=20".
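
The three override forms can be illustrated with a small plain-Python sketch. This is not Hydra's implementation: `apply_override` is a hypothetical function that mimics the rules on nested dicts, keeps every value as a string (Hydra parses types), and ignores config-group selection such as `model=molnet`.

```python
def apply_override(cfg: dict, override: str) -> None:
    """Apply one 'key=value', '+key=value', or '++key=value' override in place."""
    lhs, value = override.split("=", 1)
    if lhs.startswith("++"):
        mode, path = "force", lhs[2:]   # add the key or override it
    elif lhs.startswith("+"):
        mode, path = "add", lhs[1:]     # key must not exist yet
    else:
        mode, path = "set", lhs         # key must already exist
    *parents, leaf = path.split(".")
    node = cfg
    for key in parents:
        if key not in node:
            if mode == "set":
                raise KeyError(f"'{path}' is not in the config; use '+' or '++'")
            node[key] = {}
        node = node[key]
    exists = leaf in node
    if mode == "set" and not exists:
        raise KeyError(f"'{path}' is not in the config; use '+' or '++'")
    if mode == "add" and exists:
        raise KeyError(f"'{path}' already exists; drop the '+' or use '++'")
    node[leaf] = value


cfg = {"training": {"epochs": "10"}}
apply_override(cfg, "training.epochs=20")          # existing key: plain override
apply_override(cfg, "+save_dir=run_name")          # new key: needs '+'
apply_override(cfg, "++training.use_wandb=False")  # '++' works in both cases
```

In practice `++` is the safest default when you are unsure whether a field exists, which is why most commands in this README use it.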

# Example function calls

```python main.py +save_dir='run_name' model=molnet ++training.epochs=20 ++training.use_print_every=5```

```python main.py +save_dir='run_name' ++dataset.task='regression'```

# Guide to adding a new dataset

There are several steps to adding a new dataset in the current repository's setup.

1. You will first need to define your dataset class in datasets.py, and import it into main.py. It should have basic functionality: `__getitem__(self, idx)` should return a pair (datapoint, idx).

2. Then, you need to update the function get_dataset in main.py, which will involve your new dataset class - but that's not all! 

You also need to define the associated operators to canonicalize and label your data. 

3. The function "canonicalize_operator" takes as input data and idx, and outputs the canonicalized data. 

4. The function "label_operator" takes as input just a datapoint, and outputs its label. (For standard torchvision datasets, this operator is trivial, as the datapoint already includes the label as the second element of a tuple, but for molecular datasets the label may need to be extracted in a dataset-dependent way.) 

5. The function "transform_operator" takes as input just a datapoint (again, however it is returned by the base dataset), and transforms it. This allows for both invariance and equivariance, depending on how it is defined. You can define these three functions in utils.py, and simply import them to main.py. As an example, you can look at ToyCircleDataset.

Almost there!

6. The next step is to add a model in models.py that can process your dataset. The class name of this model will be used in the config. 

7. If you would like to use the task-dependent metric, you also need to add (1) a model that learns to output a canonicalization c(x), (2) a binary classification model that takes as input pairs (c(x), y), and/or (3) a prediction model that takes as input just c(x) and tries to predict y.

Once this is implemented, there's one final step:

8. You will need to make new configs for the dataset, the model, and potentially the training (all in the configs directory). 
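
Steps 1 and 3 through 5 above can be sketched as follows. Everything here is illustrative: `ToyPointsDataset` and its 2D-rotation operators are hypothetical, not the repository's `ToyCircleDataset`, and a real dataset class would presumably subclass a PyTorch `Dataset` rather than being plain Python.

```python
import math


class ToyPointsDataset:
    """Hypothetical dataset of labeled 2D points; each raw item is ((x, y), label)."""

    def __init__(self, points, labels):
        self.items = list(zip(points, labels))

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        # Step 1: return the pair (datapoint, idx).
        return self.items[idx], idx


def canonicalize_operator(data, idx):
    """Step 3: rotate the point onto the positive x-axis (one possible canonical form)."""
    (x, y), label = data
    return (math.hypot(x, y), 0.0), label


def label_operator(datapoint):
    """Step 4: extract the label from a datapoint as returned by the base dataset."""
    _, label = datapoint
    return label


def transform_operator(datapoint, angle=math.pi / 2):
    """Step 5: rotate the point by `angle`; the label is unchanged (an invariant task)."""
    (x, y), label = datapoint
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y), label
```

For an equivariant task, `transform_operator` would also transform the label rather than pass it through.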

**Todo:** streamline the process for adding a new dataset so that it's not scattered across functions!

# Master list of function calls to reproduce experiments in the paper 

## Swiss Roll

### Ordinary prediction task (binary classification)

Note that the train- and test-time augmentations can be toggled on or off, as shown below. 

```python main.py --config-name=swiss_roll_classification ++dataset.prob=1.0 ++dataset.augment_args.do_augment=True ++dataset.augment_args.transform=1.0 ++dataset.augment_args.train=True ++dataset.augment_args.val=False ++dataset.augment_args.test=False ++save_dir=ignore```
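
If you toggle augmentation often, the overrides can be generated rather than typed by hand. The sketch below is a hypothetical convenience helper, not part of the repository. It only reuses the override keys from the command above, assumes `do_augment` acts as a master switch that should be on whenever any split is augmented, and leaves out `++dataset.augment_args.transform`.

```python
import shlex


def augment_overrides(train: bool, val: bool, test: bool) -> list:
    """Return Hydra overrides that toggle augmentation per split."""
    # Assumption: do_augment should be True whenever any split is augmented.
    do_augment = train or val or test
    return [
        f"++dataset.augment_args.do_augment={do_augment}",
        f"++dataset.augment_args.train={train}",
        f"++dataset.augment_args.val={val}",
        f"++dataset.augment_args.test={test}",
    ]


# Train-time augmentation only, as in the Swiss Roll classification command.
command = [
    "python", "main.py", "--config-name=swiss_roll_classification",
    "++dataset.prob=1.0",
    *augment_overrides(train=True, val=False, test=False),
    "++save_dir=ignore",
]
print(shlex.join(command))
```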

### Task-Independent Detection Metric

```python main.py --config-name=swiss_roll_detection ++dataset.prob=1.0 ++dataset.augment_args.do_augment=False ++save_dir=ignore```

### Task-Dependent Detection Metric

```python main.py --config-name=swiss_roll_task_detection ++dataset.prob=1.0 ++dataset.augment_args.do_augment=False ++save_dir=ignore```

### Task-Dependent Prediction Metric

```python main.py --config-name=swiss_roll_task_direct ++dataset.prob=1.0 ++dataset.augment_args.do_augment=False ++dataset.task_dependent_args.c_args.learned=True ++save_dir=ignore```

## MNIST 

### Ordinary prediction task (classification)
As before, one can turn augmentation on/off in the same way. 
```python main.py --config-name=mnist_classification ++save_dir=ignore```

To run the group averaged model, use the config ```mnist_classification_c4_av```.

### Task-Independent Detection Metric

```python main.py --config-name=mnist_detection  ++dataset.augment_args.do_augment=False ++save_dir=ignore```

### Task-Dependent Detection Metric

```python main.py --config-name=mnist_task_detection ++dataset.augment_args.do_augment=False ++save_dir=ignore```

### Task-Dependent Prediction Metric

```python main.py --config-name=mnist_task_direct ++dataset.augment_args.do_augment=False ++dataset.task_dependent_args.c_args.learned=True ++save_dir=ignore```

## QM9

### Ordinary prediction task (regression)

Note that the model can be changed, as shown below. With the qm9_atomic dataset, we use the standard Anderson splits (i.e. from Cormorant, which were subsequently used by EDM and its follow-ups). The property to predict can be changed by setting dataset.target, as shown below.

```python main.py --config-name=qm9_regression model=e3convnet ++dataset.augment_args.do_augment=True ++dataset.augment_args.transform=1.0 ++dataset.augment_args.train=True ++dataset.augment_args.val=False ++dataset.augment_args.test=False ++dataset.name='qm9_atomic' +dataset.split_args.split_type='anderson' ++dataset.target='U0' ++save_dir=ignore```

To run the e3nn model, add ```model=e3convnet```. To run the so3 group averaged model, change the config name to ```qm9_regression_so3_ave```.

### Task-Independent Detection Metric

```python main.py --config-name=qm9_detection model=transformer ++dataset.augment_args.do_augment=False ++save_dir=ignore```

### Task-Independent Canonicalization Metric: Predict g from gx

Note that we use the ordinary QM9 dataset from torch geometric for this experiment.
```python main.py --config-name=qm9_predict_g model=transformer ++dataset.augment_args.do_augment=False ++save_dir=ignore ++dataset.split_args.split_type=ignore```

### Task-Dependent Detection Metric

```python main.py --config-name=qm9_task_detection ++dataset.augment_args.do_augment=False ++save_dir=ignore```

### Task-Dependent Prediction Metric

```python main.py --config-name=qm9_task_direct ++dataset.augment_args.do_augment=False ++dataset.task_dependent_args.c_args.learned=True ++save_dir=ignore```

## Local QM9

### Task-Independent Detection Metric

```python main.py --config-name=local_qm9_detection model=transformer ++dataset.augment_args.do_augment=False ++save_dir=ignore```

### Task-Independent Canonicalization Metric: Predict g from gx

```python main.py --config-name=local_qm9_predict_g model=transformer ++dataset.augment_args.do_augment=False ++save_dir=ignore```

## QM7b response properties

Data is based on the QM7b dataset, taken from a paper that calculated response properties at varying levels of DFT accuracy (https://www.nature.com/articles/s41597-019-0157-8). Download CCSD_daDZ.tar.gz from https://archive.materialscloud.org/record/2019.0002/v3 and untar the resulting .xyz files into a directory in $DATA_DIR. The configs assume that this directory is named qm7.
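
The untar step could also be done from Python. This is a minimal sketch under stated assumptions: `extract_qm7b` is a hypothetical helper, the archive is assumed to be already downloaded, and the target folder is `$DATA_DIR/qm7` as the configs expect. If the archive contains its own top-level folder, the .xyz files will end up one level deeper and may need to be moved.

```python
import os
import tarfile
from pathlib import Path


def extract_qm7b(archive_path: str, dataset_name: str = "qm7") -> Path:
    """Extract the archive into $DATA_DIR/<dataset_name> and return that directory."""
    target = Path(os.environ["DATA_DIR"]) / dataset_name
    target.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens a gzip-compressed tar archive for reading.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    return target
```

Usage would look like `extract_qm7b("CCSD_daDZ.tar.gz")` after setting `DATA_DIR`.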

### Regression 
```python main.py --config-name=qm7_regression ++training.use_wandb=True model=graphormer ++training.epochs=500 +save_dir=ignore ```

Note that for e3convnet you must use the separate model config titled e3convnet_qm7. Change the irreps_out argument of this model depending on whether you are predicting a scalar or a higher-order property. To use the group averaged model, change the config name to ```qm7_regression_so3_ave.yaml```. For the group averaged model, also set model_args.output_is_vector to match the output type (for example, model_args.output_is_vector: False for a scalar output).

### Task-Independent Detection Metric
```python main.py --config-name=qm7_detection ++save_dir=ignore```

### Task-Dependent Detection Metric
```python main.py --config-name=qm7_task_detection ++save_dir=ignore```

## ModelNet40

### Task-independent metric

<pre>
DATA_DIR=...  python main.py dataset=modelnet_detection model=transformer_for_modelnet training=modelnet_training ++training.epochs=30 
</pre>

### Direct prediction task-dependent metric
<pre>
DATA_DIR=...  python main.py dataset=modelnet_task_direct model=mlp_for_rot_c training=modelnet_training ++training.epochs=300 
</pre>

### Task-dependent detection metric
<pre>
DATA_DIR=...  python main.py dataset=modelnet_task_detection model=mlp_for_rot_c_y training=modelnet_training ++training.epochs=1200
</pre>

### Classification Task
<pre>
# Augmentation settings can be changed via augment_args in the dataset config.
DATA_DIR=...  python main.py dataset=modelnet_classification model=transformer_for_modelnet training=modelnet_training ++model.hidden_dim=256 ++model.num_heads=8  ++model.num_layers=6   ++training.epochs=300 ++training.batch_size=128
</pre>



## MD17
At the moment, we have only implemented the task-independent metric for this dataset. The script is for a given molecule specified with dataset.mol_name.

### Task-Independent Detection Metric
```python main.py --config-name=md17_detection ++save_dir=ignore```

## OC20
At the moment, we have only implemented the task-independent metric for this dataset. The catalyst or adsorbate can be filtered with dataset.filter_mol.

### Task-Independent Detection Metric
```python main.py --config-name=oc20_detection ++save_dir=ignore```
