# ReaDISH

Data and code for the paper "Reaction Prediction via Interaction Modeling of Symmetric Difference Shingle Sets".

## Requirements

ReaDISH is implemented in `Python 3.9.19`. The main dependencies are:

```
rdkit                2024.3.3
torch                2.3.1
wandb                0.18.6
lightning            2.3.3
pytorch-lightning    2.3.3
salesforce-lavis     1.0.2
unicore              0.0.1
unimol_tools         0.1.0.post1
scikit-learn         1.5.1
```
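
To confirm that an environment matches the versions above, a small check along the following lines may help. This is a minimal sketch and not part of the repository: `PINNED` and `check_versions` are hypothetical names, and stripping a local build suffix such as `+cu121` before comparing is an assumption about how you want to treat CUDA builds of `torch`. Depending on the Python version, package-name lookup may need adjusting (for example `unimol-tools` instead of `unimol_tools`).

```python
from importlib import metadata

# Pinned versions copied from the list above.
PINNED = {
    "rdkit": "2024.3.3",
    "torch": "2.3.1",
    "wandb": "0.18.6",
    "lightning": "2.3.3",
    "pytorch-lightning": "2.3.3",
    "salesforce-lavis": "1.0.2",
    "unicore": "0.0.1",
    "unimol_tools": "0.1.0.post1",
    "scikit-learn": "1.5.1",
}


def check_versions(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for each mismatch; found is None if missing."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            # Assumption: ignore local build tags such as "+cu121".
            found = get_version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_versions(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every listed package is installed at the pinned version.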

## Datasets

### Pre-training dataset

We pre-train on reactions filtered from USPTO and CJHIF. You can download USPTO from https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873 and CJHIF from https://github.com/jshmjs45/data_for_chem. `data/generate_shingling.py` generates shingles to accelerate training.
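
For intuition about the "symmetric difference shingle sets" in the paper title, the idea can be sketched with plain Python sets. This is only an illustration under assumptions: the shingle strings below are hand-written placeholders, `symmetric_difference_shingles` is a hypothetical helper, and the actual extraction in `data/generate_shingling.py` may differ in how shingles are defined and stored.

```python
def symmetric_difference_shingles(reactant_shingles, product_shingles):
    """Return the shingles that occur on exactly one side of the reaction."""
    return set(reactant_shingles) ^ set(product_shingles)


# Hypothetical shingles (substructure SMILES) for an esterification-like reaction.
reactants = {"CC(=O)O", "C(=O)O", "CO", "CC"}
products = {"CC(=O)OC", "C(=O)OC", "CO", "CC"}

changed = symmetric_difference_shingles(reactants, products)
print(sorted(changed))
```

Shingles shared by both sides (`CO` and `CC` here) drop out, so what remains describes the part of the molecules that changes during the reaction.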

### Downstream dataset

We fine-tune our model on seven publicly available downstream datasets across four tasks. Download the data and store it in `data`.

## Experiments

### Pre-training

Run `pretraining.py` to pre-train our model. For example,

```
python pretraining.py --pretraining --devices [0,1,2,3,4,5,6,7] --check_val_every_n_epoch 1 --max_epochs 3 --batch_size 4 --accumulate_grad_batches 8 --num_workers 16 --init_lr 5e-5 --min_lr 5e-6
```
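
When adapting the command to a different number of GPUs, it can help to keep the effective batch size in mind. The sketch below assumes that `--batch_size` is the per-device batch size and that the devices train in data-parallel mode, which is the usual behaviour in Lightning but is not confirmed here for this script; `effective_batch_size` is a hypothetical helper.

```python
def effective_batch_size(batch_size, accumulate_grad_batches, num_devices):
    """Samples contributing to one optimizer step under data-parallel training."""
    return batch_size * accumulate_grad_batches * num_devices


# Values from the pre-training command above:
# --batch_size 4, --accumulate_grad_batches 8, eight devices.
print(effective_batch_size(4, 8, 8))  # 256
```

Under the same assumption, halving the number of devices while doubling `--accumulate_grad_batches` keeps the effective batch size unchanged.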

The pre-trained model is stored in `checkpoint`. 

### Fine-tuning

Run `finetuning.py` to fine-tune ReaDISH on a given downstream dataset. For example,

```
python finetuning.py --devices [0,1,2,3] --check_val_every_n_epoch 1 --max_epochs 200 --accumulate_grad_batches 1 --batch_size 16 --num_workers 16 --init_lr 1e-3 --min_lr 1e-4 --ds_name BH --repeat_times 10 --pred_type regression --pretraining --init_checkpoint checkpoint/pretraining_epoch=01-step=00020000.ckpt
```
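
If several pre-training checkpoints exist, the value for `--init_checkpoint` can be picked programmatically. The following is a minimal sketch that assumes all checkpoint names follow the `epoch=..-step=..` pattern seen in the example above; `CKPT_PATTERN` and `latest_checkpoint` are hypothetical names, not part of the repository.

```python
import re
from pathlib import Path

# Pattern inferred from the example name pretraining_epoch=01-step=00020000.ckpt.
CKPT_PATTERN = re.compile(r"epoch=(\d+)-step=(\d+)\.ckpt$")


def latest_checkpoint(names):
    """Return the entry with the highest (epoch, step), or None if nothing matches."""
    best, best_key = None, None
    for name in names:
        match = CKPT_PATTERN.search(str(name))
        if match is None:
            continue  # skip files that do not follow the assumed naming scheme
        key = (int(match.group(1)), int(match.group(2)))
        if best_key is None or key > best_key:
            best, best_key = name, key
    return best


if __name__ == "__main__":
    # Prints None when the checkpoint directory is empty or missing.
    print(latest_checkpoint(sorted(Path("checkpoint").glob("*.ckpt"))))
```

Comparing the parsed integers rather than the raw file names avoids ordering mistakes if the zero-padding ever differs between files.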
