# Lift Your Molecules: Molecular Graph Generation in Latent Euclidean Space

Code accompanying our ICML 2024 submission for review.

## Setup
```
conda create -c conda-forge -n sycodiff rdkit
conda activate sycodiff
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
```

## Data Preparation
To prepare the data for training, run the following command:
```
python qm9/data/prepare/download.py --datadir data/ --dataname zinc250k --only_explicit_H --prop penalized_logP qed drd2 tpsa
```

This performs the following main steps:
* download the data as .txt files of SMILES strings, if they are not already present
* for each molecule in the dataset, compute the synthetic coordinates needed to run our model
* process the data and create one .npz file for each split (train/valid/test) that contains all the info needed for training (graph/3D information, properties, etc.)
* compute the data statistics needed for training and sampling
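
After the command finishes, it can help to check what ended up in each split file. The following is a minimal sketch, assuming only that the splits are standard NumPy `.npz` archives. The array names inside the archives are not documented here, so the hypothetical helper `summarize_npz` simply lists whatever it finds.

```python
import numpy as np


def summarize_npz(path, allow_pickle=False):
    """Return {array_name: (shape, dtype_string)} for every array in an .npz archive.

    `summarize_npz` is an illustrative helper, not part of this repository.
    """
    summary = {}
    # allow_pickle=False refuses pickled object arrays; pass True only for
    # files you trust, e.g. if a split stores Python objects.
    with np.load(path, allow_pickle=allow_pickle) as archive:
        for name in archive.files:
            array = archive[name]
            summary[name] = (array.shape, str(array.dtype))
    return summary
```

For example, `summarize_npz("data/zinc250k/train.npz")` would print-ready summarize the training split, where that path is an assumed location and should be replaced with wherever the script actually writes its output.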

Supported datasets are `zinc250k`, `guacamol`, and `qm9`.

## Training
### Autoencoder + Diffusion Model Training
```
python main_geoldm.py --exp_name sycodiff_training --batch_size 64 --dataset zinc250k --test_epochs 10 --diffusion_steps 1000 --train_diffusion True --train_regressor False --n_epochs 1000 --n_layers 9 --nf 256 --inv_sublayers 2 --tanh False --prodigyopt True --prodigy_lr 1.0 --prodigy_setting -1 --encoder_early_stopping --d_coef 0.1 --wandb_usr wandb_username --ema_decay 0.9999 --n_stability_samples 1000 --n_layers_decoder 0 --n_layers_encoder 1 --inv_sublayers_vae 2 --hidden_nf_vae 128 --latent_nf 2
```
This will first train an autoencoder on the reconstruction task between molecular graphs and latent 3D point clouds, as described in the paper. Then, with the autoencoder frozen, diffusion model training starts automatically and trains an EDM model.

### Property Regressor Training
```
python main_geoldm.py --exp_name property_regressor_training --batch_size 64 --dataset zinc250k --test_epochs 1 --patience 50 --train_diffusion False --train_regressor True --n_epochs 500 --ae_path outputs/sycodiff_training --n_layers 4 --tanh False --prodigyopt True --prodigy_lr 1.0 --prodigy_setting -1 --d_coef 0.1 --regression_target penalized_logP qed drd2 tpsa --max_step_regressor 1000 --wandb_usr wandb_username --ema_decay 0.9999 --inv_sublayers_vae 2 --hidden_nf_vae 256
```
This will train a regressor model to predict 4 properties: penalized_logP, qed, drd2, and tpsa. Note that this requires the trained autoencoder from the previous step (given by the argument `ae_path`). The regressor will be used in the conditional generation and optimization tasks.

## Evaluation
### Unconditional Generation
To get the GuacaMol metrics (FCD and KL), run the following command:
```
python guacamol_evaluation/distribution_learning.py --exp_folder outputs/sycodiff_training --dist_file data/zinc250k/smiles/train.txt --batch_size 100 --number_samples 10000 --ckpt_prefix last_ --diffusion_steps 1000
```
This will run the GuacaMol benchmark on the unconditional model. It will sample 10,000 molecules from this model and use the training data as the reference set to compute the GuacaMol metrics. The results will be saved in a JSON file.

### Conditional Generation
```
python guacamol_evaluation/evaluate_conditional_generation.py --exp_folder outputs/sycodiff_training --dist_file data/zinc250k/smiles/train.txt --batch_size 100 --number_samples 10000 --ckpt_prefix last_  --regressor_guidance --regressor_exp_folder outputs/property_regressor_training --conditional_prop penalized_logP --guidance_scale 1.5 --guidance_loss l2
```
This will generate 10,000 molecules conditioned on different target values and compute the MAE between the target values and the actual values of the generated molecules. To run it on the other properties, simply change the `conditional_prop` argument.
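
The reported MAE is the mean absolute difference between each target value and the property value measured on the corresponding generated molecule. A minimal sketch of that calculation, assuming the two lists are already aligned one to one (the function name and inputs are illustrative, not the evaluation script's own code):

```python
def mean_absolute_error(targets, achieved):
    """Mean absolute difference between target and achieved property values."""
    if len(targets) != len(achieved):
        raise ValueError("targets and achieved must have the same length")
    if not targets:
        raise ValueError("at least one value is required")
    # Average the per-molecule absolute errors.
    return sum(abs(t - a) for t, a in zip(targets, achieved)) / len(targets)
```

A lower value means the generated molecules land closer to the property values they were conditioned on.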

### Constrained Optimization
```
python optimization/eval_optimization.py --exp_name sycodiff_training --batch_size 100 --t_optimize 600 --guidance_exp_name property_regressor_training --guidance_scale 0.75 --ckpt_prefix last_ --optimization_prop penalized_logP --similarity_threshold 0.4 --guidance_linear_schedule 
```
This will run our optimization algorithm on the test molecules and save the optimized molecules that satisfy the similarity constraints along with their improvement values.
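
To make the filtering step concrete, here is a rough sketch of keeping only optimized molecules that stay similar enough to their starting point. It assumes that similarity is the Tanimoto coefficient over fingerprint bits, that the constraint is `similarity >= threshold`, and that improvement is the optimized property value minus the original one. None of these details are confirmed by this README, and the function names are hypothetical. Plain Python sets of "on" bit indices stand in for real fingerprints, which would normally come from a cheminformatics toolkit such as RDKit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # two empty fingerprints: treat as dissimilar
    return len(bits_a & bits_b) / union


def keep_valid_optimizations(pairs, threshold=0.4):
    """Filter (orig_bits, opt_bits, orig_prop, opt_prop) tuples by similarity.

    Returns a list of (similarity, improvement) for pairs meeting the threshold.
    """
    kept = []
    for orig_bits, opt_bits, orig_prop, opt_prop in pairs:
        similarity = tanimoto(orig_bits, opt_bits)
        if similarity >= threshold:  # assumed form of the constraint
            kept.append((similarity, opt_prop - orig_prop))
    return kept
```

The default `threshold=0.4` mirrors the `--similarity_threshold 0.4` value in the command above.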