# Differentiable Molecular Graph for Molecule Optimization 



## 1. conda 



### environment
```bash
source activate differentiable_molecular_graph
```


## 2. data preparation


### 2.1 Data for training GNN

We use the `ZINC` database, which contains around 250K drug-like molecules. 


### 2.2 Data for inference (i.e., molecule to be optimized)



### 2.3 Oracle and Tasks 

Molecular properties are evaluated by oracles. The following oracles are supported: 

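* `DRD2`: dopamine receptor D2. 
* `JNK3`: c-Jun N-terminal kinase 3. 
* `GSK3B`: glycogen synthase kinase 3 beta. 
* `LogP`: octanol-water partition coefficient. 
* `QED`: Quantitative Estimate of Drug-likeness. 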


### 2.4 Get Labels

We use the oracles to evaluate each molecule's properties; the resulting scores serve as the labels. 

- input
  - `raw_data/zinc.tab`: all the SMILES strings in ZINC, around 250K. 

- output
  - `data/zinc_*.txt`: `*` can be QED, LogP, JNK3, DRD2, GSK3B, etc. 


```bash  
python src/data_zinc.py 
```
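
The labelling step can be pictured as mapping each SMILES string to an oracle score and writing one pair per line. The sketch below is only an illustration and not the contents of `src/data_zinc.py`: the tab-separated `SMILES<TAB>score` layout, the function name `label_smiles`, and the toy oracle are all assumptions.

```python
from typing import Callable, Iterable, List


def label_smiles(smiles_iter: Iterable[str],
                 oracle: Callable[[str], float]) -> List[str]:
    """Return one 'SMILES<TAB>score' line per non-empty input SMILES.

    The tab-separated layout is an assumption; the real label files
    may use a different format.
    """
    lines = []
    for smiles in smiles_iter:
        smiles = smiles.strip()
        if not smiles:  # skip blank lines
            continue
        score = oracle(smiles)
        lines.append(f"{smiles}\t{score:.4f}")
    return lines


def toy_oracle(smiles: str) -> float:
    """Hypothetical stand-in oracle: fraction of 'C'/'c' characters."""
    return smiles.upper().count("C") / len(smiles)


labelled = label_smiles(["CCO", "", "c1ccccc1"], toy_oracle)
print("\n".join(labelled))
```

In practice `toy_oracle` would be replaced by a real property oracle such as QED or LogP.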

### 2.5 Generate Vocabulary 
In this project, the basic unit is the substructure: a frequent atom or ring. The vocabulary is the set of all such substructures. 

- substructure
  - the basic unit of a molecule tree; either a ring or an atom. 

- input
  - `raw_data/zinc.tab`: all the SMILES strings in ZINC, around 250K. 

- output
  - `data/all_vocabulary.txt`: all the substructures found in ZINC.   
  - `data/selected_vocabulary.txt`: the vocabulary, i.e., the frequent substructures. 


```bash 
python src/data_generate_vocabulary.py
```
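
Selecting the frequent substructures amounts to counting occurrences and applying a threshold. The following is a minimal sketch under stated assumptions: the substructures of each molecule are taken as already extracted, and `build_vocabulary` and `min_count` are hypothetical names that do not come from `src/data_generate_vocabulary.py`.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_vocabulary(substructures_per_molecule: Iterable[List[str]],
                     min_count: int) -> Dict[str, List[str]]:
    """Count substructures over all molecules and keep the frequent ones.

    'all' lists every substructure seen; 'selected' lists those that
    occur at least `min_count` times (hypothetical threshold).
    """
    counts = Counter()
    for substructures in substructures_per_molecule:
        counts.update(substructures)
    all_vocab = sorted(counts)
    selected = sorted(s for s, c in counts.items() if c >= min_count)
    return {"all": all_vocab, "selected": selected}


# Toy input: each inner list holds the substructures of one molecule.
toy_molecules = [
    ["C", "N", "C1=CC=CC=C1"],
    ["C", "O"],
    ["C", "C1=CC=CC=C1", "S"],
]
vocab = build_vocabulary(toy_molecules, min_count=2)
print(vocab["selected"])
```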

### 2.6 data cleaning  

We remove molecules that contain any substructure not in the vocabulary. 


- input 
  - `data/selected_vocabulary.txt`: vocabulary 
  - `raw_data/zinc.tab`: all the SMILES strings in ZINC
  - `data/zinc_QED.txt` 


- output
  - `data/zinc_QED_clean.txt`


```bash 
python src/data_cleaning.py 
```
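
The cleaning rule is a simple membership filter. The sketch below assumes a helper that returns the substructures of a molecule; `keep_in_vocabulary` and `get_substructures` are hypothetical names, and a lookup table stands in for the real decomposition.

```python
from typing import Callable, Iterable, List, Set


def keep_in_vocabulary(smiles_list: Iterable[str],
                       vocabulary: Set[str],
                       get_substructures: Callable[[str], List[str]]) -> List[str]:
    """Keep only molecules whose substructures are all in the vocabulary."""
    kept = []
    for smiles in smiles_list:
        if all(s in vocabulary for s in get_substructures(smiles)):
            kept.append(smiles)
    return kept


# Toy decomposition table standing in for the real substructure extraction.
toy_substructures = {"CCO": ["C", "O"], "CCS": ["C", "S"]}
clean = keep_in_vocabulary(["CCO", "CCS"], {"C", "O"}, toy_substructures.__getitem__)
print(clean)
```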







## 3. Train GNN

- **Training data** consists of `(X, y)` pairs, where `X` is a molecule and `y` is its label. 


- **Model** 
  - `y = GNN(X)`
  - We use a `GCN` (graph convolutional network) as the GNN architecture. Other variants include `GIN`, `GAT`, and `MPN`. 


- input 
  - `data/zinc_QED_clean.txt`

- output 
  - `save_model/model_epoch_*.ckpt`: saved model. 
  - `figure/`: plots of the validation loss as a function of training iterations. 


```bash 
python src/train_gnn.py | tee log/train_gnn.log 
```
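
To make `y = GNN(X)` concrete, here is a minimal NumPy sketch of one standard GCN propagation step, `ReLU(D^-1/2 (A + I) D^-1/2 H W)`, followed by a graph-level readout. It is not the model in `src/train_gnn.py`: the mean-pooling readout, the layer sizes, and the names `gcn_layer` and `predict_property` are assumptions made for illustration.

```python
import numpy as np


def gcn_layer(adjacency, features, weight):
    """One GCN step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    deg = a_hat.sum(axis=1)                         # degrees including self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt      # symmetric normalisation
    return np.maximum(norm_adj @ features @ weight, 0.0)


def predict_property(adjacency, features, w1, w2):
    """Scalar prediction: one GCN layer, mean-pool over nodes, linear head."""
    hidden = gcn_layer(adjacency, features, w1)
    graph_embedding = hidden.mean(axis=0)           # assumed readout
    return float(graph_embedding @ w2)


# Toy 3-node path graph with one-hot node features (untrained weights).
rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
X = np.eye(3)
y_hat = predict_property(A, X, rng.normal(size=(3, 4)), rng.normal(size=4))
print(y_hat)
```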




## 4. code interpretation

### 4.1 DPP 

See `dpp.py` for the implementation of the DPP (determinantal point process), which is used to select molecules after oracle screening. 
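
For readers unfamiliar with DPP-based selection, the sketch below shows a simple greedy variant: it repeatedly adds the item that maximises the log-determinant of the kernel submatrix, which favours high scores while penalising near-duplicates. It is an illustration under assumptions, not the code in `dpp.py`: the function `greedy_dpp`, its arguments, and the kernel `score_i * similarity_ij * score_j` are hypothetical choices.

```python
import numpy as np


def greedy_dpp(scores, similarity, num_return):
    """Greedily pick indices maximising log det of the DPP kernel submatrix.

    kernel[i, j] = scores[i] * similarity[i, j] * scores[j]
    """
    scores = np.asarray(scores, dtype=float)
    sim = np.asarray(similarity, dtype=float)
    kernel = scores[:, None] * sim * scores[None, :]
    selected = []
    for _ in range(min(num_return, len(scores))):
        best_idx, best_logdet = None, -np.inf
        for i in range(len(scores)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_idx, best_logdet = i, logdet
        if best_idx is None:  # no candidate keeps the determinant positive
            break
        selected.append(best_idx)
    return selected


# Items 0 and 1 are near-duplicates; item 2 scores lower but is dissimilar.
toy_scores = [0.9, 0.9, 0.5]
toy_similarity = [[1.0, 0.99, 0.0],
                  [0.99, 1.0, 0.0],
                  [0.0, 0.0, 1.0]]
picked = greedy_dpp(toy_scores, toy_similarity, num_return=2)
print(picked)
```

In the toy example the second pick is the dissimilar item rather than the near-duplicate of the first, even though the duplicate has the higher score.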

#### 4.1.1 denovo 
`inference_denovo.py`


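```python
def distribution_learning(*args, **kwargs):  # actual parameters elided in this outline
    ...
    # score candidate molecules with the oracle (oracle_screening)
    ...
    # select the next population with DPP
    current_set = dpp(smiles_score_lst=smiles_score_lst, num_return=population_size)
    ...
```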


#### 4.1.2 goal-directed

`inference_utils.py`

```python
def optimize_single_molecule_all_generations():

    oracle_screening...
    dpp 
    ...
```

### 4.2 differentiable molecular graph 

#### 4.2.1 generate differentiable molecular graph 
See `chemutils.py`.
```python

def smiles2graph():
    ...

def smiles2differentiable_graph():
    ...

def differentiable_graph2smiles():
    ...
    # editing operations used here:
    #   add_atom_at_position
    #   add_fragment_at_position
    #   delete_substructure_at_idx
    #   replace = delete + add
```

#### 4.2.2 update differentiable molecular graph

See `inference_utils.py`.
```python
def optimize_single_molecule_one_iterate(smiles):
    differentiable_graph = smiles2differentiable_graph(smiles)
    optimized_differentiable_graph = update_molecule(gnn, differentiable_graph)
    new_smiles = differentiable_graph2smiles(optimized_differentiable_graph)

```


See `module.py`. 
```python
class GCN:

  def update_molecule():
    ... 
```
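
The idea behind updating a differentiable graph can be illustrated with projected gradient ascent: keep the predictor frozen, treat the graph's continuous weights as the variables, and move them in the direction that raises the predicted property. The sketch below is a toy under loud assumptions: a sigmoid of a weighted sum stands in for the trained GNN, the "graph" is reduced to a vector of node weights in `[0, 1]`, and `surrogate_score` and `update_weights` are hypothetical names unrelated to `GCN.update_molecule`.

```python
import numpy as np


def surrogate_score(node_weights, theta):
    """Toy frozen predictor (stands in for the GNN): sigmoid(w . theta)."""
    return 1.0 / (1.0 + np.exp(-(node_weights @ theta)))


def update_weights(node_weights, theta, lr=0.5, steps=50):
    """Projected gradient ascent on the node weights; theta stays fixed."""
    w = np.array(node_weights, dtype=float)
    for _ in range(steps):
        y = surrogate_score(w, theta)
        grad = y * (1.0 - y) * theta          # d sigmoid(w . theta) / d w
        w = np.clip(w + lr * grad, 0.0, 1.0)  # ascent step, projected onto [0, 1]
    return w


theta = np.array([2.0, -1.5, 0.5])   # frozen "model" parameters (toy values)
w0 = np.array([0.5, 0.5, 0.5])       # initial continuous node weights
w_opt = update_weights(w0, theta)
keep = w_opt > 0.5                   # discretise: which nodes to keep
print(w_opt, keep)
```

The final thresholding step mirrors the need to turn the optimized continuous graph back into a discrete molecule.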

















## 5. Task Pipeline 

### 5.1 QED+LogP+JNK+GSK `qedlogpjnkgsk`

#### 5.1.1 data
- input
  - `data/clean_zinc.txt`
  - `data/zinc_*.txt`: * can be {QED, LogP, JNK3, GSK3B}. 
- output 
  - `data/qed_logp_jnk_gsk.txt`

```bash
python src/data_labelling_qedlogpjnkgsk.py 
```
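
Building a multi-property dataset requires joining the per-property label files on the SMILES string. The sketch below shows one way to do that join in memory; it assumes each label file has already been parsed into a `{smiles: score}` dictionary, and `merge_labels` is a hypothetical name, not a function from `src/data_labelling_qedlogpjnkgsk.py`.

```python
from typing import Dict, List, Tuple


def merge_labels(per_property: Dict[str, Dict[str, float]]
                 ) -> List[Tuple[str, Dict[str, float]]]:
    """Keep only the SMILES that have a label for every property."""
    names = list(per_property)
    common = set.intersection(*(set(per_property[n]) for n in names))
    return [(smi, {n: per_property[n][smi] for n in names})
            for smi in sorted(common)]


# Toy labels; the values are made up for illustration.
toy_labels = {
    "QED": {"CCO": 0.4, "CCN": 0.3},
    "LogP": {"CCO": 0.1, "CCC": 1.2},
}
merged = merge_labels(toy_labels)
print(merged)
```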

#### 5.1.2 train
- input 
  - `data/qed_logp_jnk_gsk.txt`
- output 
  - `save_model/qed_logp_jnk_gsk_*.ckpt`

```bash
python src/train_qed_logp_jnk_gsk.py 
```

#### 5.1.3 inference 
- input 
  - `save_model/`
  - `data/`: test data 
- output 
  - `result/denovo_qedlogpjnkgsk.pkl`

```bash 
python src/inference_denovo_qedlogpjnkgsk.py
```

#### 5.1.4 evaluation

```bash 
python src/evaluate_denovo_qedlogpjnkgsk.py 
```

#### 5.1.5 case study 

```bash
python src/casestudy_qedlogpjnkgsk.py 
```



### 5.2 QED `qed` 


#### train gnn
```bash
python src/train_qed.py 
```
output: `save_model/QED_epoch_0_iter_75900_validloss_0.5631.ckpt`

#### denovo 
```bash
python src/denovo_qed.py 
```
output: `result/denovo_from_NC1\=NC\=CC\=N1_qed.pkl` 


#### statistics 
```bash
python src/evaluate_denovo_qed.py 
```

#### learning curve 
```bash
python src/result_analysis_qed.py 
```

#### case study 
```bash
python src/casestudy_qed.py 
```





### 5.3 LogP `logp` 

#### data (0,1)
```bash
python src/data_cleaning.py 
```

output: `data/zinc_LogP_clean2.txt`

#### train gnn
```bash
python src/train_logp.py 
```

output: `save_model/LogP*`, `save_model/LogP2*`

#### denovo 
```bash
python src/denovo_logp.py
```


#### statistics 
```bash
python src/evaluate_denovo_logp.py 
```

#### learning curve 
```bash
python src/result_analysis_logp.py 
```

#### case study 
```bash
python src/casestudy_logp.py 
```





### 5.4 JNK `jnk` 


#### train gnn
```bash
python src/train_jnk.py 
```

#### denovo 
```bash
python src/denovo_jnk.py
```


#### statistics 
```bash
python src/evaluate_denovo_jnk.py 
```

top-10
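
A common summary statistic for de novo design is the mean score of the `k` best generated molecules. Assuming that is what "top-10" denotes here, it can be computed as in the sketch below; `top_k_mean` is a hypothetical helper and not part of `src/evaluate_denovo_jnk.py`.

```python
from typing import Sequence


def top_k_mean(scores: Sequence[float], k: int = 10) -> float:
    """Mean of the k highest scores (or of all scores if fewer than k)."""
    if not scores:
        raise ValueError("scores must not be empty")
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


toy_scores_eval = [0.25, 1.0, 0.5, 0.75]
print(top_k_mean(toy_scores_eval, k=2))
```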

#### learning curve 
```bash
python src/result_analysis_jnk.py 
```

#### case study 
```bash
python src/casestudy_jnk.py 
```











### 5.5 JNK+GSK `jnkgsk`

```bash
python src/denovo_jnkgsk.py 
```
output: `result/denovo_jnkgsk.pkl`

#### statistics 
```bash
python src/evaluate_denovo_jnkgsk.py 
```

#### case study 
```bash
python src/casestudy_jnkgsk.py 
```


### 5.6 Oracle-limited JNK+GSK `jnkgsk2k` 


#### train gnn
```bash
python src/train_jnkgsk2k.py 
```

#### denovo inference 
```bash 
python src/denovo_jnkgsk2k.py
```

#### evaluate 

```bash 
python src/evaluate_denovo_jnkgsk2k.py 
```
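
An oracle-limited setting caps how many times the property oracle may be queried. The wrapper below is a minimal sketch of that idea: it caches results so repeated molecules are free and refuses new queries once the budget is spent. `BudgetedOracle` is a hypothetical class, and the budget used by the `jnkgsk2k` scripts is not stated here (the `2k` suffix may indicate 2,000 calls, but that is a guess).

```python
from typing import Callable, Dict


class BudgetedOracle:
    """Wrap an oracle with a cache and a hard cap on distinct queries."""

    def __init__(self, oracle: Callable[[str], float], budget: int):
        self.oracle = oracle
        self.budget = budget
        self.cache: Dict[str, float] = {}

    @property
    def calls_used(self) -> int:
        # Each cached entry corresponds to exactly one real oracle call.
        return len(self.cache)

    def __call__(self, smiles: str) -> float:
        if smiles in self.cache:          # repeated molecules cost nothing
            return self.cache[smiles]
        if self.calls_used >= self.budget:
            raise RuntimeError("oracle budget exhausted")
        score = self.oracle(smiles)
        self.cache[smiles] = score
        return score


# Toy oracle: molecule length as a stand-in score.
limited = BudgetedOracle(lambda s: float(len(s)), budget=2)
limited("C")
limited("C")     # served from the cache
limited("CC")
print(limited.calls_used)
```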













