<!--
 * @Date: 2022-08-06 15:13:42
 * @LastEditors: yuhhong
 * @LastEditTime: 2022-09-27 13:13:55
-->
# Data Preprocessing

In general, we preprocess the data into the following two types of files: 

- Task data -> `<task_name>_train.csv` and `<task_name>_test.csv`

- Conformers data -> `<task_name>_train_<conformers_type>.sdf` and `<task_name>_test_<conformers_type>.sdf`
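
As a quick illustration of the naming scheme above, the sketch below prints the four file names expected for one task and one conformer type. The helper name `expected_files` is hypothetical and not part of this repository, and the lowercase task name `sol` is taken from the `sol_train.csv` output used later in this document.

```bash
# Hypothetical helper (not part of the repo): list the files that the
# naming scheme implies for one task name and one conformer type.
expected_files() {
    task="$1"
    conf="$2"
    for split in train test; do
        printf '%s_%s.csv\n' "$task" "$split"            # task data
        printf '%s_%s_%s.sdf\n' "$task" "$split" "$conf" # conformers data
    done
}

expected_files sol etkdg
```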



## Classification Datasets: BBBP, Tox21, ToxCast, Sider, ClinTox, MUV, HIV

All commands are in `./experiments/process_cls.sh`. We only show examples for generating `etkdg` and `omega` conformers; you can also generate `2d` and `etkdgv3` conformers.

### 1. Task data:

```bash
python ./preprocess/preprocess_cls.py --dataset <dataset name> --path <path to the dataset>
```

### 2. Conformers data: 

```bash
python ./preprocess/gen_conformers.py --path <path to csv file> --dataset <dataset name> --conf_type <conformers type>
```
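
To generate all four conformer types for one CSV file, the command above can be wrapped in a loop. This is a minimal sketch, assuming the type names `2d`, `etkdg`, `etkdgv3`, and `omega` are passed to `--conf_type` verbatim. The dataset name `bbbp` and the CSV path are placeholders, and the `run` and `gen_all_conformers` helpers are hypothetical. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```bash
# Sketch only: the dataset name and CSV path below are placeholders.
# DRY_RUN=1 (the default here) prints each command instead of running it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Call gen_conformers.py once per conformer type for a single task CSV.
gen_all_conformers() {
    dataset="$1"
    csv_path="$2"
    for conf_type in 2d etkdg etkdgv3 omega; do
        run python ./preprocess/gen_conformers.py \
            --path "$csv_path" --dataset "$dataset" --conf_type "$conf_type"
    done
}

gen_all_conformers bbbp ./data/BBBP/bbbp_train.csv
```

Run it once for the train CSV and once for the test CSV so that both splits get conformers.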



## Regression Datasets: CCS, RT, SOL

### 1. Task data: 

**CCS (Collision Cross Section)**: 

```bash
# Register on the AllCCS website first:
# http://allccs.zhulab.cn/
python ./preprocess/download_allccs.py --user <user_name> --passw <password> --output ./data/CCS/allccs.csv
python ./preprocess/preprocess_allccs.py --input ./data/CCS/allccs.csv --output ./data/CCS/ccs_train.csv

# download the bushccs dataset here:
# https://biophysicalms.org/ccsdatabase/
python ./preprocess/preprocess_bushccs.py --input ./data/CCS/<file_name_of_download_bushccs> --allccs ./data/CCS/ccs_train.csv --output ./data/CCS/ccs_test.csv
```

**RT (Retention Time)**: 

```bash
# download the smrt dataset here: 
# https://figshare.com/articles/dataset/The_METLIN_small_molecule_dataset_for_machine_learning-based_retention_time_prediction/8038913

python ./preprocess/preprocess_rt.py --input ./data/RT/SMRT_dataset.sdf --output ./data/RT/SMRT_dataset_clean.sdf
python ./preprocess/random_split_rt.py --input ./data/RT/SMRT_dataset_clean.sdf --output_train ./data/RT/smrt_train.sdf --output_test ./data/RT/smrt_test.sdf
```

**SOL (Solubility)**: 

```bash
# download the AqSolDB dataset here:
# https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OVHAW8

python ./preprocess/preprocess_sol.py --input ./data/SOL/curated-solubility-dataset.csv --output ./data/SOL/SOL_pre.csv
python ./preprocess/random_split_sol.py --input ./data/SOL/SOL_pre.csv --output_train ./data/SOL/sol_train.csv --output_test ./data/SOL/sol_test.csv
```
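
After running the three pipelines above, a quick check that every train/test file was written can save a failed run later. The sketch below only tests that the output paths used in the commands above exist and are non-empty. The function name `check_task_data` is hypothetical, and the default `./data` root is taken from those commands.

```bash
# Hypothetical helper (not part of the repo): report which of the task data
# files named in the commands above are missing or empty under a data root.
# The return code is the number of missing files (0 means all present).
check_task_data() {
    root="${1:-./data}"
    missing=0
    for f in CCS/ccs_train.csv CCS/ccs_test.csv \
             RT/smrt_train.sdf RT/smrt_test.sdf \
             SOL/sol_train.csv SOL/sol_test.csv; do
        if [ -s "$root/$f" ]; then
            echo "ok       $root/$f"
        else
            echo "missing  $root/$f"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

check_task_data ./data || echo "Some task data files are missing or empty."
```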

### 2. Conformers data: 

The commands for generating conformers are in `./experiments/process_reg.sh`. We only show examples for `etkdg` and `omega` conformers; you can also generate `2d` and `etkdgv3` conformers.



## References

- Zhou Z, et al. Ion mobility collision cross-section atlas for known and unknown metabolite annotation in untargeted metabolomics. Nature Communications, 2020, 11(1): 1-13.
- Bush M F, et al. Collision cross sections of proteins and their complexes: a calibration framework and database for gas-phase structural biology. Analytical Chemistry, 2010, 82(22): 9557-9565.
- Domingo-Almenara X, Guijas C, Billings E, et al. The METLIN small molecule dataset for machine learning-based retention time prediction. Nature Communications, 2019, 10(1): 1-9.
- Sorkun M C, Khetan A, Er S. AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds. Scientific Data, 2019, 6(1): 1-8.
- Ruddigkeit L, Van Deursen R, Blum L C, et al. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling, 2012, 52(11): 2864-2875.
- Ramakrishnan R, Dral P O, Rupp M, et al. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 2014, 1(1): 1-7.
