# MolPILE dataset

A large-scale, diverse and curated dataset for molecular representation learning and 
pretraining ML models.

## Initial setup

Install:
- Python 3.11
- uv
- make
- aria2
- ripgrep
- unzip

Then, run `make setup`.
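
Before running `make setup`, it can help to confirm that the command-line tools are on `PATH`. The snippet below is a minimal sketch, not part of the repository. The executable names are assumptions based on how these packages are usually installed (aria2 provides `aria2c`, ripgrep provides `rg`), and `find_missing_tools` is a hypothetical helper.

```python
import shutil

# Package name -> executable name (assumed; adjust if your system differs).
REQUIRED_TOOLS = {
    "uv": "uv",
    "make": "make",
    "aria2": "aria2c",
    "ripgrep": "rg",
    "unzip": "unzip",
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the package names whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    print("Missing tools:", ", ".join(missing) if missing else "none")
```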

## Running pipelines

From a terminal, run `python main_molpile.py`.

If you want to use the PyCharm `Run` command, make sure to turn on the `Emulate terminal`
option in the run configuration. This ensures that console output is rendered
properly.

## Training Mol2Vec

Note that this requires a lot of RAM (at least ~300 GB) and many CPU cores
(training takes ~24h on 128 cores).
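
A quick way to see whether a machine is in the right range is to query its core count and total RAM. This is a rough sketch under stated assumptions: it relies on `os.sysconf`, which is available on Linux and most Unix systems but not on Windows, and the helpers `total_ram_gb` and `has_enough_ram` are hypothetical names, not part of this repository.

```python
import os


def total_ram_gb():
    """Return total physical RAM in GB, or None if it cannot be determined."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        num_pages = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing (e.g. Windows) or the names are unsupported.
        return None
    if page_size <= 0 or num_pages <= 0:
        return None
    return page_size * num_pages / 1e9


def has_enough_ram(required_gb, available_gb):
    """True only if the RAM amount is known and meets the requirement."""
    return available_gb is not None and available_gb >= required_gb


ram_gb = total_ram_gb()
ram_text = "unknown" if ram_gb is None else f"{ram_gb:.0f} GB"
print(f"CPU cores: {os.cpu_count()}, total RAM: {ram_text}")
print("Enough RAM for Mol2Vec (~300 GB):", has_enough_ram(300, ram_gb))
```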

First, create the MolPILE dataset.

Create the corpus of ECFP invariant texts:
```commandline
python mol2vec/create_corpus.py
```

Train Mol2Vec embeddings:
```commandline
python mol2vec/train.py
```

## Training ChemBERTa

Note that this requires a lot of RAM (at least ~100 GB) and GPU memory. It also
takes a long time: tokenization alone takes ~8h (timings for the remaining steps are still to be added).

First, create the MolPILE dataset.

Train the tokenizer:
```commandline
python chemberta/train_tokenizer.py
```

Tokenize dataset:
```commandline
python chemberta/tokenize_dataset.py
```

Train the ChemBERTa MLM model:
```commandline
python chemberta/train_mlm.py
```

## Evaluating models

MoleculeNet and TDC datasets are downloaded automatically by scikit-fingerprints.
ApisTox is included as small files. WelQrate datasets must be downloaded manually from
[the official website](http://www.welqrate.org/): put the CSV files into
`chemberta/welqrate_datasets` and the `.pt` files into
`chemberta/welqrate_datasets/scaffold_split_idxs`.
Repeat the same setup for Mol2Vec.
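
Since the WelQrate files are placed by hand, a small check of the directory layout can catch mistakes before a benchmark fails. The following is a minimal sketch: `check_welqrate_layout` is a hypothetical helper, and it only lists what it finds, because the expected file names are not specified here.

```python
from pathlib import Path


def check_welqrate_layout(root):
    """Return (CSV file names, .pt file names) found in the expected places."""
    root = Path(root)
    split_dir = root / "scaffold_split_idxs"
    csv_files = sorted(p.name for p in root.glob("*.csv")) if root.is_dir() else []
    pt_files = sorted(p.name for p in split_dir.glob("*.pt")) if split_dir.is_dir() else []
    return csv_files, pt_files


if __name__ == "__main__":
    csv_files, pt_files = check_welqrate_layout("chemberta/welqrate_datasets")
    print(f"Found {len(csv_files)} CSV files and {len(pt_files)} .pt split files")
```

If either count is zero, the files are likely in the wrong directory.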

Then run the appropriate benchmark scripts from the project root, e.g. `python chemberta/benchmark_apistox.py`.
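
To run several benchmarks in one go, a small driver script can be used. This is only a sketch: it assumes that all benchmark scripts follow the `benchmark_*.py` naming seen in the example above, which is not confirmed here, and `find_benchmarks` and `run_benchmarks` are hypothetical helpers.

```python
import subprocess
import sys
from pathlib import Path


def find_benchmarks(model_dir):
    """List scripts matching the assumed `benchmark_*.py` pattern, sorted by name."""
    return sorted(Path(model_dir).glob("benchmark_*.py"))


def run_benchmarks(model_dir):
    """Run each discovered script; call this from the project root."""
    for script in find_benchmarks(model_dir):
        print(f"Running {script}")
        # check=True stops at the first failing benchmark.
        subprocess.run([sys.executable, str(script)], check=True)


if __name__ == "__main__":
    run_benchmarks("chemberta")
```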
