# Environment
To set up the environment for MOLLEO+, run
```conda env create -f environment.yml``` in the base directory.
We currently support only Linux, as certain packages in this environment conflict with the channels of other platforms (e.g. OSX).

# MOLLEO+

This repository is built off of the original MOLLEO code: https://github.com/zoom-wang112358/MOLLEO. All credit goes to the original authors.

We currently support only single-objective optimization; the original MOLLEO ```multi_objective``` directory is included for reference and for any experimentation. We plan to incorporate multi-objective optimization with Boltz-2 and BindingDB data very soon.

## Run
By default, our Boltz-2 script supports only c-MET and BRD4. To run inference on a different protein target, open ```single_objective/main/boltz.py``` and modify the if-else block starting at line 20 to include your protein name and amino acid sequence.
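
The kind of change involved can be sketched as follows. This is an illustration only and does not reproduce the code in ```boltz.py```: the function name `get_sequence`, the target name `MY_TARGET`, and the placeholder strings are all hypothetical, and the real amino acid sequences must come from your own source.

```python
# Hypothetical sketch of an if-else target lookup. Function name, target
# strings, and placeholder sequences are assumptions, not the boltz.py code.
def get_sequence(protein_name: str) -> str:
    """Return the amino acid sequence registered for a target name."""
    if protein_name == "c-MET":
        return "<c-MET amino acid sequence>"
    elif protein_name == "BRD4":
        return "<BRD4 amino acid sequence>"
    elif protein_name == "MY_TARGET":  # new branch added for your own target
        return "<your amino acid sequence>"
    else:
        # Fail loudly so an unregistered target is noticed before a long run.
        raise ValueError(f"Unsupported protein target: {protein_name}")
```

Whatever name you add in the new branch is the name you would then pass to ```--oracles```.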

To run MOLLEO+, first ```cd single_objective```. Then, run:
```
python run.py molleo --mol_lm [LLM] --oracles [protein_name] --run_name [run_name]
```
For the ```[LLM]``` option, we support: 

- ```GPT-4```: uses GPT-4.1-mini by default. To use it, create a .env file in the home directory and place your OpenAI API key inside as GPT_KEY=...
- ```custom```: uses an open-source model from HuggingFace or a local path for inference. Uses Llama-3.-8B-Instruct by default. To change the model, open ```single_objective/main/molleo/run.py``` and modify the model path on line 84.
- ```BioT5```: the LLM used in the original MOLLEO project.
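
For reference, a .env file of the form described above is plain text with one `KEY=VALUE` pair per line. The following is a minimal sketch of how such a file can be parsed using only the standard library. It is not necessarily how this repository loads the key; the helper names `parse_env_text` and `load_env_file` are hypothetical.

```python
from pathlib import Path


def parse_env_text(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Ignore empty lines, comment lines, and lines without a separator.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_env_file(path=".env"):
    """Read a .env file from disk and return its KEY=VALUE pairs."""
    return parse_env_text(Path(path).read_text())
```

Calling `load_env_file(".env").get("GPT_KEY")` returns the stored key, or `None` if the entry is missing, which is a quick way to confirm the file is set up before starting a run.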

Then enter your desired ```[protein_name]``` and ```[run_name]``` to begin the process. Make sure you have modified ```boltz.py``` as described above if you are using a protein target other than the defaults.

If you would like to utilize our detailed run analysis script `analyze_results.py`, run this command instead:
```
python run.py molleo --mol_lm [LLM] --oracles [protein_name] --run_name [run_name] > logs/[run_name].txt 2>&1
```
This saves a full record of the run to a text file, which the analysis script uses for reference.
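
Note that the shell redirection fails if the ```logs``` directory does not exist yet. As a sketch of an alternative, the same invocation can be launched from Python, creating the directory first and sending both output streams to the log file. The helper names `build_command` and `run_with_log` are hypothetical and not part of this repository; the command-line flags are the ones shown above.

```python
import subprocess
import sys
from pathlib import Path


def build_command(mol_lm, oracle, run_name):
    """Assemble the run.py invocation shown above as an argument list."""
    return [
        sys.executable, "run.py", "molleo",
        "--mol_lm", mol_lm,
        "--oracles", oracle,
        "--run_name", run_name,
    ]


def run_with_log(mol_lm, oracle, run_name, log_dir="logs"):
    """Run MOLLEO+ and write stdout and stderr to logs/<run_name>.txt."""
    log_path = Path(log_dir) / f"{run_name}.txt"
    # Create the log directory up front so opening the file cannot fail.
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "w") as log_file:
        # stderr=subprocess.STDOUT mirrors the shell's `2>&1`.
        result = subprocess.run(
            build_command(mol_lm, oracle, run_name),
            stdout=log_file,
            stderr=subprocess.STDOUT,
        )
    return result.returncode
```

This would be called from inside ```single_objective```, for example `run_with_log("GPT-4", "BRD4", "my_run")`.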

# BindingDB
We include all of the scripts used to form both the BindingDB starting population and our synthetic fine-tuning dataset. To use them, first ```cd bindingdb```. You must also download the BindingDB database, which occupies around 6 GB of storage, from https://www.bindingdb.org/rwd/bind/chemsearch/marvin/Download.jsp. Look for the BindingDB_All_[date]_tsv.zip download under "All data in BindingDB". Unzip the download, place the extracted file in the ```bindingdb``` directory, and ensure its name is ```BindingDB_All.tsv```.
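
Because the file is large, it is best read row by row rather than loaded into memory at once. The sketch below shows one way to check that a target is present in the download by streaming the TSV and collecting ligand SMILES for a UniProt entry name. It is not the logic used by the scripts in this directory; the helper name `iter_ligands` is hypothetical, and the two column headers are assumptions that should be checked against the header row of your download.

```python
import csv

# Assumed column headers; verify them against your BindingDB_All.tsv.
TARGET_COLUMN = "UniProt (SwissProt) Entry Name of Target Chain"
SMILES_COLUMN = "Ligand SMILES"


def iter_ligands(lines, entry_name,
                 target_column=TARGET_COLUMN, smiles_column=SMILES_COLUMN):
    """Yield ligand SMILES for rows whose target matches entry_name.

    `lines` is any iterable of text lines, such as an open file handle,
    so rows are processed one at a time.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip rows for other targets and rows with no SMILES recorded.
        if row.get(target_column) == entry_name and row.get(smiles_column):
            yield row[smiles_column]
```

For example, after opening the file with `open("BindingDB_All.tsv", newline="", encoding="utf-8")`, passing the handle to `iter_ligands(handle, "BRD4_HUMAN")` yields the matching ligands.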

We include 3 main scripts:

1. ```create_chains.py```: This script creates the ligand chains used to form our synthetic dataset. Start by putting the UniProt entry names (e.g. MET_HUMAN, BRD4_HUMAN) of all the targets you want to generate datasets for (it may be just one) into the ```bindingdb/targets.txt``` text file. Then run ```python create_chains.py``` to create text files containing the chains, which are placed in ```bindingdb/chains```.
2. ```create_data_script.py```: After you have generated the ligand chains, the next step is to feed them into an LLM for synthetic generation. This script is a wrapper around ```create_data.py``` that automates the process for many targets. Once your chain files exist, you can just call ```python create_data_script.py```, which generates an SFT dataset for each target that has a chain file. As before, you need an API key in your .env file; we use GPT-4.1-nano by default, but you can customize the LLM used to generate the dataset. Note that this process consumes a lot of tokens, so a cheaper model is recommended. The dataset is written to the ```bindingdb/data``` folder, alongside a supplementary ```bindingdb/summary_data``` that contains all of the past ligand summary generations; we store these for potential future use in fine-tuning (not currently employed in this project).
3. ```create_sample.py```: This script generates the BindingDB starting population samples for use in MOLLEO+. As with ```create_chains.py```, place your desired targets in ```bindingdb/targets.txt```, then run ```python create_sample.py```, which puts the successfully created samples in ```bindingdb/samples```. To use these in MOLLEO+, place the file in ```single_objective/data/```. Ensure that the file name exactly matches the name you pass to ```--oracles``` when running MOLLEO+ (this should also be the name you register in ```boltz.py```).
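
The final placement step can be sketched as a small helper that refuses to copy a sample whose name does not match the oracle name. This is an illustration, not part of the repository: the function names are hypothetical, and it assumes that "matches" means the file name without its extension equals the oracle name, which you should confirm against how MOLLEO+ loads the file.

```python
import shutil
from pathlib import Path


def name_matches_oracle(sample_path, oracle_name):
    """Check that the file name (minus extension) equals the oracle name."""
    return Path(sample_path).stem == oracle_name


def install_sample(sample_path, oracle_name, data_dir="single_objective/data"):
    """Copy a sample file into the MOLLEO+ data directory after a name check."""
    sample_path = Path(sample_path)
    if not name_matches_oracle(sample_path, oracle_name):
        raise ValueError(
            f"Sample file '{sample_path.name}' does not match oracle '{oracle_name}'"
        )
    dest_dir = Path(data_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Keep the original file name so it still matches the oracle name.
    return Path(shutil.copy2(sample_path, dest_dir / sample_path.name))
```

The comparison is case-sensitive, so a sample named for `brd4` would be rejected when the oracle name is `BRD4`.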