# Replication guide
## Environment setup
Create a virtual environment using your preferred tool:
```
python -m venv new_venv_name
source new_venv_name/bin/activate
pip install -r requirements.txt
```

## Data preparation
Datasets hosted on Hugging Face are handled automatically. For the USPTO datasets, first download `uspto_mixed.pickle` and `uspto_50.pickle` from the open dataset (https://az.app.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq/folder/144882141119), save them to `./asset/`, and then run:

```
python create_uspto_dataset.py
```
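Before running the script, it can help to confirm that both pickle files are in place. The following is a minimal sketch, assuming the files sit directly under `./asset/`; the helper `missing_uspto_files` is hypothetical and not part of the repository:

```python
from pathlib import Path

# File names taken from the download step above.
REQUIRED_FILES = ("uspto_mixed.pickle", "uspto_50.pickle")


def missing_uspto_files(asset_dir="./asset"):
    """Return the required USPTO pickle files that are absent from asset_dir."""
    root = Path(asset_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_uspto_files()
    if missing:
        print("Missing from ./asset/:", ", ".join(missing))
    else:
        print("Both USPTO files found; ready to run create_uspto_dataset.py")
```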

## System preparation
Depending on your system, you may want to initialize Accelerate first:
```
accelerate config
```
You can use `./ds_config.json` if asked to provide a DeepSpeed config file.

## Supervised Experiment

To run supervised training, either for baseline training or to produce the finetuned base model used for RTRL, use the following command (adjust `--num_processes` and `--gpu_ids` to the GPUs you want to use):
```
accelerate launch --num_processes=4 --gpu_ids=0,1,2,3 train.py --config config/train_config.yaml
```
All settings should be specified in the config file. Most entries are self-explanatory; the following options need explanation:

`base_model_name`: you can use any Hugging Face model here, but for the models in the paper select from the following: {OpenDFM/ChemDFM-v1.5-8B, Qwen/Qwen3-8B, meta-llama/Meta-Llama-3-8B-Instruct (for Mol-Instruction)}

`model_mol_type`: SMILES for OpenDFM/ChemDFM-v1.5-8B and Qwen/Qwen3-8B, SELFIES for Mol-Instruction.

`use_chemdfm`: set to true when using OpenDFM/ChemDFM-v1.5-8B and false otherwise.

`target_datasets`: select from {chebi, lm, ./asset/uspto_50_dataset, ./asset/uspto_mixed_dataset}. You can also specify a split: "./asset/uspto_50_dataset,train" means the train split of the USPTO 50 dataset.

`dataset_limits`: the number of training examples sampled in one epoch.

`tasks`: all tasks can be found in `./data/prompt_templates.json`. Tasks ending with "chemdfm" are for ChemDFM, Qwen, and GPT; all others are for Mol-Instruction.

`load_directory`: a directory containing `lora_hist.json`. This file stores the sequence of LoRA adapters to be loaded onto the base model. For example, to run Mol-Instruction, you need to specify:
```
load_directory: "./hf_lora_dir/mol_instruct_lora"
```
For other models, you can leave it as null when training from scratch. To further finetune a model after SFT or RL, point `load_directory` at the directory saved by the earlier training run.
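The `target_datasets` entries described above combine a dataset name with an optional split after a comma. The following is a minimal sketch of how such an entry could be parsed; `parse_dataset_spec` is a hypothetical helper for illustration, the repository's own parsing may differ, and returning `None` for an omitted split is an assumption:

```python
def parse_dataset_spec(spec):
    """Split 'name,split' into (name, split); split is None when omitted."""
    name, sep, split = spec.partition(",")
    name = name.strip()
    # An empty split after the comma is treated the same as no split at all.
    split = split.strip() if sep else None
    return name, (split or None)


print(parse_dataset_spec("./asset/uspto_50_dataset,train"))
# ('./asset/uspto_50_dataset', 'train')
print(parse_dataset_spec("chebi"))
# ('chebi', None)
```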

## RTRL

You can run RTRL in several modes, and you can experiment with each of them.

Judge and Generator Server mode:
```
sh ./run_rl_training_accelerate_policy.sh
```
In this mode, the judge and the policy model each run as a separate process. This is the mode used in the paper. 8 GPUs are recommended; if you have fewer, you need to modify the GPU configuration in `run_rl_training_accelerate_policy.sh`.

Judge Server mode:
```
sh ./run_rl_training_accelerate.sh
```

Use this mode if you have, say, 4 GPUs. Only the judge runs in vLLM as a separate process; the remaining 3 GPUs are used for both training and completion generation.

In-process mode:

```
accelerate launch --num_processes=4 --gpu_ids=0,1,2,3 rl_train.py --config config/rl_config.yaml
```

This mode is not recommended: it is very slow and might need 80GB of vRAM to run.


RTRL training is also specified in a config file, such as `./config/rl_config.yaml`. Most of the arguments are the same as those in the supervised setting. The following are specific to RTRL:

`rl_task_type`: suppose the round trip is A->B->A. This should be B2A. The available options can be found in `rl_train_vllm_accelerate.py`.

`rl_task`: should be compatible with `rl_task_type`. For example, `cap2mol` is compatible with the `molecule` task, since the `molecule` task translates a caption to a molecule.

`tasks`: the forward task.


## Save and load model

A model must be loaded for inference, and RTRL can load a finetuned model to improve it further. We implement this by loading a sequence of LoRA adapters.

After each training run, the LoRA adapter finetuned in that run is saved to `./mollora/$TIMESTAMP`.

The experiment is saved to `./{molfinetune,molrl}/$TIMESTAMP`. A `lora_hist.json` file in this directory records `./mollora/$TIMESTAMP`.

When you want to further finetune this model, set:
```
load_directory: ./{molfinetune,molrl}/$TIMESTAMP
```
Our script will automatically load the LoRA, merge it, and create a new LoRA adapter on top of it. After the new training is finished, the new adapter will be in `./mollora/$NEW_TIMESTAMP`, and the `lora_hist.json` file under `./{molfinetune,molrl}/$NEW_TIMESTAMP` will look like:
```json
[
    "./mollora/$TIMESTAMP",
    "./mollora/$NEW_TIMESTAMP"
]
```
You can continue to append new LoRA adapters in this way and finetune the model further.
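Because the adapter history is an ordered JSON list, it is easy to inspect before loading. The following is a minimal sketch, assuming `lora_hist.json` holds a flat list of adapter paths; `read_lora_history` and `extended_history` are hypothetical helpers, not functions from the repository, which maintains this file itself during training:

```python
import json
import tempfile
from pathlib import Path


def read_lora_history(load_directory):
    """Return the ordered list of adapter paths stored in lora_hist.json."""
    with open(Path(load_directory) / "lora_hist.json") as f:
        return json.load(f)


def extended_history(load_directory, new_adapter):
    """Return the history a follow-up run would record: old adapters, then the new one."""
    return read_lora_history(load_directory) + [new_adapter]


if __name__ == "__main__":
    # Demonstrate on a throwaway directory instead of a real experiment folder.
    with tempfile.TemporaryDirectory() as run_dir:
        Path(run_dir, "lora_hist.json").write_text(json.dumps(["./mollora/run_1"]))
        print(extended_history(run_dir, "./mollora/run_2"))
```

The adapters are listed oldest first, which matches the order in which they are described as being loaded and merged onto the base model.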

For RL, you can load a separate generator/policy and judge via `load_directory` and `judge_dir`.

## Inference

Inference command:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 OMP_NUM_THREADS=1 python inference.py --config ./config/inference_config.yaml
```

Remember to specify the split in `inference_config.yaml`.

Each inference result is printed, and the results are saved locally as a Hugging Face dataset in `./hf_local_data/$TASK_NAME/$BASE_MODEL_NAME/$TIMESTAMP`. The generated results are in the column `gen_xx`; for example, molecules are in `gen_mol`.
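Since every run writes to a new timestamped directory, locating the latest output can be scripted. The following is a minimal sketch, assuming the directory layout described above and that the most recently modified directory is the latest run; `latest_run_dir` is a hypothetical helper, not part of the repository:

```python
from pathlib import Path


def latest_run_dir(task_name, model_name, root="./hf_local_data"):
    """Return the most recently modified run directory, or None if there is none."""
    parent = Path(root) / task_name / model_name
    if not parent.is_dir():
        return None
    runs = [p for p in parent.iterdir() if p.is_dir()]
    # Modification time is used because the timestamp format is not assumed here.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    print(latest_run_dir("caption_chemdfm", "ChemDFM-v1.5-8B"))
```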

## Use synthetic data

To use synthetic data, you need to rename the generated column, for example `gen_mol` to `mol`, so that the script recognizes it as the target molecule column. You can do this by specifying `dataset_processing` as a preprocessing step in the config.

```
dataset_processing:
  "./hf_local_data/caption_chemdfm/ChemDFM-v1.5-8B/$TIMESTAMP": "swap_gen_mol"
```

A full list of implemented preprocessing steps is in `./data/data_loader.py`.
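To illustrate what such a preprocessing step amounts to, here is a sketch on a plain dictionary of columns. It is not the repository's `swap_gen_mol` implementation; `promote_generated_column` is a hypothetical name, and overwriting any existing `mol` column is an assumption:

```python
def promote_generated_column(columns, generated="gen_mol", target="mol"):
    """Return a copy of columns in which the generated column becomes the target column."""
    if generated not in columns:
        raise KeyError(f"column {generated!r} not found")
    # Copy every column except the generated one, then store it under the target name.
    result = {name: values for name, values in columns.items() if name != generated}
    result[target] = columns[generated]
    return result


data = {"caption": ["an alcohol"], "mol": ["C"], "gen_mol": ["CCO"]}
print(promote_generated_column(data))
# {'caption': ['an alcohol'], 'mol': ['CCO']}
```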