# M$^{4}$olGen: Multi-Agent, Multi-Stage Molecular Generation under Precise Multi-Property Constraints

**ICLR 2026 submission artifact (code)**

**TL;DR.** M$^{4}$olGen performs **precise, property-constrained** molecule generation with a two-stage design:
1) a retrieval-augmented, fragment-level **prototype reasoner** (LLM agents), and  
2) a **GRPO-trained** fragment optimizer for controlled multi-hop refinement.

This repo contains data prep utilities, training scripts (SFT + GRPO), and a Stage-I demo runner.  
> **Note:** Full datasets and model checkpoints will be released after review (size limits for submission).

---

## What’s in this repo?

```
.
├── run.py                         # Stage-I demo runner (LLM multi-agent planner)
├── decomposer.py                  # Query → numeric constraints
├── retriever.py                   # Target-aware retrieval from molecule DB
├── action.py                      # Action agent: fragment-level edit proposals
├── evaluator.py                   # RDKit-based property + validity feedback
├── optimizer.py                   # Fragment utilities / optimizer helpers
├── eval.py                        # (dev) scripts for analysis; not required for demo
├── data/
│   ├── decomposing_data.py        # Build BRICS fragments + connectivity
│   ├── neighbour_faster.py        # Neighbor-pair mining (1-hop edits)
│   └── text_steps.py              # Text prompt construction helpers
├── training/
│   ├── train_sft.py               # Supervised finetuning (LoRA) on QA-style data
│   └── train_grpo.py              # GRPO training with multi-property rewards
└── docs/
    └── assets/
        └── training_report.jpg    # Reward traces (inserted below)
```

---

## Environment

We tested with **Python 3.10+** and **CUDA-compatible PyTorch**.

```bash
conda create -n m4olgen python=3.10 -y
conda activate m4olgen

pip install -r requirements.txt
```

> If you plan to call commercial LLMs in `run.py`, set your provider keys (e.g., `OPENAI_API_KEY`) in your environment.

---

## Quick start (Stage-I demo)

`run.py` executes the **retrieval-augmented reasoner** that iteratively edits fragments to meet target properties.

```bash
python run.py   --query "Please help me generate a new valid molecule with QED=0.72, LogP=1.50, MW=300."   --db_path /path/to/your/molecule_db.parquet   --retrieval_eps_qed 0.05 --retrieval_eps_logp 0.5 --retrieval_eps_mw 25   --max_iterations 6 --topk 30
```

Typical console output (abbrev.):

```
Constraints: {'qed': 0.72, 'logp': 1.50, 'mw': 300}
Relevant molecule count: 27
[Step 1] candidate=... feedback=...
...
✅ SUCCESS at step 5: CCNC1=CNNC=CC(O)CC(C)C1Cc1ncccc1
```

**What it does**
- Parses the query → numeric targets (QED, LogP, MW).
- Retrieves near-target references (tolerance defaults: QED ±0.05, LogP ±0.5, MW ±25 Da).
- Runs an **action agent** that proposes BRICS-level edits with property feedback from RDKit until success or the iteration limit.

> Stage-II (multi-hop GRPO optimizer) is provided in `training/train_grpo.py` 

---

## Training

### 1) Supervised Finetuning (SFT)

```bash
python training/train_sft.py   --model_name OpenDFM/ChemDFM-v1.5-8B   --dataset_name Alan123/qa_dataset_with_qed_logp_mw_grpo   --output_dir ./outputs/sft-lora-qed-logp-mw   --epochs 0.4 --batch_size 4 --grad_accum 4 --lr 5e-5 --max_seq_len 1024   --report_to wandb
```

- Adds special tokens: `<SMILES> </SMILES> <QED> </QED> <LogP> </LogP> <MW> </MW>`.
- Formats each example as a **question–answer** turn with the system prompt.

### 2) GRPO Training (Stage-II policy)

```bash
python training/train_grpo.py   --base_model OpenDFM/ChemDFM-v1.5-8B   --sft_checkpoint ./outputs/sft-lora-qed-logp-mw/checkpoint-XXXX   --output_dir ./outputs/grpo-qed-logp-mw   --per_device_train_batch_size 4   --grad_accum 8   --learning_rate 3e-5   --max_steps 40_000   --report_to wandb
```

**Reward design (high-level):**
- `strict_format_reward`: format + operation validity.
- `multi_property_reward`: scaled distance to QED/LogP/MW targets with direction penalties.
- `repetition_penalty_reward`: discourages repeated SMILES/fragments.

---

## Training report

We monitor GRPO reward components during training. Example traces:

<p align="center">
  <img src="docs/assets/training_report.jpg" width="880" alt="GRPO reward curves: format, repetition penalty, and multi-property components (mean & std)">
</p>

- **Multi-property reward (mean)** steadily increases, indicating improved alignment with numeric targets.
- **Repetition penalty (mean)** moves toward zero while **std** stabilizes, suggesting healthier exploration.
- **Strict-format** metrics stabilize as the model learns well-formed edit statements + valid SMILES.

> The composite reward is the sum of these components (with weights); higher is better.

---

## Data & checkpoints

- **Will be released after review** (size limits for submission):
  - ~2.95M molecules with BRICS fragments + connectivity and RDKit properties,
  - ~1.17M neighbor pairs (single fragment edit) with measured property deltas,
  - SFT and GRPO LoRA adapters / merged checkpoints.

Scripts under `data/` reproduce the preprocessing pipeline.