# MoltenFlow

Code accompanying the paper **"Property-Guided Molecular Generation and Optimization via Latent Flows"**.

MoltenFlow combines:
- a **sequence VAE** to embed discrete molecular strings into a continuous latent space,
- **property-oriented latent shaping** via an auxiliary property predictor trained on top of latents,
- a **latent flow-matching prior** learned over VAE latent codes, and
- **surrogate-gradient guidance** during latent ODE integration for conditional generation and local optimization.

---

## Quick Reference

| Stage | Config | Command |
|-------|--------|---------|
| **A**: Pretrain VAE | `configs/pretrain_vae_zinc250k.yaml` | `python -m moltenflow pretrain-vae --config configs/pretrain_vae_zinc250k.yaml` |
| **B**: Fine-tune + Flow | `configs/finetune_vae_and_train_guided_flow.yaml` | `python -m moltenflow finetune-vae --config configs/finetune_vae_and_train_guided_flow.yaml` |
| **C**: Optimization | `configs/experiments/budgeted_optimization.yaml` | `python scripts/run_budgeted_optimization.py --config configs/experiments/budgeted_optimization.yaml` |

---

## Setup

### Requirements
- **Python 3.11+**
- PyTorch (CUDA recommended)
- RDKit (for property labels and evaluation)
- SELFIES (if training on SELFIES)
- uv (package manager)

### Install dependencies
```bash
uv sync --all-extras
```

To install only the extra needed for the Bayesian optimization baselines (already covered by `--all-extras`):
```bash
uv sync --extra bo
```

---

## Data Preparation (ZINC250K)

### Dataset
MoltenFlow experiments use **ZINC250K** (~250k drug-like molecules). Models are trained on molecular strings:
- **SMILES** (default in the paper) or
- **SELFIES** (used for representation ablations and to guarantee syntactic validity)

Typical representation settings:
- Representation: SELFIES
- Max sequence length: 128
- Vocabulary size: ~111

### Property labels
For multi-objective optimization, the surrogate predicts:
- **QED** (drug-likeness, higher is better), bounded to [0, 1]
- **SAS** (synthetic accessibility, lower is better), bounded to [1, 10]
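Because the two properties point in opposite directions and live on different scales, it can help to map them onto a common "higher is better" scale before comparing molecules. The snippet below is a minimal sketch of one such scalarization, using only the bounds listed above. The function names and the equal weights are illustrative assumptions, not the objective used in the paper.

```python
QED_MIN, QED_MAX = 0.0, 1.0   # QED range stated above
SAS_MIN, SAS_MAX = 1.0, 10.0  # SAS range stated above


def rescale(value: float, low: float, high: float) -> float:
    """Clamp value to [low, high] and rescale it linearly to [0, 1]."""
    clamped = min(max(value, low), high)
    return (clamped - low) / (high - low)


def scalarized_score(qed: float, sas: float,
                     w_qed: float = 0.5, w_sas: float = 0.5) -> float:
    """Combine QED (maximize) and SAS (minimize) into one score, higher is better.

    The weights are hypothetical defaults chosen for illustration only.
    """
    qed_term = rescale(qed, QED_MIN, QED_MAX)
    sas_term = 1.0 - rescale(sas, SAS_MIN, SAS_MAX)  # flip: low SAS scores high
    return w_qed * qed_term + w_sas * sas_term
```

With equal weights, a molecule at the best corner (QED = 1, SAS = 1) scores 1.0 and one at the worst corner (QED = 0, SAS = 10) scores 0.0.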

### Download and prepare data
```bash
python scripts/prepare_zinc_250k.py
```

This downloads ZINC250K from Hugging Face and saves it to `data/raw/zinc250k.csv`.

### Preprocessing (optional)
If your workflow requires preprocessing (canonicalization, SELFIES conversion, property computation):
```bash
# Example preprocessing command (adapt to your needs)
python your_preprocess_script.py \
  --input data/raw/zinc250k.csv \
  --output data/processed/zinc250k_selfies.parquet \
  --representation selfies \
  --max_len 128
```

Expected output format:
- Molecule string (`smiles` or `selfies`)
- Property columns: `qed`, `sas`
- Train/val/test split indicators (or separate files)
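A quick sanity check on processed rows can catch format problems before training. The sketch below validates a single row, represented as a plain dictionary, against the format listed above. The `split` column name and the helper `validate_record` are assumptions for illustration; adapt them to however your preprocessing script stores the split.

```python
from typing import List, Mapping

VALID_SPLITS = {"train", "val", "test"}


def validate_record(row: Mapping[str, object]) -> List[str]:
    """Return the problems found in one processed row (empty list if it looks fine)."""
    problems = []
    # A row needs at least one molecule string column.
    if not (row.get("smiles") or row.get("selfies")):
        problems.append("missing molecule string ('smiles' or 'selfies')")
    # Property columns must be numeric and inside their documented bounds.
    for name, low, high in (("qed", 0.0, 1.0), ("sas", 1.0, 10.0)):
        value = row.get(name)
        if not isinstance(value, (int, float)) or not (low <= value <= high):
            problems.append(f"'{name}' missing or outside [{low}, {high}]")
    # Hypothetical split indicator column.
    if row.get("split") not in VALID_SPLITS:
        problems.append("'split' must be one of train/val/test")
    return problems
```

If the splits are stored as separate files instead, drop the `split` check and run the remaining checks per file.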

---

## Training

Training proceeds in **two stages** (A and B). Stage C then uses the trained models to run optimization experiments.

### Stage A: Pretrain the VAE

**Config:** `configs/pretrain_vae_zinc250k.yaml`

This stage trains a Transformer-based sequence VAE on molecular strings using the standard VAE objective (reconstruction + KL).
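For reference, the two terms of the standard VAE objective can be written down in a few lines. This is a framework-free sketch for a single sequence with a diagonal Gaussian posterior and a standard normal prior; the function names and the `beta` weight are illustrative assumptions, and the actual training code in this repository may organize the loss differently.

```python
import math
from typing import Sequence


def kl_to_standard_normal(mu: Sequence[float], logvar: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def reconstruction_nll(token_log_probs: Sequence[float]) -> float:
    """Negative log-likelihood of the target tokens under the decoder."""
    return -sum(token_log_probs)


def vae_loss(token_log_probs: Sequence[float],
             mu: Sequence[float],
             logvar: Sequence[float],
             beta: float = 1.0) -> float:
    """Reconstruction term plus a beta-weighted KL term (beta is a hypothetical knob)."""
    return reconstruction_nll(token_log_probs) + beta * kl_to_standard_normal(mu, logvar)
```

The KL term vanishes exactly when the posterior equals the prior (`mu = 0`, `logvar = 0`), which is a convenient check when debugging.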

**Run:**
```bash
uv run python -m moltenflow pretrain-vae --config configs/pretrain_vae_zinc250k.yaml
```

**Artifacts produced:**
- VAE checkpoint (encoder + decoder)
- Tokenizer/vocabulary artifacts
- Training logs

**Paper defaults** (see config for authoritative values):
| Parameter | Value |
|-----------|-------|
| Epochs | 150 |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| Batch size | 256 |

---

### Stage B: Fine-tune VAE + Train Guided Latent Flow

**Config:** `configs/finetune_vae_and_train_guided_flow.yaml`

This stage does two things:
1. **Property-oriented fine-tuning**: Trains a property predictor on pooled VAE latents and allows gradients to shape the encoder.
2. **Flow matching**: Trains a time-conditioned vector field in latent space to model the distribution of valid latents.
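To make the flow-matching step concrete, the sketch below builds one regression target for a vector field. It assumes the common straight-line path between a Gaussian noise sample at t=0 and a data latent at t=1; the path actually used in the paper is not stated here, so treat this as one plausible instance. `flow_matching_pair`, `flow_matching_loss`, and `predict_velocity` are hypothetical names.

```python
import random
from typing import Callable, List, Sequence, Tuple


def flow_matching_pair(z0: Sequence[float], z1: Sequence[float],
                       t: float) -> Tuple[List[float], List[float]]:
    """Point on the straight path from z0 (noise) to z1 (data latent) and its velocity."""
    z_t = [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
    target_velocity = [b - a for a, b in zip(z0, z1)]  # constant along a straight path
    return z_t, target_velocity


def flow_matching_loss(predict_velocity: Callable[[Sequence[float], float], Sequence[float]],
                       z1: Sequence[float],
                       rng: random.Random) -> float:
    """Single-sample mean squared error between predicted and target velocity."""
    z0 = [rng.gauss(0.0, 1.0) for _ in z1]  # noise endpoint at t = 0
    t = rng.random()                        # time drawn uniformly from [0, 1)
    z_t, target = flow_matching_pair(z0, z1, t)
    predicted = predict_velocity(z_t, t)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(z1)
```

In practice `predict_velocity` would be the time-conditioned network, and the loss would be averaged over a batch of latents produced by the fine-tuned encoder.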

**Run:**
```bash
uv run python -m moltenflow finetune-vae --config configs/finetune_vae_and_train_guided_flow.yaml
```

> **Note:** Update `vae_pretrain.checkpoint_path` in the config to point to your Stage A checkpoint.

**Artifacts produced:**
- Fine-tuned VAE + property head checkpoint
- Flow model checkpoint
- Training logs

**Paper defaults** (see config for authoritative values):
| Component | Parameter | Value |
|-----------|-----------|-------|
| VAE fine-tune | Batch size | 1024 |
| VAE fine-tune | Learning rate | 1e-3 |
| VAE fine-tune | Property weight (λ) | 1.0 |
| Flow training | Learning rate | 2e-4 |
| Flow training | Batch size | 1024 |

---

### Stage C: Budgeted Optimization Experiments

After training, run multi-objective optimization experiments to compare MoltenFlow against baselines.

**Configs:**
- Single run: `configs/experiments/budgeted_optimization.yaml`
- Multi-seed comparison: `configs/experiments/multi_seed_comparison.yaml`

#### Single Optimization Run

Run a single optimization experiment with a specific method:

```bash
# MoltenFlow (flow + surrogate guidance)
python scripts/run_budgeted_optimization.py \
    --config configs/experiments/budgeted_optimization.yaml \
    --method moltenflow \
    --budget 100 \
    --seed 42
```

**Available methods:**

| Method | Description | Dependencies |
|--------|-------------|--------------|
| `moltenflow` | Guided flow optimization (default) | - |
| `gradient_ascent` | Pure gradient ascent without the flow (ablation) | - |
| `bo_mogp` | Bayesian optimization with multi-output GP | `uv sync --extra bo` |
| `bo_2gp` | Bayesian optimization with two independent GPs | `uv sync --extra bo` |

**Examples:**

```bash
# Gradient ascent ablation (no flow)
uv run python scripts/run_budgeted_optimization.py \
    --method gradient_ascent \
    --budget 100 \
    --seed 42

# Bayesian optimization (requires: uv sync --extra bo)
python scripts/run_budgeted_optimization.py \
    --method bo_mogp \
    --budget 100 \
    --seed 42
```

#### Multi-Seed Experiments (Statistical Comparison)

For rigorous statistical comparison across multiple seeds:

```bash
python scripts/run_multi_seed_experiment.py \
    --config configs/experiments/multi_seed_comparison.yaml
```

This script:
1. Runs all method × seed combinations
2. Aggregates results
3. Generates plots with confidence intervals
4. Computes statistical tests (Mann-Whitney U)
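For readers unfamiliar with the test in step 4, the Mann-Whitney U statistic simply counts how often a run of one method beats a run of another. The sketch below computes it directly from two lists of per-seed scores; it is an illustration of the statistic, not the code used by `run_multi_seed_experiment.py`. For p-values, a library routine such as `scipy.stats.mannwhitneyu` is the usual choice.

```python
from typing import Sequence


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> float:
    """U statistic for sample x: pairs (x_i, y_j) with x_i > y_j, ties counting 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


def probability_of_superiority(x: Sequence[float], y: Sequence[float]) -> float:
    """Estimated probability that a random run from x scores higher than one from y."""
    return mann_whitney_u(x, y) / (len(x) * len(y))
```

The two directions always sum to the number of pairs, `U(x, y) + U(y, x) = len(x) * len(y)`, so a value near half of that indicates no clear winner.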

**Options:**
```bash
# Run with parallel workers
python scripts/run_multi_seed_experiment.py --workers 4

# Run specific stages only
python scripts/run_multi_seed_experiment.py --stages plot,report --log-dir experiments/my_run/

# Quick test with fewer seeds
python scripts/run_multi_seed_experiment.py --n-seeds 3 --methods moltenflow,gradient_ascent
```

> **Important:** Before running Stage C, update the checkpoint paths in the config to point to your trained models from Stages A and B.

---

## Guided Sampling / Optimization (High Level)

At inference time, MoltenFlow modifies the latent flow ODE by adding a surrogate-gradient term:

- **Flow velocity:** `v_ω(z(t), t)`
- **Objective gradient:** `g(z) = ∇_z J(z; c)`
- **Guided dynamics:** `ż(t) = v_ω(z(t), t) - γ · g(z(t))`

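Here `γ` controls the guidance strength: `γ = 0` recovers the unguided flow, and for `γ > 0` the minus sign pushes `z` toward lower values of `J`.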

This supports:
- **Conditioned generation**: Start from noise at t=0, integrate to t=1
- **Local optimization**: Start from an encoded molecule, add noise, and integrate only over the final part of the trajectory (from `t_start` to t=1)
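Both modes can be sketched with a plain explicit Euler integrator for the guided dynamics given above. The code below is a toy, framework-free illustration: `guided_euler` and `perturb_latent` are hypothetical helpers, and the noising rule in `perturb_latent` assumes a straight noise-to-data path, which may differ from the rule used in the paper.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def guided_euler(z: Sequence[float],
                 velocity: Callable[[Sequence[float], float], Sequence[float]],
                 objective_grad: Callable[[Sequence[float]], Sequence[float]],
                 gamma: float,
                 t_start: float = 0.0,
                 t_end: float = 1.0,
                 n_steps: int = 50) -> Vector:
    """Integrate dz/dt = v(z, t) - gamma * g(z) from t_start to t_end with explicit Euler."""
    z = list(z)
    dt = (t_end - t_start) / n_steps
    for k in range(n_steps):
        t = t_start + k * dt
        v = velocity(z, t)
        g = objective_grad(z)
        z = [zi + dt * (vi - gamma * gi) for zi, vi, gi in zip(z, v, g)]
    return z


def perturb_latent(z_enc: Sequence[float], t_start: float, sigma: float,
                   rng: random.Random) -> Vector:
    """Move an encoded latent back to time t_start (assumed rule, see lead-in)."""
    return [t_start * zi + (1.0 - t_start) * sigma * rng.gauss(0.0, 1.0) for zi in z_enc]


# Toy local optimization: zero flow velocity and J(z) = 0.5 * ||z - target||^2,
# so g(z) = z - target and guidance pulls z toward the target.
target = [1.0, -1.0]


def zero_velocity(z: Sequence[float], t: float) -> Vector:
    return [0.0 for _ in z]


def grad_j(z: Sequence[float]) -> Vector:
    return [zi - ti for zi, ti in zip(z, target)]


rng = random.Random(0)
z_start = perturb_latent([0.0, 0.0], t_start=0.8, sigma=0.1, rng=rng)
z_final = guided_euler(z_start, zero_velocity, grad_j, gamma=1.0, t_start=0.8, n_steps=20)
```

Conditioned generation corresponds to calling `guided_euler` with a noise sample and the default `t_start=0.0`. In the real model, `velocity` is the trained flow network and `objective_grad` comes from differentiating the property surrogate.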

---

## Reproducing Paper Settings

Common ZINC optimization settings (see configs for authoritative values):

| Setting | Value |
|---------|-------|
| Dataset | ZINC250K |
| Representation | SELFIES |
| Max length | 128 |
| Vocab size | ~111 |
| Latent dimension | 128 |
| Latent tokens (K) | 8 |

**Optimization-time hyperparameters** (see the experiment configs for values):
- Guidance strength (γ)
- Noise scale (σ)
- Integration start time (t_start)
- Number of Euler steps
- Gradient clipping / normalization
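Gradient clipping and normalization both keep the guidance term from dominating the flow velocity when the surrogate gradient is large. The sketch below shows the two standard variants on plain Python lists. The helper names and the `max_norm` and `eps` parameters are illustrative assumptions; which variant and which thresholds the experiments use is determined by the configs.

```python
import math
from typing import List, Sequence


def l2_norm(g: Sequence[float]) -> float:
    """Euclidean norm of a gradient vector."""
    return math.sqrt(sum(x * x for x in g))


def clip_by_norm(g: Sequence[float], max_norm: float) -> List[float]:
    """Rescale g so its norm is at most max_norm; shorter vectors pass through unchanged."""
    norm = l2_norm(g)
    if norm <= max_norm:
        return list(g)
    scale = max_norm / norm
    return [x * scale for x in g]


def normalize_gradient(g: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Rescale g to (approximately) unit norm; eps avoids division by zero."""
    norm = l2_norm(g)
    return [x / (norm + eps) for x in g]
```

Clipping preserves the magnitude of small gradients, whereas normalization makes the step size depend only on `γ` and the step length.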

---

## Project Structure

```
moltenflow_icml/
├── configs/
│   ├── pretrain_vae_zinc250k.yaml          # Stage A config
│   ├── finetune_vae_and_train_guided_flow.yaml  # Stage B config
│   └── experiments/
│       ├── budgeted_optimization.yaml      # Single optimization run
│       └── multi_seed_comparison.yaml      # Multi-seed experiments
├── scripts/
│   ├── prepare_zinc_250k.py                # Download ZINC250K
│   ├── run_budgeted_optimization.py        # Single optimization run
│   └── run_multi_seed_experiment.py        # Multi-seed experiments
├── src/moltenflow/                         # Main package
│   ├── cli.py                              # CLI entry points
│   ├── data/                               # Data loading utilities
│   ├── models/                             # VAE, Flow, Surrogate
│   ├── training/                           # Training scripts
│   ├── inference/                          # Generation & optimization
│   └── optimization/                       # Budgeted optimization
└── pyproject.toml                          # Dependencies
```

---

## Citation

If you use this code, please cite the accompanying paper.

```bibtex
@inproceedings{moltenflow,
  title     = {Property-Guided Molecular Generation and Optimization via Latent Flows},
  author    = {Anonymous},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026}
}
```
