Surrogate Modeling
==================

The surrogate module predicts molecular properties from latent representations
and supports gradient-based guidance.

.. toctree::
   :maxdepth: 2

   architecture
   training
   guidance

Overview
--------

The surrogate pipeline consists of three stages:

1. **Latent Generation**: Convert SMILES to latent vectors (fingerprints or VAE encodings)
2. **Property Prediction**: MLP-based predictor with optional conditional inputs
3. **Gradient Guidance**: Compute gradients for property optimization
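
The three stages can be illustrated with a minimal NumPy sketch. This is not
MoltenFlow's implementation: a random vector stands in for a real fingerprint or
VAE encoding, the MLP weights are untrained, and the names ``predict`` and
``latent_gradient`` are hypothetical. It only shows how a prediction and its
gradient with respect to the latent vector fit together.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    latent_dim, cond_dim, hidden_dim = 8, 2, 16

    # Stage 1 (stand-in): random vector in place of a fingerprint or VAE encoding.
    z = rng.normal(size=latent_dim)
    # Illustrative condition values, e.g. scaled temperature and pressure.
    cond = np.array([0.3, 1.0])

    # Stage 2: one-hidden-layer MLP with untrained, randomly initialised weights.
    W1 = rng.normal(scale=0.3, size=(hidden_dim, latent_dim + cond_dim))
    b1 = np.zeros(hidden_dim)
    w2 = rng.normal(scale=0.3, size=hidden_dim)
    b2 = 0.0

    def predict(z, cond):
        """Predict one scalar property from a latent vector and its conditions."""
        x = np.concatenate([z, cond])
        h = np.tanh(W1 @ x + b1)
        return float(w2 @ h + b2)

    def latent_gradient(z, cond):
        """Analytic gradient of the prediction with respect to the latent vector."""
        x = np.concatenate([z, cond])
        h = np.tanh(W1 @ x + b1)
        grad_x = W1.T @ (w2 * (1.0 - h ** 2))  # chain rule through tanh
        return grad_x[:latent_dim]             # drop the condition entries

    # Stage 3: one small gradient-ascent step toward a higher predicted property.
    step = 0.01
    z_guided = z + step * latent_gradient(z, cond)

In a real setup an autodiff framework would normally supply the gradient; the
hand-written derivative here just keeps the sketch dependency-free.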

Key Features
------------

- **Conditional Prediction**: Condition predictions on inputs such as temperature and pressure
- **Multi-property**: Predict multiple properties simultaneously
- **Masked Loss**: Train on incomplete data by excluding NaN targets from the loss
- **Gradient-ready**: Designed for guidance-based molecular optimization
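
As an illustration of the masked-loss idea, the sketch below computes a mean
squared error over only those targets that are not NaN. The function name
``masked_mse`` and the choice to return ``0.0`` when no labels are present are
assumptions made for this example, not the module's actual API.

.. code-block:: python

    import numpy as np

    def masked_mse(pred, target):
        """Mean squared error over the entries where target is not NaN."""
        mask = ~np.isnan(target)
        if not mask.any():
            return 0.0  # no labels at all; a convention chosen for this sketch
        # Replace differences at missing targets with 0 so they add nothing.
        diff = np.where(mask, pred - target, 0.0)
        return float((diff ** 2).sum() / mask.sum())

    # Two molecules, three properties; NaN marks a missing measurement.
    pred = np.array([[1.0, 2.0, 3.0],
                     [0.5, 0.0, 1.0]])
    target = np.array([[1.0, np.nan, 2.0],
                       [np.nan, np.nan, 1.0]])
    loss = masked_mse(pred, target)

Dividing by the number of observed entries, rather than by the full array size,
keeps the loss scale comparable between sparsely and densely labelled batches.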

Synthetic Data Generation
-------------------------

For ablation studies, MoltenFlow supports generating synthetic property datasets
from real VAE latent embeddings. This allows controlled experiments with known
ground-truth property relationships.
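
The idea can be sketched in a few lines of NumPy. This is only an illustration
under stated assumptions: random vectors stand in for real VAE latent
embeddings, the linear ground-truth relationship is chosen for simplicity, and
none of the names below come from :mod:`moltenflow.data.toy_dataset`.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, latent_dim = 200, 8

    # Stand-in for real VAE latent embeddings (assumption: random normal vectors).
    latents = rng.normal(size=(n_samples, latent_dim))

    # Known ground-truth relationship chosen for this sketch: linear plus small noise.
    true_w = rng.normal(size=latent_dim)
    noise = 0.01 * rng.normal(size=n_samples)
    y = latents @ true_w + noise

    # Because true_w is known, a fitted model can be scored against it directly.
    fitted_w, *_ = np.linalg.lstsq(latents, y, rcond=None)
    max_weight_error = float(np.abs(fitted_w - true_w).max())

Knowing the generating weights makes it possible to check whether a surrogate
recovers the intended relationship, which is not possible with measured data.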

Generate synthetic data using::

    uv run python scripts/generate_toydata_real_latents.py --pretrain

Or run the full experimental pipeline::

    uv run python scripts/run_toydata_pipeline.py

See :mod:`moltenflow.data.toy_dataset` for the underlying functions.
