![](banner.svg)

**MARCEL** is a PyTorch-based benchmark library that evaluates the potential of machine learning on conformer ensembles across a diverse set of molecules, datasets, and models.

# Why Learning over Conformer Ensembles?

It is critical to recognize that in reality molecules are not rigid, static objects; rather, thermodynamically-permissible rotations of chemical bonds, small vibrational motions, and dynamic intermolecular interactions cause molecules to continuously convert between different conformations. As a consequence, many experimentally observable chemical properties depend on the full distribution of thermodynamically-accessible conformers. Also, it is often challenging to determine *a priori* the conformers that predominantly contribute to molecular properties without doing prohibitively expensive simulations. Therefore, it is important to investigate the *collective* power of many different conformer structures lying on the local minima of the potential energy surface, also known as the *conformer ensemble*, for improving molecular representation learning models.

# Datasets

MARCEL includes four datasets that cover a diverse range of chemical space. They focus on four chemically relevant tasks for both molecules and reactions, with an emphasis on Boltzmann-averaged properties of conformer ensembles computed at the Density-Functional Theory (DFT) level.
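As a rough illustration of what a Boltzmann-averaged property means, the sketch below weights each conformer's property value by its Boltzmann factor. It is not taken from the MARCEL code base: the function names, the choice of kcal/mol as the energy unit, and the default temperature are all assumptions made for this example.

```python
import math

# Gas constant in kcal/(mol*K); energies are assumed to be in kcal/mol.
GAS_CONSTANT = 0.0019872041


def boltzmann_weights(energies, temperature=298.15):
    """Return normalized Boltzmann weights for a list of conformer energies."""
    lowest = min(energies)
    # Shifting by the lowest energy avoids overflow and leaves the weights unchanged.
    factors = [math.exp(-(e - lowest) / (GAS_CONSTANT * temperature)) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


def boltzmann_average(values, energies, temperature=298.15):
    """Average per-conformer property values using Boltzmann weights."""
    if len(values) != len(energies):
        raise ValueError("values and energies must have the same length")
    weights = boltzmann_weights(energies, temperature)
    return sum(w * v for w, v in zip(weights, values))
```

Low-energy conformers dominate the average, while conformers a few kcal/mol above the minimum contribute very little at room temperature.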

## Drugs-75K

Drugs-75K is a subset of the [GEOM-Drugs](https://github.com/learningmatter-mit/geom) dataset, which includes 75,099 molecules with at least 5 rotatable bonds. For each molecule, Auto3D is used to generate and optimize the conformer ensembles and AIMNet-NSE is used to calculate three important DFT-based reactivity descriptors: ionization potential, electron affinity, and electronegativity.

Links: [Download](https://osf.io/rb6sh?view_only=d0fb56497c16419295fcd9168e0f642f), [Instructions](datasets/Drugs)

## Kraken

Kraken is a dataset of 1,552 monodentate organophosphorus(III) ligands along with their DFT-computed conformer ensembles. We consider four 3D catalytic ligand descriptors exhibiting significant variance among conformers: Sterimol B₅, Sterimol L, buried Sterimol B₅, and buried Sterimol L. These descriptors quantify the steric size of a substituent in Å, and are commonly employed for Quantitative Structure-Activity Relationship (QSAR) modeling. The buried Sterimol variants describe the steric effects within the first coordination sphere of a metal.

Links: [Download](https://osf.io/x9ghd?view_only=d0fb56497c16419295fcd9168e0f642f), [Instructions](datasets/Kraken)

## EE

EE is a dataset of 872 catalyst-substrate pairs involving 253 rhodium (Rh)-bound atropisomeric catalysts derived from chiral bisphosphine, with 10 enamides as substrates. The dataset includes conformations of catalyst-substrate transition state complexes in two separate pro-S and pro-R configurations. The task is to predict the Enantiomeric Excess (EE) of the chemical reaction involving the substrate, defined as the absolute difference between the mole fractions of the two enantiomers in the product distribution. This dataset is generated with Q2MM, which automatically generates Transition State Force Fields (TSFFs) in order to simulate the conformer ensembles of each prochiral transition state complex. EE can then be computed from the conformer ensembles by Boltzmann-averaging the activation energies for the competing transition states. Unlike properties in Drugs-75K and Kraken, EE depends on the conformer ensembles of *both* the pro-R and pro-S complexes.

Links: Dataset access not publicly available, [Instructions](datasets/EE)
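
To make the dependence of EE on both ensembles concrete, here is a minimal sketch. It assumes that the amount of each enantiomer formed is proportional to the Boltzmann sum over the transition-state conformer energies of the corresponding pro-R or pro-S complex. This is a textbook-style simplification and not necessarily the exact procedure used to label the dataset; the function names and the kcal/mol unit are assumptions.

```python
import math

# Gas constant in kcal/(mol*K); energies are assumed to be in kcal/mol.
GAS_CONSTANT = 0.0019872041


def boltzmann_sum(energies, reference, temperature):
    """Sum Boltzmann factors of energies measured relative to a shared reference."""
    return sum(
        math.exp(-(e - reference) / (GAS_CONSTANT * temperature)) for e in energies
    )


def enantiomeric_excess(energies_pro_r, energies_pro_s, temperature=298.15):
    """Estimate EE in [0, 1] from the two transition-state conformer ensembles."""
    # Use one reference for both ensembles so their sums stay comparable.
    reference = min(min(energies_pro_r), min(energies_pro_s))
    z_r = boltzmann_sum(energies_pro_r, reference, temperature)
    z_s = boltzmann_sum(energies_pro_s, reference, temperature)
    return abs(z_r - z_s) / (z_r + z_s)
```

Under this assumption, two identical ensembles give an EE of 0, and the value approaches 1 as one pathway becomes much lower in energy than the other.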

## BDE

BDE is a dataset containing 5,915 organometallic catalysts ML₁L₂ consisting of a metal center (M = Pd, Pt, Au, Ag, Cu, Ni) coordinated to two flexible organic ligands (L₁ and L₂), each selected from a 91-membered ligand library. The data includes conformations of each unbound catalyst, as well as conformations of the catalyst when bound to ethylene and bromide after oxidative addition with vinyl bromide. Each catalyst has an electronic binding energy, computed as the difference in the minimum energies of the bound-catalyst complex and unbound catalyst, following the DFT-optimization of their respective conformer ensembles. Although the binding energies are computed via DFT, the conformers provided for modeling are generated with Open Babel. This realistically represents the setting in which precise conformer ensembles are unknown at inference.

Links: [Download](https://osf.io/fyks9?view_only=d0fb56497c16419295fcd9168e0f642f), [Instructions](datasets/BDE)
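
The binding-energy definition above reduces to a difference of two minima. The sketch below illustrates only that arithmetic; the function name is hypothetical, and both ensembles are assumed to be given as lists of energies in the same unit.

```python
def binding_energy(bound_energies, unbound_energies):
    """Difference between the lowest bound-complex and lowest unbound-catalyst energies."""
    if not bound_energies or not unbound_energies:
        raise ValueError("both conformer ensembles must be non-empty")
    # Only the lowest-energy conformer of each ensemble enters the result.
    return min(bound_energies) - min(unbound_energies)
```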

# Benchmarks

## Prerequisites

The following packages are required for running the benchmarks.

* `pytorch >= 1.13.1`
* `pyg >= 2.0`
* `rdkit`
* `nni`
* `ogb`
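
A quick way to see which prerequisites are missing from the current environment is sketched below. The import names (`torch` for PyTorch, `torch_geometric` for PyG) are the usual ones for these packages but are stated here as assumptions, and the helper does not check the version constraints.

```python
from importlib import util

# Maps each package listed above to the module name it is assumed to install.
REQUIRED_MODULES = {
    "pytorch": "torch",
    "pyg": "torch_geometric",
    "rdkit": "rdkit",
    "nni": "nni",
    "ogb": "ogb",
}


def missing_prerequisites(modules=REQUIRED_MODULES):
    """Return the names of packages whose module cannot be found."""
    return [name for name, module in modules.items() if util.find_spec(module) is None]
```

Calling `missing_prerequisites()` returns an empty list when every module can be located.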

## Dataset Loaders

MARCEL implements PyG data loaders for each dataset. Download each dataset and place its zipped file under the corresponding directory, i.e., `datasets/<NAME>/raw`.

| Dataset   | Dataloader class                                           |
| --------- | ---------------------------------------------------------- |
| Drugs-75K | `data.drugs.Drugs`                                         |
| Kraken    | `data.kraken.Kraken`                                       |
| EE        | `data.ee.EE_2D` for 2D models, `data.ee.EE` for the others |
| BDE       | `data.bde.BDE`                                             |
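
Before instantiating a loader, it can help to confirm that the archive sits where the loader expects it. The helper below is a hypothetical convenience, not part of MARCEL. It assumes the downloaded archives use a `.zip` extension and that `<NAME>` matches the directory names used in the dataset instructions (for example `Drugs` or `Kraken`).

```python
from pathlib import Path


def find_raw_archives(root, name):
    """List the .zip files found under <root>/datasets/<name>/raw."""
    raw_dir = Path(root) / "datasets" / name / "raw"
    if not raw_dir.is_dir():
        return []
    return sorted(path.name for path in raw_dir.glob("*.zip"))
```

An empty list means the archive is missing or was placed in a different directory.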

## Batch Samplers

For Drugs-75K and Kraken, use `EnsembleSampler` from `loaders.samplers` to sample mini-batches of molecules. The sampling strategy can be set to `random`, which randomly samples one conformer; `first`, which always loads the first conformer in each ensemble; or `all`, which loads all conformers.
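
The three strategies can be pictured as a choice of conformer indices per ensemble. The sketch below is a standalone illustration of that idea, not the implementation of `EnsembleSampler`; the function name and signature are hypothetical.

```python
import random


def select_conformers(num_conformers, strategy, rng=None):
    """Return the conformer indices one ensemble contributes under a strategy."""
    if num_conformers < 1:
        raise ValueError("an ensemble needs at least one conformer")
    if strategy == "all":
        return list(range(num_conformers))
    if strategy == "first":
        return [0]
    if strategy == "random":
        # Fall back to the module-level generator when no seeded one is given.
        rng = rng or random
        return [rng.randrange(num_conformers)]
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Passing a seeded `random.Random` instance as `rng` makes the `random` strategy reproducible across runs.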

Since EE and BDE involve interactions between two molecules, we implement another sampler, `EnsembleMultiBatchSampler`, also in `loaders.samplers`. In this case, each conformer of the system is loaded as a pair `[data_0, data_1]`, where each element corresponds to one of the two molecules in the system.

## Instructions on Reproducing Results

The default hyperparameters are set in [`config.py`](benchmarks/config.py). Model-dependent parameters are stored separately in the [`params`](benchmarks/params) folder. To reproduce the results for a given model, set the `config` parameter in `ConfigLoader` to the corresponding model parameter file. Then specify `dataset` and `target` in the command-line arguments, and override other parameters (e.g., `learning_rate`) there when necessary.

| Model                            | Training script and key parameters         |
| -------------------------------- | ------------------------------------------ |
| 1D fingerprint model             | `train_fp_rf.py`                           |
| 1D SMILES-based sequential model | `train_1d.py --model1d:model SEQ_ENCODER` |
| 2D model                         | `train_2d.py --model2d:model GRAPH_ENCODER` |
| Single-conformer 3D model        | `train_3d.py --model3d:augmentation False --model3d:model GRAPH_ENCODER` |
| 3D model with conformer sampling | `train_3d.py --model3d:augmentation True  --model3d:model GRAPH_ENCODER` |
| Conformer ensemble model | `train_ensemble.py --model4d:set_encoder SET_ENCODER --model4d:graph_encoder GRAPH_ENCODER` |
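
When launching several of these runs from a script, the table can be encoded as command templates. The helper below is a hypothetical sketch: the dictionary keys and the function name are invented for this example, the encoder values remain placeholders to be replaced with real encoder names, and any further arguments are passed through unchanged.

```python
# Command templates copied from the table above; the keys are illustrative labels.
TEMPLATES = {
    "fingerprint": "train_fp_rf.py",
    "1d": "train_1d.py --model1d:model {seq_encoder}",
    "2d": "train_2d.py --model2d:model {graph_encoder}",
    "3d": "train_3d.py --model3d:augmentation False --model3d:model {graph_encoder}",
    "3d_sampling": "train_3d.py --model3d:augmentation True --model3d:model {graph_encoder}",
    "ensemble": (
        "train_ensemble.py --model4d:set_encoder {set_encoder} "
        "--model4d:graph_encoder {graph_encoder}"
    ),
}


def training_command(kind, extra_args=(), **encoders):
    """Build an argument list suitable for subprocess.run."""
    # A missing encoder raises KeyError, so incomplete commands are never built.
    filled = TEMPLATES[kind].format(**encoders)
    return ["python", *filled.split(), *extra_args]
```

For example, `training_command("2d", graph_encoder="GRAPH_ENCODER")` yields the 2D row of the table as a list of arguments.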


# License

The MARCEL benchmarks are licensed under the Apache 2.0 License.

