# MALT: Multi-Anchor Latent Transduction

This repository contains the reference code for **MALT**, a transductive adaptation framework for molecular property prediction. It includes graph-based, Chemprop-based, and feature-based variants, along with utilities for preprocessing molecular datasets and reproducing the bilinear transduction baseline reported in the paper.

## Table of Contents
- [A. Environment Setup](#a-environment-setup)
- [B. Data Preparation (Optional)](#b-data-preparation-optional)
- [C. Finetuning Inductive Models](#c-finetuning-inductive-models)
- [D. Running MALT Variants (GIN, Chemprop)](#d-running-malt-variants-gin-chemprop)
- [E. Running Transduction Baselines (BLT and Variants)](#e-running-transduction-baselines-blt-and-variants)
- [F. Experiment Tracking & Outputs](#f-experiment-tracking--outputs)
- [G. Repository Layout Highlights](#g-repository-layout-highlights)
- [H. Troubleshooting Tips](#h-troubleshooting-tips)

## A. Environment Setup

1. **Install Miniconda or Anaconda** and ensure your GPU drivers/CUDA match the versions listed in `environment.yml`.

2. **Create the environment:**
   ```bash
   conda env create -f environment.yml
   ```
   **Important:** Before running this command, update the `prefix` entry at the bottom of `environment.yml` so that it points to your local conda installation.

3. **Activate the environment and install the pip requirements:**
   ```bash
   conda activate <your-env-name>  # default is 't'
   pip install -r requirements.txt
   ```

4. **For fresh Chemprop finetuning runs:** Spin up a separate conda environment and install `chemprop==1.6.1` (the version used in `environment.yml`) to keep its dependencies isolated.
   ```bash
   conda create -n chemprop python=3.8
   conda activate chemprop
   pip install chemprop==1.6.1
   ```
   **Note:** For MALT-Chemprop finetuning, this separate environment is not required.

## B. Data Preparation (Optional)

### Datasets and Splits

- **Datasets (Metrics)**
  - MoleculeNet regression (MAE): BACE, ESOL, FreeSolv, Lipophilicity
  - MoleculeNet classification (AUROC): BBBP, ClinTox, SIDER
  - DrugOOD classification (AUROC): EC50, IC50
  - Activity Cliffs regression (RMSE): Ki, EC50
- **Splits for OOD**
  - **Label-shift (Y-splits):** Default naming (e.g., `bace`)
    - Split by a cutoff at the top 5% of target values, following Segal et al.
  - **Covariate-shift (X-splits):** Add the `_x` suffix (e.g., `bace_x`)
    - **Ours:**
      - Split by spectral clustering on molecular cyclic skeletons, following Tilborg et al. (see Appendix F for a detailed analysis)
    - **DrugOOD datasets:**
      - `DrugOOD_ori`: Dataset with the split from the original reference (by molecular size)
      - `DrugOOD_ours`: Dataset split using our method
    - **Activity Cliffs:**
      - Stored under `ac` (activity cliffs are considered OOD)
    - **Lo-Hi (Lead Optimization, Hit Identification) splits:**
      - Following Steshin et al., we use the Lo-Hi split data as OOD data, named `lo` and `hi` (under MoleculeNet).
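
The label-shift (Y-split) rule above can be illustrated with a few lines of Python. This is a minimal sketch, not the code in `process_data.py`: `label_shift_split` is a hypothetical helper, and it assumes that "top 5%" refers to the highest target values.

```python
def label_shift_split(targets, ood_ratio=0.05):
    """Return (in_dist_idx, ood_idx), holding out the highest targets as OOD.

    Hypothetical helper for illustration only.
    """
    if not 0.0 < ood_ratio < 1.0:
        raise ValueError("ood_ratio must be in (0, 1)")
    # Keep at least one OOD sample even for very small datasets.
    n_ood = max(1, round(len(targets) * ood_ratio))
    # Indices sorted by target value, lowest first.
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    return sorted(order[:-n_ood]), sorted(order[-n_ood:])


targets = [float(v) for v in range(100)]  # toy targets 0.0 .. 99.0
in_dist, ood = label_shift_split(targets)
print(len(in_dist), len(ood), ood)
```

With 100 toy targets and the default ratio, the five largest values end up in the OOD set and the rest stay in-distribution.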

### Quick Start with Pre-processed Data

All processed datasets used in the paper live under `data/`, so you can **skip regeneration for quick experiments**.

1. Download the pre-processed data from [this link](https://mega.nz/file/6UdwyJab#gRocqze_0jYZG63KeNhih9QoJ06Y5LnQjHlsRnLIcxA).
2. Unzip and place it under the `data/` folder.

```bash
unzip data.zip -d data/
```

### Preprocess Data from Scratch

Use the scripts in `process_data/`:

```bash
cd process_data
bash create_data.sh  # Edit the dataset/split type/property flags in this script first
```

#### Key flags in `process_data.py`:
- `--property / -p`: Dataset name (e.g., `bace`, `esol_x`)
- `--dataset_split_type / -st`: Scaffold, MCES, or max-dis similarity split (`scaffold` | `mces` | `max_dis`). Default is `scaffold`
- `--data_rep / -dr`: Representation (`gnn` for GNN embeddings, `smi_ted` for SMILES TED embeddings, omit for RDKit descriptors)
- `--split_ratio / -or`: Out-of-distribution split proportion for Y-splits. Default is `0.05`, as in the paper
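
To make the flag semantics concrete, the sketch below mirrors the four flags with `argparse`. It is an assumption about how such a parser could look, not a copy of the parser in `process_data.py`; in particular, marking `--property` as required and the `build_parser` name are illustrative choices.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags documented above."""
    parser = argparse.ArgumentParser(description="Flag sketch for process_data.py")
    parser.add_argument("--property", "-p", required=True,
                        help="Dataset name, e.g. bace or esol_x")
    parser.add_argument("--dataset_split_type", "-st", default="scaffold",
                        choices=["scaffold", "mces", "max_dis"])
    parser.add_argument("--data_rep", "-dr", default=None,
                        choices=["gnn", "smi_ted"],
                        help="Omit to use RDKit descriptors")
    parser.add_argument("--split_ratio", "-or", type=float, default=0.05,
                        help="OOD proportion for Y-splits")
    return parser


args = build_parser().parse_args(["-p", "esol_x", "-dr", "gnn"])
# The `_x` suffix marks a covariate-shift (X) split.
is_x_split = args.property.endswith("_x")
print(args.dataset_split_type, args.data_rep, args.split_ratio, is_x_split)
```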

**Generated artifacts are written to:** `data/<dataset>/<split>/<property>/` as both pickled datasets and CSV summaries.

### Extra Dataset Types

#### For Lo-Hi Splits (Table 16)
To use Lo-Hi splits, which explicitly target drug discovery scenarios (see **Appendix L.2** and **Table 16**), run:

```bash
cd process_data
bash create_lohi.sh
# Creates the `hi` and `lo` splits under data/molnet. Only the eval set is created,
# following the method of Steshin et al. (it is treated as OOD).
```

#### For Activity Cliffs Dataset Generation
To generate the activity cliffs datasets:

```bash
# 1. Get the CSVs from https://github.com/molML/MoleculeACE/tree/main/MoleculeACE/Data/benchmark_data
# 2. Place them in data/MoleculeACE_raw
# 3. Run the processing script
cd process_data
python process_ac.py
```
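
Before running `process_ac.py`, it can help to confirm that the raw CSVs are actually in place. The check below is a minimal sketch: `find_raw_csvs` is a hypothetical helper that is not part of the repository, and it only assumes that the benchmark files end in `.csv`.

```python
from pathlib import Path


def find_raw_csvs(raw_dir="data/MoleculeACE_raw"):
    """Return the sorted CSV file names found directly under raw_dir."""
    root = Path(raw_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.glob("*.csv"))


csvs = find_raw_csvs()
if csvs:
    print(f"Found {len(csvs)} raw CSV file(s)")
else:
    print("No CSVs found: place them in data/MoleculeACE_raw first")
```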

## C. Finetuning Inductive Models

Finetuned checkpoints for all baseline models and every seed are shipped under `baselines/`, organized by dataset type, split type, and property name:

### Checkpoint Locations
- **GNN checkpoints:** `baselines/pretrained_gnns/saved_results/` (GIN encoder)
  - Example: `baselines/pretrained_gnns/saved_results/drugood_ori/scaffold/core_ec50`
- **Chemprop checkpoints:** `baselines/chemprop/saved_results/`
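
The directory convention above (dataset type, then split type, then property name) can be expressed as a small path helper. This is a sketch only: `checkpoint_dir` is a hypothetical function, and it assumes that the Chemprop checkpoints follow the same sub-directory layout as the GNN example.

```python
from pathlib import PurePosixPath

# Checkpoint roots as listed above.
ROOTS = {
    "gnn": PurePosixPath("baselines/pretrained_gnns/saved_results"),
    "chemprop": PurePosixPath("baselines/chemprop/saved_results"),
}


def checkpoint_dir(model, dataset, split, prop):
    """Build <root>/<dataset>/<split>/<property> for the given baseline."""
    return ROOTS[model] / dataset / split / prop


print(checkpoint_dir("gnn", "drugood_ori", "scaffold", "core_ec50"))
```

The printed path matches the GNN example given in the list.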

### Training Baselines

To reproduce or extend finetuning runs, launch `main.py` in the respective baseline directory. 

**Example:** To run on the MoleculeNet BACE dataset for a regression task with seed 42 using our split method:
```bash
cd baselines/pretrained_gnns
python main.py --seed 42 -dn molnet -ds scaffold -p bace -t regression
```

**Results and checkpoints are stored in:** `baselines/pretrained_gnns/results/<dataset_name>/<split>/<property>/<seed>/<timestamp>/`
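
Because every run gets its own `<timestamp>` directory, it is often handy to locate the newest one for a given seed. The sketch below picks the most recently modified sub-directory rather than parsing the timestamp, since the timestamp format is not specified here. `latest_run_dir` is a hypothetical helper, and the demo uses a throwaway directory tree with made-up run names.

```python
import os
import tempfile
from pathlib import Path


def latest_run_dir(seed_dir):
    """Return the most recently modified run directory under seed_dir, or None."""
    runs = [p for p in Path(seed_dir).iterdir() if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


# Demo on a temporary tree with two fake runs.
with tempfile.TemporaryDirectory() as tmp:
    seed_dir = Path(tmp) / "molnet" / "scaffold" / "bace" / "42"
    for i, name in enumerate(["run_a", "run_b"]):
        run = seed_dir / name
        run.mkdir(parents=True)
        os.utime(run, (1_000 + i, 1_000 + i))  # make run_b the newer one
    newest = latest_run_dir(seed_dir).name
print(newest)
```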

### Batch Training
For an easy run, refer to `run_multi_seed.sh` in each baseline model folder. Uncomment the run you want, and the script will run the entire finetuning.
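
If you prefer to drive several seeds from Python instead of the shell script, the commands can be generated as below. This is a sketch under assumptions: it reuses only the flags from the single-run example above, the seed list is illustrative, and `finetune_commands` is a hypothetical helper rather than something shipped in the repository.

```python
import shlex


def finetune_commands(seeds, dataset="molnet", split="scaffold",
                      prop="bace", task="regression"):
    """Build one `python main.py ...` argument list per seed."""
    return [
        ["python", "main.py", "--seed", str(seed),
         "-dn", dataset, "-ds", split, "-p", prop, "-t", task]
        for seed in seeds
    ]


# Illustrative seeds; print the commands instead of executing them.
for cmd in finetune_commands([0, 1, 42]):
    print(shlex.join(cmd))
```

To actually launch a run, each list can be passed to `subprocess.run(cmd, cwd="baselines/pretrained_gnns", check=True)`.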

**For fresh Chemprop finetuning runs, or if you encounter errors:** Create a separate conda environment with `chemprop==1.6.1` (the version used in `environment.yml`) to keep its dependencies isolated, as described in step 4 of [A. Environment Setup](#a-environment-setup). The Chemprop environment is only used when finetuning the model.

## D. Running MALT Variants (GIN, Chemprop)

By default, MALT adapts pretrained inductive models (e.g., GIN or Chemprop) before transduction; see **Appendix H** for training tactics. **We therefore advise finetuning the inductive models before training MALT.** For all batch drivers, uncomment the run you prefer.

### Model Configuration Options

- **Default model encoder:** Assumed to be the finetuned encoder (one per seed). For instance, if you have run `baselines/pretrained_gnns/run_multi_seed.sh`, the checkpoints are under `baselines/pretrained_gnns/saved_results`, which is the default encoder path for a MALT run.

- **To use the pretrained GIN without finetuning:** Change the config's `model.encoder_path` to `baselines/pretrained_gnns/model_gin/supervised_contextpred.pth`

- **For default Chemprop:** Set `model.path` to `None`
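
The two overrides above amount to setting a nested key in the loaded config. The sketch below shows the idea on a plain dictionary. It assumes that dotted names such as `model.encoder_path` correspond to nested mappings in the YAML files; `set_dotted` is a hypothetical helper, and the starting values are placeholders.

```python
def set_dotted(cfg, dotted_key, value):
    """Set cfg['a']['b'] = value for dotted_key 'a.b', creating levels as needed."""
    node = cfg
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return cfg


# Placeholder configs standing in for files loaded from `configs/`.
gin_cfg = {"model": {"encoder_path": "path/to/finetuned_encoder.pth"}}
set_dotted(gin_cfg, "model.encoder_path",
           "baselines/pretrained_gnns/model_gin/supervised_contextpred.pth")

chemprop_cfg = {"model": {"path": "path/to/finetuned_chemprop.pt"}}
set_dotted(chemprop_cfg, "model.path", None)  # fall back to default Chemprop

print(gin_cfg["model"]["encoder_path"], chemprop_cfg["model"]["path"])
```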

Each variant writes outputs to `results/<variant>/<dataset_name>/<split>/<property>/<seed>/<timestamp>/` with prediction pickles and checkpoints.

For an easy run, refer to the `blt_<variant>_run.sh` batch driver for each variant (listed below). Uncomment the run you want, and the script will run it end to end.

### MALT-GIN (GNN Encoder Backbone)

- **Batch driver:** `bash blt_graph_run.sh` (edit the seeds, dataset, and split method at the top of the script)
<!-- - **Single run:** `python blt_graph_main.py --prop_type bace -ds scaffold --config_name topk_euc` -->
- **Configuration files:** Live in `configs/blt_graph/`. Adjust the sampling strategy, memory bank size, encoder paths, and number of anchor candidates there.

### MALT-Chemprop (SMILES Encoder Backbone)

- **Batch driver:** `bash blt_chemprop_run.sh`
<!-- - **Single run:** `python blt_chemprop_main.py --prop_type bace -ds scaffold --config_name topk_euc` -->
- **Configuration:** Ensure the configs in `configs/blt_chemprop/` point to the proper Chemprop checkpoint, and toggle encoder freezing as needed.

### MALT-RDKit (Feature Backbone)

- **Batch driver:** `bash blt_feature_run.sh`
<!-- - **Single run:** `python blt_feature_main.py --prop_type freesolv -ds scaffold --embedding_source None --config_name topk_euc` -->
- **Embedding source options:**
  - `None`: Raw RDKit descriptors
  - `gnn`: GNN embeddings
  - `smi_ted`: SMILES TED embeddings
- **Configuration:** Config templates live in `configs/blt_feature/`

## E. Running Transduction Baselines (BLT and Variants)

The original Bilinear Transduction (BLT) model ([https://github.com/learningmatter-mit/matex](https://github.com/learningmatter-mit/matex)) has been ported to a **GPU version for much faster training and inference**. We have verified that its results are identical to those of the previous CPU version.

- `blt_main.py` and `blt_run.sh` run BLT and save results under `./results`

### Bilinear Transduction Baseline

- **Batch driver:** `bash blt_run.sh`
<!-- - **Single run example:** `python blt_main.py --prop_type freesolv -ds scaffold --embedding_source gnn --config_name blt` -->
- **Configs:** Live under `configs/blt/` and `configs/blt/generated/`

## F. Experiment Tracking & Outputs

### Weights & Biases Integration
Enable Weights & Biases logging by passing `--wandb_log` and `--proj_name <project>` to any training script.

### Output Files
Memory banks, checkpoints, and prediction files are created under each run directory. Check these key files for detailed artifacts:
- `train_deltas.pkl`: Training deltas/memory bank data
- `ckpts/`: Model checkpoints
- `<model>_<split>_preds.pkl`: Prediction files
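
To inspect the prediction files of a run, something like the following may be a useful starting point. It is a sketch: `load_predictions` is a hypothetical helper, the structure of the pickled objects is not documented here, and the demo writes a stand-in payload to a temporary directory. As with any pickle, only load files that you trust.

```python
import pickle
import tempfile
from pathlib import Path


def load_predictions(run_dir):
    """Load every `*_preds.pkl` file in run_dir, keyed by file name."""
    out = {}
    for path in sorted(Path(run_dir).glob("*_preds.pkl")):
        with path.open("rb") as fh:
            out[path.name] = pickle.load(fh)
    return out


# Demo with a fake run directory; real payloads may look different.
with tempfile.TemporaryDirectory() as run_dir:
    fake_payload = {"y_pred": [0.1, 0.9]}
    with open(Path(run_dir) / "gin_scaffold_preds.pkl", "wb") as fh:
        pickle.dump(fake_payload, fh)
    preds = load_predictions(run_dir)
print(list(preds))
```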

## G. Repository Layout Highlights

```
├── baselines/          # GNN and Chemprop finetuning code plus pretrained checkpoints
├── configs/            # YAML configuration files for all model families
├── process_data/       # Data preprocessing scripts and utilities
├── transducers/        # Core implementation of MALT modules
├── trainers/           # Trainer classes for each variant
├── models/             # Model architectures and components
├── utils/              # Utility functions and helpers
└── data/               # Processed molecular datasets
```

**Core Implementation Details:**
- `transducers/`: Includes memory bank creation, transducer code
- `trainers/`: Trainer for each variant
- `models/`: MALT modules implementation

## H. Troubleshooting Tips

### Common Issues & Solutions

**CUDA Compatibility:** Confirm CUDA 11.8 is available when using the default environment (`pytorch-cuda=11.8`).

**Data Synchronization:** If you modify `process_data/create_data.sh`, keep the property list synchronized with available raw data in `data/raw/`.

**Task Types:** Script defaults assume a regression task on MoleculeNet splits. Add `-t classification` for classification benchmarks and ensure corresponding configs are set.
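
A small lookup can guard against forgetting the `-t` flag. The sketch below follows the MoleculeNet grouping listed in section B. The lowercase property names are assumptions modeled on `bace` and `esol_x`, and `task_flag` is a hypothetical helper.

```python
# MoleculeNet grouping from section B; names are assumed lowercase identifiers.
REGRESSION = {"bace", "esol", "freesolv", "lipophilicity"}
CLASSIFICATION = {"bbbp", "clintox", "sider"}


def task_flag(prop):
    """Return the `-t` arguments for a property name, ignoring an `_x` suffix."""
    base = prop.lower()
    if base.endswith("_x"):
        base = base[:-2]
    if base in CLASSIFICATION:
        return ["-t", "classification"]
    if base in REGRESSION:
        return ["-t", "regression"]
    raise ValueError(f"Unknown property: {prop!r}")


print(task_flag("bbbp_x"), task_flag("freesolv"))
```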

**Model Paths:** Remember to update the paths to the finetuned inductive models in the configs (e.g., `model.encoder_path`) before running MALT.

### Notes
- Use pre-processed datasets for quick experimentation
- The GPU-optimized BLT provides much faster training than the original CPU version
- Ensure sufficient CPU for large molecular datasets

---

**Happy experimenting!** 