# README: Supplementary Code for MagicDock

## Overview
This supplementary material provides the implementation of the molecular docking framework described in the submitted manuscript. The codebase processes protein and ligand data, performs unsupervised pre-training and supervised fine-tuning, and generates ligand structures, following the theoretical framework outlined in the paper. The implementation is modular and builds on established libraries for molecular data processing, point cloud generation, and machine learning, with a focus on SE(3)-equivariant modeling.

## System Requirements
The code has been tested on the following environment:
- **OS/Kernel**: Ubuntu 22.04.5 LTS, Linux 5.15.0-126-generic
- **CUDA**: Driver-reported version 12.8, Toolkit 12.4 (`nvcc`)
- **Python/Conda**: Python 3.11.10
- **Deep Learning Stack**: PyTorch 2.1.0 (cu118 build), torchvision 0.16.0, torchaudio 2.1.0
- **Additional Dependencies**:
  - BioPython, RDKit, Open3D (for molecular and structural processing)
  - NumPy, Pandas, SciPy, scikit-learn, Matplotlib (for data handling and visualization)
  - torch-geometric (for graph-based neural networks)
  - PyRosetta (for ligand refinement and energy calculations)

Versions of the additional dependencies are not pinned. Install releases compatible with Python 3.11 and PyTorch 2.1.0 in a Conda environment.
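
Since the additional dependencies are listed without version numbers, it can help to record what is actually installed before running the scripts. The following is a minimal sketch, not part of the submitted code. The package list is an assumption based on the dependencies named above, and PyRosetta is left out because it is installed through its own channel.

```python
from importlib import metadata

# Distribution names as published on PyPI. This list is an assumption based on
# the dependencies named in this README, not a pinned requirements file.
PACKAGES = [
    "torch", "torchvision", "torchaudio", "torch-geometric",
    "biopython", "rdkit", "open3d",
    "numpy", "pandas", "scipy", "scikit-learn", "matplotlib",
]


def installed_versions(packages=PACKAGES):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:16s} {version or 'NOT INSTALLED'}")
```

Saving the printed table alongside any results makes it easier to compare a reviewer's environment with the tested one.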

## Code Structure
The codebase consists of several Python scripts, each corresponding to a component of the docking pipeline:

- **`process_skempi_v2.py`**: Processes the SKEMPI v2.0 dataset to generate point cloud representations for protein-protein interactions, including atomic coordinates, chemical features, and binding affinity (\(\Delta G\)).
- **`compute_pointcloud.py`**: Generates point clouds and features (e.g., normals, chemical properties) from PDB files, incorporating interface labels for protein structures.
- **`compute_pointcloud-ligand.py`**: Extends point cloud generation to ligands (proteins or small molecules), supporting PDB and SDF inputs for compatibility with diverse datasets.
- **`process_pdbbind.py`**: Processes the PDBbind dataset, generating point clouds and features for protein-ligand complexes, including SMILES representations for small molecules.
- **`unsupervised_pre_training.py`**: Implements unsupervised pre-training using a VQ-VAE architecture to learn SE(3)-equivariant latent representations of molecular surfaces.
- **`unsupervised_pre_training_11dim_SE3.py`**: A variant of pre-training optimized for 11-dimensional feature inputs, incorporating SE(3)-equivariant convolutions and pair-based contrastive learning.
- **`supervised_fine_tuning.py`**: Performs supervised fine-tuning of the pre-trained model for protein-protein interaction prediction, optimizing for pocket identification and binding affinity.
- **`supervised_fine_tuning_protein.py`**: Extends fine-tuning to protein-ligand interactions, incorporating pocket labels and chemical constraints.
- **`ligand_generation.py`**: Implements the gradient-driven inversion framework for ligand generation, optimizing structures to achieve target binding affinities.
- **`ligand_generation-pro.py`**: An enhanced ligand generation module, integrating PyRosetta for structural refinement and binding energy evaluation.
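
To make the point cloud step concrete, the sketch below reads the fixed-width `ATOM`/`HETATM` records of a PDB file and returns coordinates with a simple per-atom feature. It is an illustration only: the function names and the one-hot element feature are assumptions made for this example, and the features computed by `compute_pointcloud.py` (normals, chemical properties, interface labels) are richer than what is shown here.

```python
# Elements given their own one-hot channel. This feature set is an
# illustrative assumption, not the one used by compute_pointcloud.py.
ELEMENTS = ("C", "N", "O", "S")


def parse_atom_line(line):
    """Parse one fixed-width PDB ATOM/HETATM record into (element, (x, y, z))."""
    # PDB columns 31-38, 39-46, 47-54 hold x, y, z; columns 77-78 the element.
    x = float(line[30:38])
    y = float(line[38:46])
    z = float(line[46:54])
    element = line[76:78].strip().upper()
    if not element:
        # Crude fallback: first letter of the atom name (columns 13-16).
        element = line[12:16].strip()[:1].upper()
    return element, (x, y, z)


def pdb_to_point_cloud(lines):
    """Return (points, features) for all ATOM/HETATM records in `lines`.

    points   : list of (x, y, z) tuples in angstroms
    features : one-hot lists over ELEMENTS plus a final "other" slot
    """
    points, features = [], []
    for line in lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        element, xyz = parse_atom_line(line)
        one_hot = [0] * (len(ELEMENTS) + 1)
        index = ELEMENTS.index(element) if element in ELEMENTS else len(ELEMENTS)
        one_hot[index] = 1
        points.append(xyz)
        features.append(one_hot)
    return points, features


def centroid(points):
    """Mean position of the cloud, e.g. for centering before training."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))
```

Passing an open file handle (or any iterable of lines) to `pdb_to_point_cloud` yields two parallel lists, one entry per atom. Non-atom records such as `REMARK` or `TER` are skipped.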

## Usage
Each script is designed to be run independently with configurable input/output paths. Placeholder paths (e.g., `""`) in the code should be replaced with actual file paths specific to your dataset and environment. Example usage:

```bash
# Example pipeline starting from SKEMPI v2.0
python process_skempi_v2.py
python compute_pointcloud.py
python unsupervised_pre_training.py
python supervised_fine_tuning_protein.py
python ligand_generation-pro.py
```

```bash
# Example pipeline starting from PDBbind
python process_pdbbind.py
python compute_pointcloud-ligand.py
python unsupervised_pre_training_11dim_SE3.py
python supervised_fine_tuning.py
python ligand_generation.py
```

Scripts assume access to datasets (e.g., SKEMPI v2.0, PDBbind) in PDB, SDF, or PLY formats. The pipeline processes raw molecular data, generates point clouds, trains models, and produces optimized ligand structures, as described in the manuscript.

### Setup Instructions
1. **Environment Setup**:
   ```bash
   conda create -n docking python=3.11.10
   conda activate docking
   conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
   pip install biopython rdkit open3d torch-geometric numpy pandas scipy scikit-learn matplotlib
   ```
   Install PyRosetta separately following its official documentation.

2. **Data Preparation**:
   - Obtain the SKEMPI v2.0 or PDBbind dataset in the formats the scripts expect (PDB, SDF, or PLY).
   - Update file paths in each script to point to your dataset and output directories.

3. **Running the Pipeline**:
   - Start with data processing (`process_skempi_v2.py` or `process_pdbbind.py`).
   - Generate point clouds (`compute_pointcloud.py` or `compute_pointcloud-ligand.py`).
   - Perform pre-training (`unsupervised_pre_training.py` or `unsupervised_pre_training_11dim_SE3.py`).
   - Fine-tune the model (`supervised_fine_tuning.py` or `supervised_fine_tuning_protein.py`).
   - Generate ligands (`ligand_generation.py` or `ligand_generation-pro.py`).
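
The stages can also be chained from a small driver script. The sketch below is one possible way to do this and is not included in the submission. The stage order follows the usage examples in this README, while the track names `skempi` and `pdbbind` and both function names are invented for illustration. It defaults to a dry run that only prints the commands, since each script still needs its placeholder paths filled in.

```python
import subprocess
import sys

# Stage order taken from the usage examples in this README. The track names
# "skempi" and "pdbbind" are labels invented for this sketch.
PIPELINES = {
    "skempi": [
        "process_skempi_v2.py",
        "compute_pointcloud.py",
        "unsupervised_pre_training.py",
        "supervised_fine_tuning_protein.py",
        "ligand_generation-pro.py",
    ],
    "pdbbind": [
        "process_pdbbind.py",
        "compute_pointcloud-ligand.py",
        "unsupervised_pre_training_11dim_SE3.py",
        "supervised_fine_tuning.py",
        "ligand_generation.py",
    ],
}


def build_commands(track, python=sys.executable):
    """Return the ordered list of argv lists for one track."""
    if track not in PIPELINES:
        raise ValueError(f"unknown track {track!r}; expected one of {sorted(PIPELINES)}")
    return [[python, script] for script in PIPELINES[track]]


def run_pipeline(track, dry_run=True):
    """Print each stage command and, unless dry_run is set, execute it."""
    for command in build_commands(track):
        print("$", " ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing stage.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_pipeline("skempi", dry_run=True)
```

Calling `run_pipeline("pdbbind", dry_run=False)` from the directory containing the scripts would execute the second track and stop at the first stage that exits with an error.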

## Notes for Reviewers
- **Modularity**: The codebase is structured to mirror the manuscript’s pipeline (data processing, pre-training, fine-tuning, and generation), facilitating verification of the proposed methods.
- **Anonymity**: File paths and model checkpoints are placeholders to maintain anonymity, as per submission guidelines.
- **Reproducibility**: The code includes robust error handling, logging, and progress tracking. Reviewers may test functionality by replacing placeholders with sample data paths.
- **Limitations**: Scripts rely on external datasets and tools (e.g., PyRosetta). Reviewers should focus on the algorithmic logic and its alignment with the manuscript’s theoretical claims.
- **Environment**: The code was tested on Ubuntu 22.04.5 with CUDA Toolkit 12.4 and PyTorch 2.1.0 (cu118 build). Minor adjustments may be needed for different CUDA or Python versions.

## License
This code is provided solely for academic review and should not be distributed or used outside the context of the manuscript evaluation.