# General Note
The toy dataset and the pretrained models are not included because ICLR Supplementary Materials are limited to 100 MB. All data and models will be made available with the final publication.

# Input File Format Requirements

The input data must be provided in either CSV or XLSX format with the following columns:

- **Protein-Id**: The identifier for the protein
- **Ligand_SMILES** (optional): The SMILES representation of the ligand
- **output** (optional): The target variable

## Example Format

### CSV Example:
```csv
Protein-Id,Ligand_SMILES,output
P12345,CC(=O)NC1=CC=C(O)C=C1,0.85
P67890,CN1C=NC2=C1C(=O)N(C)C(=O)N2C,0.92
Q11111,CC1=CC=C(CC(=O)O)C=C1,0.76
```

### Excel (XLSX) Example:
| Protein-Id | Ligand_SMILES | output |
|------------|---------------|--------|
| P12345 | CC(=O)NC1=CC=C(O)C=C1 | 0.85 |
| P67890 | CN1C=NC2=C1C(=O)N(C)C(=O)N2C | 0.92 |
| Q11111 | CC1=CC=C(CC(=O)O)C=C1 | 0.76 |
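
Before running the scripts below, it can help to confirm that a file has the expected header. The following is a minimal sketch using only the Python standard library; the helper name `check_header` is hypothetical and not part of this repository, and whether the scripts tolerate additional columns is not stated here, so unknown columns are only reported.

```python
import csv
import io

REQUIRED_COLUMNS = {"Protein-Id"}               # always needed
OPTIONAL_COLUMNS = {"Ligand_SMILES", "output"}  # may be omitted


def check_header(csv_text):
    """Return (missing required columns, unrecognised columns) for CSV text."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    missing = sorted(REQUIRED_COLUMNS - set(header))
    unknown = sorted(set(header) - REQUIRED_COLUMNS - OPTIONAL_COLUMNS)
    return missing, unknown


example = "Protein-Id,Ligand_SMILES,output\nP12345,CC(=O)NC1=CC=C(O)C=C1,0.85\n"
missing, unknown = check_header(example)
print("missing:", missing, "unknown:", unknown)
```

For a file on disk, pass its contents, for example `check_header(open(path).read())`.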



# Downloading PDB Files

The `download_PDBs_from_UIDs.py` script downloads PDB files for the proteins listed in your input files. Use it if you do not yet have PDB files for the proteins in your train, validation, and test files. If the entries in the "Protein-Id" column are UniProt IDs, the script automatically downloads the corresponding structures predicted by AlphaFold 2.

## Arguments

- `--train_file`: Path to the training data file (optional)
- `--val_file`: Path to the validation data file (optional)
- `--test_file`: Path to the test data file (optional)
- `--PDB_dir`: Directory where the downloaded PDB files will be saved

## Important Notes

- At least one input file (train_file, val_file, or test_file) must be specified
- The input files should contain a column named 'Protein-Id' with UniProt IDs
- The script supports both CSV and Excel (xlsx) file formats
- PDB files will be downloaded from AlphaFold's database and saved as `[UniProt_ID].pdb` in the specified PDB directory
- If a PDB file already exists in the target directory, it will be skipped to avoid redundant downloads
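
The skip-if-present behaviour described in the notes above can be pictured with the following sketch. It is not the script's actual code: `ids_to_download` is a hypothetical helper that takes the Protein-Id values from all input files plus the file names already in the PDB directory, and returns the IDs that would still need a download.

```python
def ids_to_download(protein_ids, existing_files):
    """Return unique IDs, in first-seen order, that have no `<ID>.pdb` yet."""
    existing = set(existing_files)
    pending = []
    for uid in dict.fromkeys(protein_ids):  # drops duplicates, keeps order
        if f"{uid}.pdb" not in existing:
            pending.append(uid)
    return pending


# In practice `existing_files` could come from os.listdir(PDB_dir).
print(ids_to_download(["P12345", "P67890", "P12345"], ["P67890.pdb"]))
```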

## Example Usage

To download PDB files for the toy dataset:
```bash
python download_PDBs_from_UIDs.py --train_file ../toy_dataset/toy_IC50_train.csv --val_file ../toy_dataset/toy_IC50_val.csv --test_file ../toy_dataset/toy_IC50_test.csv --PDB_dir ../toy_dataset/PDBs
```


# PDB File Preprocessing

The `preprocess_PDBs.py` script converts PDB files into a format suitable for model input. It reads the 3D structural information from each PDB file and stores the atom types and their coordinates as NumPy arrays. Atom types are encoded as integers, and coordinates are rounded to the specified resolution. The script creates two subfolders in `save_dir`: `numpy_3D_point_lists` and `protein_3D_bounding_boxes`.
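
To illustrate the idea of integer atom types and coordinates snapped to a grid, here is a rough sketch in plain Python. It is an assumption-laden illustration, not the script's implementation: the element-to-integer mapping `ELEMENT_CODES` is hypothetical, only `ATOM` records are read, and coordinates are converted to the nearest grid index at the given resolution.

```python
# Hypothetical integer codes; the real script may use a different mapping.
ELEMENT_CODES = {"C": 0, "N": 1, "O": 2, "S": 3}
UNKNOWN_CODE = 4


def atoms_from_pdb_lines(lines, resolution=1.0):
    """Return (type_code, i, j, k) tuples for every ATOM record."""
    atoms = []
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        # Fixed-width PDB columns: x = 31-38, y = 39-46, z = 47-54, element = 77-78.
        x, y, z = (float(line[a:b]) for a, b in ((30, 38), (38, 46), (46, 54)))
        element = line[76:78].strip().upper()
        code = ELEMENT_CODES.get(element, UNKNOWN_CODE)
        # Snap each coordinate to the nearest grid index at the given resolution.
        atoms.append((code,) + tuple(round(v / resolution) for v in (x, y, z)))
    return atoms
```

Calling `atoms_from_pdb_lines(open("P12345.pdb"), resolution=1.0)` would return one tuple per atom, with the file name taken from the example table above.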

## Required Arguments

- `--save_dir`: Directory where the processed files will be saved. If you also process small molecules, use the same `--save_dir` as for `preprocess_SMILES.py`
- `--PDB_dir`: Directory containing the input PDB files
- `--resolution`: Resolution of 3D grid in Angstrom (default: 1.0)

## File Name Requirements

The PDB files should be named as `Protein_ID.pdb`, where `Protein_ID` corresponds to the entries in the "Protein-Id" column of your input files. The processed files will be saved as `Protein_ID.npy`.

## Storage Requirements

Storing the processed files for 1000 proteins requires approximately 15 GB of disk space. Make sure you have sufficient storage available before processing large datasets.

## Batch Processing

For processing large numbers of PDB files, you can use batch processing with the following arguments:

- `--number_of_files`: Number of PDB files to process in each batch
- `--index`: Index of the batch to process (0-based)

To process only a subset of the PDB files per call, set `--number_of_files` to the batch size and `--index` to the batch you want to process. This is useful when you have a large number of PDB files or want to process them in parallel.

The total number of batches will be `ceil(total_number_of_PDBs / number_of_files)`. For example, if you have 10000 PDB files and want to process 1000 PDB files in each subset, you can set `--number_of_files` to 1000 and run the script 10 times with `--index` ranging from 0 to 9.

To disable batch processing, omit `--index` or set it to -1.
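
The batch arithmetic can be sketched as follows. The helper names `number_of_batches` and `select_batch` are hypothetical, and the assumption that files are taken in a fixed (for example sorted) order is ours; the script may order files differently.

```python
import math


def number_of_batches(total_number_of_pdbs, number_of_files):
    """ceil(total_number_of_PDBs / number_of_files), as described above."""
    return math.ceil(total_number_of_pdbs / number_of_files)


def select_batch(files, number_of_files, index=-1):
    """Return the files of batch `index` (0-based); a negative index means all files."""
    if index < 0:
        return list(files)
    start = index * number_of_files
    return list(files[start:start + number_of_files])


files = [f"P{i:05d}.pdb" for i in range(25)]  # placeholder file names
print(number_of_batches(len(files), 10))
print(select_batch(files, 10, index=2))
```

Note that the last batch may be smaller than `number_of_files` when the total is not an exact multiple.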

## Example Usage

To process the toy dataset's PDB files without batch processing, assuming they are stored in `../toy_dataset/PDBs`:
```bash
python preprocess_PDBs.py --save_dir ../toy_dataset/processed_data --PDB_dir ../toy_dataset/PDBs
```

## Output

The script creates two subfolders in `save_dir`: `numpy_3D_point_lists` and `protein_3D_bounding_boxes`. The `numpy_3D_point_lists` subfolder contains the NumPy arrays storing the atom types and their coordinates, and the `protein_3D_bounding_boxes` subfolder contains the bounding box of each protein.



# SMILES Preprocessing

The `preprocess_SMILES.py` script processes SMILES representations of molecules to generate embeddings using the Molformer model. These embeddings can be used as input features for the model.

## Arguments

- `--save_dir`: Directory where the processed files will be saved
- `--train_file`: Path to the training data file 
- `--val_file`: Path to the validation data file 
- `--test_file`: Path to the test data file 
- `--model_path`: Path to the pre-downloaded Molformer model (only required when running without internet access)
- `--continue_from_previous_file`: Set to True to continue from a previous interrupted run

## Important Notes

### Model Download
- If you have internet access: Do not set the `--model_path`. The Molformer model will be downloaded automatically on first execution (this may take some time)
- If executed without internet access: Download the Molformer model beforehand and specify its location using `--model_path`

### Input Files
- The input files (train_file, val_file, test_file) are optional, but at least one of them must be specified
- The input files should contain a column named 'Ligand_SMILES' with SMILES strings
- SMILES embeddings will be computed for each specified file

### Resuming Previous Runs
- If your processing was interrupted, you can resume by setting `--continue_from_previous_file` to True
- This will load existing SMILES embeddings and continue processing from where it left off
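
Conceptually, resuming amounts to computing embeddings only for SMILES that are not yet in the stored dictionary. The sketch below illustrates this under two assumptions that are not confirmed by this README: that the dictionary is keyed by SMILES string, and that embeddings are produced by some function `embed_fn`, which here is a hypothetical stand-in for the Molformer model.

```python
def embed_missing(smiles_list, existing, embed_fn):
    """Add embeddings for SMILES not yet in `existing`; return how many were new."""
    added = 0
    for smiles in smiles_list:
        if smiles in existing:
            continue  # already computed in an earlier run
        existing[smiles] = embed_fn(smiles)
        added += 1
    return added


# Toy stand-in for the real model: the "embedding" is just the string length.
cache = {"CCO": [3.0]}
print(embed_missing(["CCO", "CC(=O)O"], cache, lambda s: [float(len(s))]))
```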

### Output
- The script writes a dictionary of SMILES embeddings to `save_dir/SMILES/SMILES_repr.npy`

## Example Usage

To process SMILES from the toy dataset:
```bash
python preprocess_SMILES.py --save_dir ../toy_dataset/processed_data --train_file ../toy_dataset/toy_IC50_train.csv --val_file ../toy_dataset/toy_IC50_val.csv --test_file ../toy_dataset/toy_IC50_test.csv
```

# Training the ProteinVista model

The `training.py` script trains the ProteinVista model on your input data. It supports both protein-only and protein-ligand interaction (PLI) prediction tasks. The script utilizes PyTorch with mixed precision training and can run on single or multiple GPUs for parallel processing. 

## Required Arguments

- `--save_dir`: Directory where checkpoints, logs, and preprocessing outputs will be written.
- `--train_file`: Path to the training data file (CSV/XLSX).
- `--val_file`: Path to the validation data file used for early stopping and metric reporting.
- `--learning_rate`: Initial learning rate for the Adam optimiser.
- `--batch_size`: Mini-batch size per GPU.
- `--num_epochs`: Total number of training epochs.
- `--num_workers`: Number of CPU workers for the PyTorch dataloader.
- `--log_name`: Name prefix for the log file and saved checkpoints.

## Optional Arguments

- `--weight_decay` (default `0.01`): L2 regularisation strength.
- `--pretrained_model`: Path to a pretrained ProteinVista checkpoint to initialise the network.
- `--classification` (default `False`): Set to `True` for classification tasks; when `False` the model performs regression.
- `--balance_classes` (default `False`): Enable weighted sampling to address class imbalance in classification mode.
- `--pos_class_weight` (default `1.0`): Positive-class weight for the BCE loss when performing classification.
- `--PLI` (default `False`): Enable protein-ligand interaction training using Molformer-derived SMILES embeddings.
- `--fix_Molformer` (default `False`): Freeze Molformer parameters during training (useful when fine-tuning only the protein branch).
- `--port` (default `12576`): TCP port used by PyTorch for distributed training; change if the default is occupied.
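
As an illustration of what weighted sampling for class imbalance usually means, the sketch below assigns each sample a weight inversely proportional to the size of its class, so that every class contributes the same total weight. This is one common scheme and only an assumption about what `--balance_classes` does; weights of this form are what a sampler such as PyTorch's `WeightedRandomSampler` accepts.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample by 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


labels = [0, 0, 0, 1]  # three negatives, one positive
print(balanced_sample_weights(labels))
```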

## Output

During training, the script creates the following artefacts inside `--save_dir`:

- `${save_dir}/models/<log_name>_best_model.pth`: The best-performing model checkpoint (highest MCC for classification or R² for regression) saved from GPU 0.
- `${save_dir}/log_outputs/<log_name>.log`: Detailed training and validation logs including metrics for every epoch.

If training on multiple GPUs, only the primary process (GPU 0) writes checkpoints to disk to avoid race conditions.
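
For reference, the two selection metrics named above can be written out in a few lines. This is a plain-Python sketch of the standard definitions of MCC and R², together with a hypothetical `best_epoch` helper; it is not taken from `training.py`, and `r2` assumes the targets are not all identical.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def best_epoch(scores):
    """Index of the epoch with the highest validation score (first on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)
```

With per-epoch validation scores collected in a list, `best_epoch(scores)` identifies the epoch whose checkpoint would be kept.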

## Example Usage

To train the ProteinVista model on the toy dataset:
```bash
python training.py --save_dir ../toy_dataset/processed_data --train_file ../toy_dataset/toy_IC50_train.csv --val_file ../toy_dataset/toy_IC50_val.csv --log_name "toy_IC50" --learning_rate 0.001 --batch_size 2 --num_epochs 10 --num_workers 0 --weight_decay 0.01 --pretrained_model "../models/ProteinVista_122M.pth" --PLI --fix_Molformer
```


# Inference

The `inference.py` script generates predictions for new proteins or protein-ligand pairs using a trained ProteinVista checkpoint. It reads the preprocessed 3D protein grids and, if `--PLI` is set, the SMILES embeddings stored in `--save_dir`, runs the model, and writes the results to disk.

## Required Arguments

- `--save_dir`: Directory containing the pre-processed data and where predictions will be written.
- `--input_file`: CSV/XLSX file listing the samples to score. Must contain a **Protein-Id** column and, when `--PLI` is set, a **Ligand_SMILES** column.
- `--pretrained_model`: Path to a trained checkpoint (`*.pth`).
- `--batch_size`: Batch size per GPU.
- `--num_workers`: Number of dataloader worker processes.

## Optional Arguments

- `--classification` (default `False`): Set to `True` for classification checkpoints (outputs are sigmoid probabilities).
- `--PLI` (default `False`): Enable protein-ligand mode (requires SMILES embeddings).
- `--num_iterations` (default `1`): Number of forward passes per sample; predictions are averaged over iterations to smooth stochastic augmentations.
- `--log_name`: Prefix for the inference log file written to `${save_dir}/log_outputs`.
- `--port` (default `12576`): Port used for distributed inference across multiple GPUs.
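
The averaging controlled by `--num_iterations` can be pictured as follows. This is a simplified sketch, not the code in `inference.py`: `predict_fn` is a hypothetical stand-in for one stochastic forward pass of the model on a single sample.

```python
def average_predictions(predict_fn, samples, num_iterations=1):
    """Run `num_iterations` passes over all samples and average per sample."""
    totals = [0.0] * len(samples)
    for _ in range(num_iterations):
        for i, sample in enumerate(samples):
            totals[i] += predict_fn(sample)
    return [total / num_iterations for total in totals]
```

With `num_iterations=1` this reduces to a single forward pass per sample; larger values trade run time for smoother predictions.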

## Output

The script writes a new folder `${save_dir}/predictions` containing:

- `predictions.npy` – NumPy dictionary mapping `<ProteinID>_<LigandID>` to the predicted value (float or probability).
- `predictions.csv` – Copy of `--input_file` with an added **Prediction** column.



To quickly inspect the results:

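```bash
# Set this to the --save_dir used for inference
save_dir=../toy_dataset/processed_data

# Preview the first 5 lines of the CSV (header plus 4 rows)
head -n 5 "${save_dir}/predictions/predictions.csv"

# Inspect the NumPy dictionary in Python.
# The path is passed as an argument because the quoted heredoc ('PY')
# prevents the shell from expanding ${save_dir} inside the Python code.
python - "${save_dir}" << 'PY'
import os, pprint, sys
import numpy as np
pred = np.load(os.path.join(sys.argv[1], "predictions", "predictions.npy"), allow_pickle=True).item()
pprint.pprint(list(pred.items())[:5])
PY
```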

## Example Usage

```bash
python inference.py \
  --save_dir ../toy_dataset/processed_data \
  --input_file ../toy_dataset/toy_IC50_test.csv \
  --batch_size 6 \
  --PLI \
  --num_iterations 2 \
  --pretrained_model ../toy_dataset/processed_data/models/toy_IC50_best_model.pth \
  --num_workers 0 \
  --log_name "toy_IC50_inference"
```

# Extract representations
