# L3_PPI: Protein-Protein Interaction Prediction with Graph Neural Networks and Prompt Learning

A deep learning framework for predicting protein-protein interactions (PPIs) using graph neural networks enhanced with prompt learning techniques.

## Overview

This project predicts protein-protein interactions by combining:
- **Graph Neural Networks (GNNs)** for capturing protein interaction patterns
- **Prompt Learning** for improved model adaptability
- **Multi-level validation** using different data splitting strategies

## Features

- 🧬 **Multiple Model Architectures**: Prompt-enhanced GNN, standard GNN, and MLP baselines
- 📊 **Comprehensive Evaluation**: BFS, DFS, and random data splitting for robust validation  
- 🔧 **Flexible Training**: Configurable hyperparameters and training strategies
- 📈 **Advanced Techniques**: Label smoothing, gradient clipping, and adaptive thresholds
- 🎯 **Class Balance Protection**: Built-in mechanisms to prevent prediction bias

## Installation

### Requirements
```bash
pip install torch torch-geometric
pip install numpy pandas scikit-learn
pip install wandb tqdm termcolor
```

### Dependencies
- Python >= 3.7
- PyTorch >= 1.8.0
- PyTorch Geometric >= 2.0.0
- NumPy, Pandas, Scikit-learn
- Weights & Biases (wandb) for experiment tracking

## Data Structure

```
data/
├── protein.actions.yeast.tsv              # Yeast PPI data
├── protein.yeast.sequences.dictionary.tsv # Protein sequence dictionary  
├── vec5_CTC.txt                           # Protein vector representations
├── protein.actions.SHS27k.STRING.txt      # Alternative dataset (27k)
└── protein.actions.SHS148k.STRING.txt     # Alternative dataset (148k)
```
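
Before training, it can help to confirm that the expected files are in place. The snippet below is a minimal sketch: the helper name `missing_data_files` is hypothetical and not part of the repository, and the file list simply mirrors the yeast files in the tree above.

```python
from pathlib import Path

# File names copied from the data/ tree above (yeast dataset only).
EXPECTED_FILES = [
    "protein.actions.yeast.tsv",
    "protein.yeast.sequences.dictionary.tsv",
    "vec5_CTC.txt",
]


def missing_data_files(data_dir="data", expected=EXPECTED_FILES):
    """Return the expected file names that are absent from data_dir.

    Hypothetical helper for illustration; not part of the repository.
    """
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing data files:", ", ".join(missing))
    else:
        print("All expected data files found.")
```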

## Quick Start

### Basic Usage
```bash
python run_prompt_binding.py
```

### Advanced Training
```bash
python gnn_prompt_train_binding.py \
    --description="random" \
    --ppi_path="data/protein.actions.yeast.tsv" \
    --pseq_path="data/protein.yeast.sequences.dictionary.tsv" \
    --vec_path="data/vec5_CTC.txt" \
    --model_type="prompt" \
    --batch_size=128 \
    --epochs=3000 \
    --lr=0.0003
```
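
For orientation, the flags used in this README could be declared with `argparse` roughly as follows. This is a sketch under assumptions: `build_parser` is a hypothetical name, the defaults are taken from the prompt-model values listed in this README, and the real script may define its options differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags shown in this README."""
    p = argparse.ArgumentParser(description="PPI training (illustrative sketch)")
    p.add_argument("--description", type=str, default="random")  # split strategy
    p.add_argument("--ppi_path", type=str, default="data/protein.actions.yeast.tsv")
    p.add_argument("--pseq_path", type=str,
                   default="data/protein.yeast.sequences.dictionary.tsv")
    p.add_argument("--vec_path", type=str, default="data/vec5_CTC.txt")
    p.add_argument("--model_type", type=str, default="prompt")
    p.add_argument("--batch_size", type=int, default=128)
    p.add_argument("--epochs", type=int, default=3000)
    p.add_argument("--lr", type=float, default=0.0003)
    p.add_argument("--hidden", type=int, default=128)
    p.add_argument("--num_token", type=int, default=8)
    return p


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```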

## Model Architectures

### 1. Prompt-Enhanced Model (`InactivePromptBinding`)
- **Architecture**: GIN layers with learnable prompt tokens
- **Key Features**: 
  - Adaptive gate mechanism for token selection
  - Bidirectional GRU for sequence processing
  - Dynamic parameter freezing/unfreezing
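
To illustrate the idea of gated token selection, here is a minimal pure-Python sketch. It is not the code of `InactivePromptBinding`: the function name `gated_prompt`, the sigmoid gate, and the fixed threshold are all assumptions chosen for readability, and a real model would operate on PyTorch tensors with learned gate parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_prompt(tokens, gate_logits, threshold=0.5):
    """Combine prompt tokens using sigmoid gates (illustrative only).

    tokens: list of equal-length float lists, one embedding per token.
    gate_logits: one score per token (learnable in a real model).
    Tokens whose gate value is below `threshold` are dropped; the rest
    are averaged, weighted by their gate values.
    """
    gates = [sigmoid(g) for g in gate_logits]
    kept = [(g, t) for g, t in zip(gates, tokens) if g >= threshold]
    dim = len(tokens[0])
    if not kept:
        # No token passed the gate: return a zero prompt vector.
        return [0.0] * dim, gates
    total = sum(g for g, _ in kept)
    mixed = [sum(g * t[i] for g, t in kept) / total for i in range(dim)]
    return mixed, gates
```

With `tokens = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]` and `gate_logits = [0.0, 0.0, -10.0]`, the third token is gated out and the result is the average of the first two.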

### 2. Standard GNN Model (`GIN_Net2`)
- **Architecture**: Graph Isomorphism Networks
- **Features**: Multi-layer GIN with jumping knowledge connections

### 3. MLP Baseline (`SimpleMLP`)
- **Architecture**: Simple feed-forward network
- **Purpose**: Baseline comparison for graph-based methods

## Training Configuration

### Model-Specific Parameters

**Prompt Model:**
```python
batch_size = 128
epochs = 3000
num_token = 8
hidden = 128
lr = 0.0003
gin_num_layer = 2
th_epoch = 300  # Gate mechanism activation epoch
```

**MLP Model:**
```python
batch_size = 32
hidden = 256
lr = 0.00005
gin_num_layer = 3
th_epoch = 50
```

**GNN Model:**
```python
batch_size = 64
hidden = 256
lr = 0.003
gin_num_layer = 3
th_epoch = 50
```
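
The three presets above can be collected into one lookup table. The sketch below is one possible arrangement; `TrainConfig`, `CONFIGS`, and `get_config` are hypothetical names, and `epochs` and `num_token` are left as `None` where this README does not list a value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    batch_size: int
    hidden: int
    lr: float
    gin_num_layer: int
    th_epoch: int
    epochs: Optional[int] = None     # listed only for the prompt model
    num_token: Optional[int] = None  # listed only for the prompt model


# Values copied from the presets above.
CONFIGS = {
    "prompt": TrainConfig(batch_size=128, hidden=128, lr=0.0003,
                          gin_num_layer=2, th_epoch=300,
                          epochs=3000, num_token=8),
    "mlp": TrainConfig(batch_size=32, hidden=256, lr=0.00005,
                       gin_num_layer=3, th_epoch=50),
    "gnn": TrainConfig(batch_size=64, hidden=256, lr=0.003,
                       gin_num_layer=3, th_epoch=50),
}


def get_config(model_type):
    """Return the preset for model_type, or raise ValueError if unknown."""
    try:
        return CONFIGS[model_type]
    except KeyError:
        raise ValueError(f"unknown model_type: {model_type!r}") from None
```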

## Data Splitting Strategies

- **Random**: Random train/validation/test split
- **BFS**: Breadth-first search based splitting
- **DFS**: Depth-first search based splitting

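Random splitting samples interactions uniformly, while BFS and DFS splitting select held-out interactions by traversing the interaction graph. Comparing results across the three strategies shows how well a model generalizes beyond the interactions seen during training.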
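
A BFS-based split can be sketched as follows. This is an illustration under assumptions, not the repository's splitting code: `bfs_split` is a hypothetical name, and the rule used here (hold out every edge touching a visited protein until the target size is reached) is one plausible reading of "BFS-based splitting".

```python
import random
from collections import defaultdict, deque


def bfs_split(edges, test_size, root=None, seed=0):
    """Pick `test_size` edge indices for the held-out set by BFS.

    edges: list of (u, v) protein pairs.
    Starting from `root` (chosen at random if None), proteins are visited
    breadth-first and each edge touching a visited protein is held out
    until `test_size` edges have been collected.
    Returns (train_indices, held_out_indices).
    """
    incident = defaultdict(list)  # protein -> [(edge index, neighbour)]
    for idx, (u, v) in enumerate(edges):
        incident[u].append((idx, v))
        incident[v].append((idx, u))

    if root is None:
        root = random.Random(seed).choice(sorted(incident))

    held_out, seen_edges = [], set()
    visited, queue = {root}, deque([root])
    while queue and len(held_out) < test_size:
        node = queue.popleft()
        for idx, nbr in incident[node]:
            if idx not in seen_edges:
                seen_edges.add(idx)
                held_out.append(idx)
                if len(held_out) == test_size:
                    break
            if nbr not in visited:
                visited.add(nbr)
                queue.append(nbr)

    held = set(held_out)
    train = [i for i in range(len(edges)) if i not in held]
    return train, held_out
```

If the root's connected component contains fewer than `test_size` edges, the held-out set is smaller than requested. Replacing `queue.popleft()` with `queue.pop()` turns the traversal into a stack-based, DFS-style variant.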

## Project Structure

```
L3_PPI/
├── gnn_data.py                    # Data loading and preprocessing
├── gnn_model.py                   # GNN and MLP model definitions
├── prompt_model.py                # Prompt-enhanced model architectures
├── gnn_prompt_train_binding.py    # Main training script
├── run_prompt_binding.py          # Simplified training interface
├── utils.py                       # Evaluation metrics and utilities
├── inference.py                   # Model inference utilities
├── data/                          # Dataset directory
├── train_valid_index_json/        # Pre-computed data splits
└── save_model/                    # Model checkpoints
```

## Key Components

### Data Processing (`gnn_data.py`)
- **GNN_DATA_Binding**: Main data loader class
- **Features**: Automatic negative sampling, balanced dataset creation
- **Formats**: Supports multiple PPI data formats
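
Negative sampling for a balanced dataset can be illustrated as below. This is a sketch, not the logic of `GNN_DATA_Binding`: `sample_negatives` is a hypothetical helper that assumes negatives are drawn uniformly from protein pairs absent from the positive set.

```python
import random


def sample_negatives(positive_pairs, seed=0, max_tries=10_000):
    """Draw as many non-interacting pairs as there are distinct positives.

    positive_pairs: list of (protein_a, protein_b) tuples.
    Pairs are treated as unordered. Sampling stops early after
    `max_tries` draws, so fewer negatives may be returned on dense graphs.
    """
    rng = random.Random(seed)
    positives = {frozenset(p) for p in positive_pairs}
    proteins = sorted({x for pair in positive_pairs for x in pair})
    negatives = set()
    tries = 0
    while len(negatives) < len(positives) and tries < max_tries:
        tries += 1
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        pair = frozenset((a, b))
        if pair not in positives:
            negatives.add(pair)
    return sorted(tuple(sorted(p)) for p in negatives)
```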

### Training (`gnn_prompt_train_binding.py`)
- **Multi-model support**: Prompt, GNN, MLP, PIPR
- **Advanced training**: Label smoothing, gradient clipping
- **Monitoring**: Comprehensive logging with wandb integration
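
The two training techniques named above reduce to short formulas. The sketch below shows them in plain Python for clarity; the function names are hypothetical, the smoothing factor is an arbitrary example value, and the training script itself presumably relies on PyTorch utilities such as `torch.nn.utils.clip_grad_norm_`.

```python
import math


def smooth_labels(labels, eps=0.1):
    """Binary label smoothing: 1 becomes 1 - eps/2, 0 becomes eps/2."""
    return [y * (1.0 - eps) + 0.5 * eps for y in labels]


def clip_grad_norm(grads, max_norm):
    """Rescale gradients so their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm
```

For example, gradients `[3.0, 4.0]` have norm 5, so clipping to `max_norm=1.0` scales each component by 1/5.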

### Evaluation (`utils.py`)
- **Metrictor_PPI**: Precision, Recall, F1-score calculation
- **GateLoss**: Custom loss function for prompt learning
- **Utilities**: Negative sampling, graph splitting algorithms
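
For reference, the three metrics can be computed from binary labels as shown here. This is a standalone sketch, not the `Metrictor_PPI` implementation; the function name is hypothetical, and undefined ratios are reported as 0.0 by assumption.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```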

## Training Features

### Advanced Training Techniques
- **Gate Mechanism**: Dynamic prompt token selection
- **Parameter Scheduling**: Freeze/unfreeze strategies
- **Early Stopping**: Overfitting prevention
- **Learning Rate Scheduling**: Adaptive learning rate adjustment
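
Early stopping is commonly implemented as a patience counter. The class below is a generic sketch, not the repository's code: the name `EarlyStopping`, the choice of validation loss as the monitored quantity, and the default patience are all assumptions.

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record a validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

A training loop would call `step` once per epoch and break out of the loop when it returns `True`.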

## Monitoring and Logging

Training integrates with Weights & Biases (wandb) to provide:
- Real-time training metrics
- Model performance tracking
- Hyperparameter optimization
- Experiment comparison

## Results and Evaluation

The framework evaluates models on multiple validation sets:
- **BS (Balanced Set)**: Random validation split
- **ES (External Set)**: External validation data
- **NS (Negative Set)**: Negative sample validation
- **Overall**: All validation samples combined

Metrics tracked:
- Precision, Recall, F1-score
- Training/validation loss
- Gate mechanism statistics (for prompt models)
- Data balance diagnostics

## Usage Examples

### 1. Training with Default Settings
```bash
python run_prompt_binding.py
```

### 2. Custom Model Training
```python
# In run_prompt_binding.py, set the model and split strategy:
model_type = "prompt"  # or "gnn", "mlp"
description = "random"  # or "bfs", "dfs"
```

### 3. Hyperparameter Tuning
```bash
python gnn_prompt_train_binding.py \
    --model_type="prompt" \
    --hidden=256 \
    --num_token=16 \
    --lr=0.001 \
    --batch_size=64
```


## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
