# HAIPR: High-throughput Affinity Prediction

HAIPR is a toolkit for protein sequence optimization using ESM language models and genetic algorithms. It combines parameter-efficient fine-tuning (PEFT) with parallel sequence evaluation to enable high-throughput in-silico screening.


## Setup

To get started with HAIPR, please follow these steps:

1. **Set `$DATA_HOME`**  
   Ensure the `$DATA_HOME` environment variable points to a directory with ample free space (at least 1TB recommended). Protein embeddings can consume significant storage, especially when generating them for multiple embedders and benchmark datasets.

2. **Install Clang**  
   Clang is required for building some dependencies.  
   - On Ubuntu: `sudo apt-get install clang`  
   - On macOS: `brew install llvm`


3. **Install [uv](https://github.com/astral-sh/uv) and Project Dependencies**  
   Install `uv` (a fast Python package manager) and use it to install the HAIPR package.
   From the repository root, run:
   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   uv venv
   source .venv/bin/activate
   uv pip install .
   ```
   
4. **Authenticate with Hugging Face**  
   Log in to Hugging Face to enable downloading ESM models. When prompted, provide an access token with read permissions:
   ```bash
   hf auth login
   ```
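
Before running the pipeline, it can help to verify the setup steps above programmatically. The following is a minimal sketch, not part of HAIPR: the function name `check_setup` and the 1 TB threshold constant are assumptions made for illustration, and the check does not cover Hugging Face authentication.

```python
import os
import shutil
from pathlib import Path

# Assumed threshold: 1 TB, mirroring the storage recommendation in step 1.
MIN_FREE_BYTES = 10**12


def check_setup(env=None, min_free_bytes=MIN_FREE_BYTES):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []

    data_home = env.get("DATA_HOME")
    if not data_home:
        problems.append("DATA_HOME is not set")
    elif not Path(data_home).is_dir():
        problems.append(f"DATA_HOME does not point to a directory: {data_home}")
    else:
        free = shutil.disk_usage(data_home).free
        if free < min_free_bytes:
            problems.append(
                f"DATA_HOME has {free / 1e9:.0f} GB free, "
                f"less than the recommended {min_free_bytes / 1e9:.0f} GB"
            )

    # Look up the required tools on PATH, as a shell would.
    for tool in ("clang", "uv"):
        if shutil.which(tool, path=env.get("PATH")) is None:
            problems.append(f"{tool} not found on PATH")

    return problems
```

Calling `check_setup()` with no arguments inspects the current environment; printing the returned list gives a quick checklist of anything still missing.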
   

## Pipeline Overview

The HAIPR pipeline consists of four main components executed sequentially:

1. **Feature Preparation**: Generate protein embeddings using ESM models
2. **Hyperparameter Optimization**: Optimize model parameters using Optuna
3. **Model Training**: Train ML models on prepared features
4. **Sequence Inference**: Generate optimized protein sequences using genetic algorithms

### Pipeline Execution

```bash
python -m haipr.haipr
```

The pipeline can be configured to run specific stages using the `stages` parameter. Available stages are: `features`, `optimize`, `train`, and `inference`.
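
To illustrate how stage selection could behave, here is a small sketch under stated assumptions. It is not HAIPR's actual dispatch code: `select_stages`, `run_pipeline`, and the `handlers` mapping are hypothetical names, and the sketch assumes that stages always run in the fixed pipeline order regardless of the order in which they are requested.

```python
# Stage names as documented above, in pipeline order.
PIPELINE_STAGES = ("features", "optimize", "train", "inference")


def select_stages(requested=None):
    """Validate requested stage names and return them in pipeline order."""
    if requested is None:
        return list(PIPELINE_STAGES)  # assumed default: run everything
    unknown = sorted(set(requested) - set(PIPELINE_STAGES))
    if unknown:
        raise ValueError(
            f"Unknown stage(s): {unknown}; expected a subset of {PIPELINE_STAGES}"
        )
    wanted = set(requested)
    return [stage for stage in PIPELINE_STAGES if stage in wanted]


def run_pipeline(handlers, requested=None):
    """Call the handler for each selected stage, in order, and return the stages that ran."""
    executed = []
    for stage in select_stages(requested):
        handlers[stage]()  # hypothetical zero-argument callable per stage
        executed.append(stage)
    return executed
```

Validating names up front means a typo in a stage name fails immediately rather than after an expensive earlier stage has finished.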

## Core Components

### Data Processing (`data.py`)

The `HAIPRData` class handles protein sequence data preparation and feature generation:

- **Sequence Validation**: Validates input sequences and handles missing data
- **Feature Caching**: Caches computed embeddings to avoid recomputation
- **Data Splitting**: Implements multiple splitting strategies (CV, LOMO, OOD)
- **Embedding Generation**: Computes protein embeddings using ESM models

Key methods:
- `prepare_features()`: Generates and caches protein embeddings
- `generate_splits()`: Creates train/validation splits
- `get_sequences()`: Returns protein sequences
- `get_labels()`: Returns corresponding fitness labels
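
The feature caching idea can be sketched as follows. This is an illustration only and does not reflect how `HAIPRData` stores embeddings: the helper names, the JSON file format, and the directory layout are assumptions, and `embed_fn` stands in for a real ESM embedder.

```python
import hashlib
import json
from pathlib import Path


def cache_path(cache_dir, embedder_name, sequence):
    """Derive a stable file path from the embedder name and the sequence."""
    digest = hashlib.sha256(f"{embedder_name}:{sequence}".encode("utf-8")).hexdigest()
    return Path(cache_dir) / embedder_name / f"{digest}.json"


def cached_embedding(cache_dir, embedder_name, sequence, embed_fn):
    """Return the embedding for `sequence`, computing it only on a cache miss."""
    path = cache_path(cache_dir, embedder_name, sequence)
    if path.exists():
        return json.loads(path.read_text())
    # Cache miss: compute once, then persist for later runs.
    embedding = [float(x) for x in embed_fn(sequence)]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(embedding))
    return embedding
```

Keying the cache on both the embedder name and the sequence keeps embeddings from different models apart, which matters when features are generated for several embedders. A real cache would likely use a binary array format rather than JSON.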

### Training (`train.py`)

The `HAIPRTrainer` class manages model training and evaluation:

- **Multi-Model Support**: Handles ESM, SVR, SVC, and MLP models
- **Parallel Training**: Supports multi-GPU training with PyTorch Lightning
- **Cross-Validation**: Runs training across multiple data splits
- **MLflow Integration**: Logs metrics, models, and artifacts

Key methods:
- `tune()`: Main training entry point
- `run_splits()`: Executes training across data splits
- `evaluate_trial()`: Evaluates specific Optuna trial configurations
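
As a rough picture of training across data splits, the sketch below builds shuffled k-fold splits and aggregates one validation score per split. It is not the `HAIPRTrainer` implementation: `kfold_splits`, `run_over_splits`, and the `train_and_score` callback are hypothetical, and only plain k-fold CV is shown.

```python
import random
import statistics


def kfold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, val_indices) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, val in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


def run_over_splits(splits, train_and_score):
    """Train once per split and summarize the validation scores."""
    scores = [train_and_score(train, val) for train, val in splits]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }
```

Reporting the spread alongside the mean makes it easier to tell whether a difference between two configurations exceeds the split-to-split noise.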

### Hyperparameter Optimization (`optimize.py`)

The `HAIPROptimizer` class manages hyperparameter optimization using Optuna:

- **Study Management**: Creates and manages Optuna studies
- **Storage Backends**: Supports SQLite, MySQL, PostgreSQL, and in-memory storage
- **Trial Evaluation**: Runs training trials with different parameter configurations
- **Cross-Validation**: Supports optimization across multiple test splits

Key methods:
- `optimize()`: Main optimization entry point
- `cv_optimize()`: Cross-validated optimization across splits
- `evaluate_trial()`: Evaluates specific trial configurations
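
Optuna accepts a SQLAlchemy-style URL as its `storage` argument and uses in-memory storage when `storage` is `None`. The helper below sketches how the listed backends could map to such URLs. The function `storage_url`, its parameters, and the default database name are assumptions for illustration; how `HAIPROptimizer` builds its storage settings is not shown in this README.

```python
from urllib.parse import quote_plus


def storage_url(backend, database="haipr_optuna", user=None, password=None,
                host="localhost"):
    """Return an Optuna-compatible storage URL, or None for in-memory storage."""
    if backend == "memory":
        return None  # Optuna keeps the study in memory when storage is None
    if backend == "sqlite":
        return f"sqlite:///{database}.db"  # file relative to the working directory
    if backend in ("mysql", "postgresql"):
        credentials = user or ""
        if user and password:
            # Percent-encode the password so characters like '@' stay unambiguous.
            credentials += f":{quote_plus(password)}"
        if credentials:
            credentials += "@"
        return f"{backend}://{credentials}{host}/{database}"
    raise ValueError(f"Unsupported storage backend: {backend!r}")
```

The result can be passed as `storage=` to `optuna.create_study`, typically together with `study_name` and `load_if_exists=True` so that several workers share one study. MySQL and PostgreSQL additionally require a matching database driver to be installed.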

### Sequence Inference (`inference.py`)

The `HAIPRInference` class generates optimized protein sequences:

- **Genetic Algorithm**: Uses PyGAD for population-based sequence evolution
- **Multi-Stage Evaluation**: Supports filter and scoring evaluators
- **Model Loading**: Automatically loads trained models from MLflow
- **Batch Processing**: Efficiently evaluates sequence populations

Key methods:
- `run()`: Main inference pipeline using genetic algorithms
- `score_sequences()`: Scores provided sequences
- `load_predictors()`: Loads trained models from MLflow
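
The population-based search with a filter stage and a scoring stage can be sketched in plain Python. Note that this is a stand-in written from scratch, not PyGAD and not the `HAIPRInference` code: `evolve`, `mutate`, `crossover`, and all parameter defaults are hypothetical, `score_fn` and `filter_fn` stand in for trained predictors, and the seed sequence is assumed to pass the filter and to have at least two residues.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(sequence, rng, rate=0.05):
    """Substitute each position with a random amino acid with probability `rate`."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in sequence
    )


def crossover(parent_a, parent_b, rng):
    """Single-point crossover between two equal-length sequences."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


def rank_unique(sequences, score_fn, filter_fn, fallback):
    """Drop filtered-out and duplicate sequences, then sort by score, best first."""
    kept = list(dict.fromkeys(s for s in sequences if filter_fn(s))) or [fallback]
    return sorted(kept, key=score_fn, reverse=True)


def evolve(seed_sequence, score_fn, filter_fn=lambda seq: True,
           population_size=20, generations=30, n_elite=2, seed=0):
    """Evolve variants of `seed_sequence`; return (sequence, score) pairs, best first."""
    rng = random.Random(seed)
    population = [seed_sequence] + [
        mutate(seed_sequence, rng, rate=0.2) for _ in range(population_size - 1)
    ]
    for _ in range(generations):
        # Stage 1 (filter) and stage 2 (score) are applied to the whole population.
        ranked = rank_unique(population, score_fn, filter_fn, seed_sequence)
        parents = ranked[: max(2, len(ranked) // 2)]
        # Elitism: the best sequences survive unchanged into the next generation.
        population = ranked[:n_elite]
        while len(population) < population_size:
            child = crossover(rng.choice(parents), rng.choice(parents), rng)
            population.append(mutate(child, rng))
    final = rank_unique(population, score_fn, filter_fn, seed_sequence)
    return [(s, score_fn(s)) for s in final]
```

With a toy objective such as `lambda s: s.count("K")`, the returned list is ordered from best to worst. In a real setting the scoring step would evaluate the whole population in batches with the trained model instead of calling a function per sequence.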

## Configuration

HAIPR uses Hydra for configuration management. Configuration files are located in `conf/`:

- `haipr.yaml`: Main pipeline configuration
- `train.yaml`: Training configuration
- `optimize.yaml`: Optimization configuration
- `inference.yaml`: Inference configuration
- `data.yaml`: Data processing configuration

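## Prerequisites

Before running the examples below, complete the steps in [Setup](#setup) and make sure that:

1. `$DATA_HOME` points to a directory with sufficient storage
2. You are logged in to Hugging Face so that ESM models can be downloaded
3. The BindingGYM dataset has been downloaded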

## Usage Examples

### Complete Pipeline
```bash
python -m haipr.haipr
```

### Training Only
```bash
python -m haipr.train
```

### Optimization Only
```bash
python -m haipr.optimize
```

### Inference Only
```bash
python -m haipr.inference mlflow.experiment_id="YOUR_EXPERIMENT_ID"
```

## Output

The system generates FASTA files with optimized sequences ranked by fitness:

```fasta
>fitness_0.954321_rank_1
MKQLEDKVEELLSKNYHLENEVARLKKLVGER...
>fitness_0.932156_rank_2
MKQLEDKVEELLSKNYHLENEVARLKKLVGET...
```
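
For downstream analysis, the ranked FASTA output can be read back into Python. The parser below is a minimal sketch that assumes headers follow exactly the `>fitness_<value>_rank_<n>` pattern shown above; `parse_ranked_fasta` is a hypothetical helper, not part of HAIPR, and other header variants (for example scientific notation) would need a broader pattern.

```python
import re

# Matches headers of the form ">fitness_0.954321_rank_1".
HEADER_PATTERN = re.compile(
    r"^>fitness_(?P<fitness>[-+]?\d+(?:\.\d+)?)_rank_(?P<rank>\d+)$"
)


def parse_ranked_fasta(text):
    """Parse FASTA text into a list of dicts with rank, fitness, and sequence."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            match = HEADER_PATTERN.match(line)
            if match is None:
                raise ValueError(f"Unexpected header: {line}")
            records.append({
                "rank": int(match["rank"]),
                "fitness": float(match["fitness"]),
                "sequence": "",
            })
        elif records:
            # Sequences may wrap over several lines; join them.
            records[-1]["sequence"] += line
        else:
            raise ValueError("Sequence data found before the first header")
    return sorted(records, key=lambda record: record["rank"])
```

To load a file, pass its contents, for example `parse_ranked_fasta(Path("optimized.fasta").read_text())`, where the file name is a placeholder.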

## License

MIT License

## Citation
----
