# HAP-E: Hessian-Aware Structured Pruning of LLMs for Efficient Inference

[![Python 3.10](https://img.shields.io/badge/python-3.10-blue.svg)](https://www.python.org/downloads/release/python-3100/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.8.0-red.svg)](https://pytorch.org/)
[![Transformers](https://img.shields.io/badge/Transformers-4.52.3-green.svg)](https://huggingface.co/transformers/)

A research implementation of **HAP-E** (Hessian-Aware Structured Pruning of LLMs for Efficient Inference), an adaptive, Hessian-aware structured pruning framework designed to compress large language models (LLMs) to meet user-specified hardware latency targets while maintaining accuracy.

## 🚀 Overview

HAP-E runs entirely **post-training** as an **iterative loop**. Each iteration removes the least important structural blocks, and the loop stops as soon as the measured latency satisfies the predefined constraint.
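The outer loop can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: `prune_until_target`, `measure_latency`, and `prune_step` are hypothetical names, and the toy "model" is just a block count whose latency is proportional to its size.

```python
def prune_until_target(model, measure_latency, prune_step,
                       target_latency, max_iterations=120):
    """Prune repeatedly until the measured latency meets the target.

    All callables are hypothetical stand-ins: `measure_latency(model)`
    returns a latency, `prune_step(model)` returns a smaller model.
    """
    history = [measure_latency(model)]
    for _ in range(max_iterations):
        if history[-1] <= target_latency:
            break  # constraint satisfied, stop pruning
        model = prune_step(model)
        history.append(measure_latency(model))
    return model, history


# Toy stand-in: the "model" is a number of blocks, latency is 0.5 per block.
final, history = prune_until_target(
    model=100,
    measure_latency=lambda n: 0.5 * n,
    prune_step=lambda n: n - 5,
    target_latency=40.0,
)
```

The `max_iterations` cap mirrors the `max_iterations` key in `config.yaml`, so the loop still terminates if the target is unreachable.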

### Core Methodology

The framework operates through **four distinct stages** in each iteration:

1. **🔍 Lightweight Importance Estimation**: Each structural block (attention heads, FFN neurons) is assigned an inexpensive saliency score based on parameter magnitude, providing a quick estimate of importance.

2. **📊 Sensitivity Analysis**: Estimates the tolerance of each layer to perturbations using a recursive Hessian-based approximation that captures both local effects within a layer and propagated effects across the network.

3. **🎯 Candidate Selection and Refinement**: A candidate budget is allocated across layers, considering their sensitivity and variability. These candidates are then refined using exact Optimal Brain Surgeon (OBS) scores computed efficiently from partial Hessian solves.

4. **⚡ Greedy-Consistent Batch Pruning**: Certifies the largest set of blocks that a greedy OBS approach would remove sequentially, then prunes them jointly in a single step. This guarantees equivalence to one-by-one greedy OBS while requiring far fewer weight updates.
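As an illustration of stage 1, the magnitude-based saliency score is the root mean square of the weights in a block. The sketch below uses NumPy rather than the repository's code, and the weight layout (`w_up` of shape `(d_ffn, d_model)`, `w_down` of shape `(d_model, d_ffn)`) and the function names are assumptions made for the example.

```python
import numpy as np


def rms_importance(block_weights):
    """Imp(B_i) = sqrt(mean(w^2)) over all weights of one block."""
    w = np.asarray(block_weights, dtype=np.float64)
    return float(np.sqrt(np.mean(w ** 2)))


def ffn_neuron_importance(w_up, w_down):
    """RMS score per FFN neuron (assumed layout).

    Neuron i owns row i of w_up (d_ffn, d_model) and
    column i of w_down (d_model, d_ffn).
    """
    pooled = np.concatenate([w_up, w_down.T], axis=1)  # (d_ffn, 2 * d_model)
    return np.sqrt(np.mean(pooled ** 2, axis=1))


rng = np.random.default_rng(0)
w_up = rng.standard_normal((8, 4))
w_down = rng.standard_normal((4, 8))
w_up[3] = 0.0          # make neuron 3 carry no weight at all
w_down[:, 3] = 0.0
scores = ffn_neuron_importance(w_up, w_down)
least_important = int(np.argmin(scores))
```

Because the score only needs the weights, it can be recomputed cheaply in every iteration before the more expensive Hessian-based stages run.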

### Key Benefits

By combining coarse-grained heuristics for global ranking with selective, exact OBS for small candidate subsets, HAP-E efficiently concentrates expensive second-order computation where it yields the most benefit. This approach avoids full Hessian recomputation and terminates as soon as the latency target is achieved, resulting in a hardware-aware, scalable pruning algorithm that maintains high accuracy under strict inference budgets.

## ✨ Key Features

- **🔬 Advanced Pruning Algorithms**: Implements both greedy OBS and certified hybrid pruning
- **📊 Hessian-Aware Scoring**: Uses second-order (OBS) information to rank candidate blocks
- **🛡️ Certified Guarantees**: Certifies that each jointly pruned batch matches the blocks a sequential greedy OBS pass would remove
- **⚡ Edge Optimization**: Designed for efficient pruning on resource-constrained devices
- **📈 Comprehensive Evaluation**: Built-in support for standard LLM evaluation benchmarks
- **🔧 Flexible Configuration**: YAML-based configuration system for easy experimentation
- **🧪 Extensive Testing**: Comprehensive test suite with unit and integration tests

## 🔬 HAP-E Algorithm

The HAP-E framework implements Algorithm 1, which iteratively prunes a pre-trained model until it meets the target latency constraint.

### Algorithm Overview

**Input**: Pre-trained model `M`, target latency `Lat_target`, calibration dataset `D_cal`  
**Output**: Pruned model `M_pruned`

The algorithm operates through the following iterative process:

```text
while Lat(M) > Lat_target:
    # 1. Lightweight importance estimation
    Imp(B_i) ← √(1/|W_i| * Σ_{w∈W_i} w²)
    
    # 2. Layer sensitivity estimation (recursive)
    S^(ℓ)→(ℓ+1) ← Tr(((X^(ℓ+1))^T X^(ℓ+1)) + λI)
    S^(ℓ) = S^(ℓ)→(ℓ+1) + βS^(ℓ+1)
    
    # 3. Candidate budget allocation
    CV^(ℓ,τ) ← σ^(ℓ,τ) / μ^(ℓ,τ)
    K^(ℓ,τ) ← min(CK, N^(ℓ,τ)) * (CV^(ℓ,τ) / (S^(ℓ) + ε))
    
    # 4. OBS scoring with partial inverse
    Solve HX = E_Π for candidate panel Π
    G_Π,Π ← (G_{:,Π})^T E_Π
    E(B_s) ← Σ_j W_{S,j}^T (G_{SS})^{-1} W_{S,j}
    Ẽ(B_s) ← S^(ℓ) * E(B_s)
    
    # 5. Certify greedy-consistent batch and prune
    A'_c ← G_{cc} - G_{cJ} G_{JJ}^{-1} G_{Jc}
    E'(c|J) ← ||(A'_c)^{-1/2} W_{c,:}||_F²
    ΔW_R ← -H_{RP} H_{PP}^{-1} W_P; W_{P,:} ← 0
    
    # 6. Incremental Hessian update
    Q ← Π \ P
    G_QQ ← G_QQ - G_{QP} G_{PP}^{-1} G_{PQ}
    
    # 7. Latency update
    Measure Lat(M)
```
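Steps 2 and 3 of the pseudocode above can be made concrete with a short sketch. It is hedged in several ways: the local terms `S^(ℓ)→(ℓ+1)` are taken as given inputs, the sensitivity beyond the last layer is assumed to be 0, `CK` is treated as a single cap constant `c_k`, the population standard deviation is used for `σ`, and the budget is left unrounded because the formula does not specify rounding. All function names are hypothetical.

```python
from statistics import mean, pstdev


def propagate_sensitivity(local_terms, beta):
    """Backward recursion S[l] = local[l] + beta * S[l + 1].

    Assumes the sensitivity past the last layer is 0.
    """
    sens = [0.0] * len(local_terms)
    downstream = 0.0
    for layer in range(len(local_terms) - 1, -1, -1):
        sens[layer] = local_terms[layer] + beta * downstream
        downstream = sens[layer]
    return sens


def coefficient_of_variation(scores):
    """CV = sigma / mu of the importance scores in one (layer, type) group."""
    return pstdev(scores) / mean(scores)


def candidate_budget(cv, sensitivity, n_blocks, c_k, eps=1e-8):
    """K = min(C_K, N) * CV / (S + eps), left as a float."""
    return min(c_k, n_blocks) * cv / (sensitivity + eps)


sens = propagate_sensitivity([4.0, 2.0, 1.0], beta=0.5)
cvs = [coefficient_of_variation(s) for s in
       ([1.0, 2.0, 3.0], [1.0, 1.0, 4.0], [2.0, 2.0, 2.0])]
budgets = [candidate_budget(cv, s, n_blocks=32, c_k=8)
           for cv, s in zip(cvs, sens)]
```

Layers with high sensitivity receive a smaller candidate budget, while groups whose scores vary more (higher CV) receive a larger one; a group with identical scores gets a budget of zero.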

### Mathematical Components

- **Importance Score**: `Imp(B_i) = √(1/|W_i| * Σ_{w∈W_i} w²)` - Root mean square of weights within block
- **Layer Sensitivity**: `S^(ℓ) = S^(ℓ)→(ℓ+1) + βS^(ℓ+1)` - Recursive sensitivity propagation
- **Budget Allocation**: `K^(ℓ,τ) = min(CK, N^(ℓ,τ)) * (CV^(ℓ,τ) / (S^(ℓ) + ε))` - Adaptive budget based on sensitivity
- **OBS Error**: `E(B_s) = Σ_j W_{S,j}^T (G_{SS})^{-1} W_{S,j}` - Optimal Brain Surgeon error calculation
- **Schur Complement**: `A'_c = G_{cc} - G_{cJ} G_{JJ}^{-1} G_{Jc}` - Efficient Hessian updates
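The OBS error and the Schur-complement downdate can be checked numerically on a small random problem. The sketch below is an assumption-laden illustration, not the repository's code: `G` denotes the inverse Hessian `H^{-1}`, weights are stored with pruned units along the rows, and the compensation step uses the textbook OBS update written in terms of `G`, which may differ in notation from step 5 of the pseudocode.

```python
import numpy as np


def obs_block_error(G, W, S):
    """E(B_S) = sum_j W[S, j]^T (G[S, S])^{-1} W[S, j], with G = H^{-1}."""
    S = np.asarray(S)
    W_S = W[S, :]
    return float(np.sum(W_S * np.linalg.solve(G[np.ix_(S, S)], W_S)))


def obs_update(G, W, P):
    """Zero the rows in P and compensate the surviving rows (textbook OBS)."""
    P = np.asarray(P)
    delta = -G[:, P] @ np.linalg.solve(G[np.ix_(P, P)], W[P, :])
    W_new = W + delta
    W_new[P, :] = 0.0  # remove round-off so pruned rows are exactly zero
    return W_new


def downdate_inverse(G, P):
    """G_QQ - G_QP G_PP^{-1} G_PQ for the surviving index set Q."""
    P = np.asarray(P)
    Q = np.setdiff1d(np.arange(G.shape[0]), P)
    G_QP = G[np.ix_(Q, P)]
    G_new = G[np.ix_(Q, Q)] - G_QP @ np.linalg.solve(G[np.ix_(P, P)], G_QP.T)
    return Q, G_new


rng = np.random.default_rng(0)
X = rng.standard_normal((64, 6))          # synthetic calibration activations
H = X.T @ X + 1e-2 * np.eye(6)            # damped Hessian proxy
G = np.linalg.inv(H)
W = rng.standard_normal((6, 4))
P = [1, 4]                                # units to prune

err = obs_block_error(G, W, P)
W_new = obs_update(G, W, P)
Q, G_new = downdate_inverse(G, P)
```

Two identities make this useful: `G_new` equals the inverse of `H` restricted to the surviving units, so no fresh inversion is needed after pruning, and `err` equals the quadratic loss increase `Tr(ΔW^T H ΔW)` of the compensated update.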

## ⏱️ Two-Stage Latency Estimation

HAP-E includes a two-stage learned latency model that predicts runtime without profiling every candidate architecture on the device:

### Stage 1: Module-Level Latency Prediction
- **Feature Vector**: `x^(ℓ) = [S, d_model, h^(ℓ), d_ffn^(ℓ)]` (Equation 14)
- **Separate Regressors**: `f_MHA` and `f_FFN` for Multi-Head Attention and Feed-Forward Network
- **Linear Regression**: Trained on hardware-specific measurements

### Stage 2: Total Model Latency Aggregation
- **Aggregation Formula**: `L̂_tot(A) = α₀ + Σ_{b=1}^B α_b f_τ(b)(x_b)` (Equation 15)
- **Non-additive Effects**: Captures memory allocation and kernel fusion
- **Architecture-Specific**: Tailored for Transformer architectures
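The two stages can be sketched with ordinary least squares. This is an independent illustration on synthetic, noise-free data, not the `LatencyInferenceMetric` implementation: the helper names, the coefficient values, and the feature ranges are all made up for the example.

```python
import numpy as np


def fit_linear(features, targets):
    """Least-squares fit of targets ~ coef[0] + features @ coef[1:]."""
    A = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(A, targets, rcond=None)
    return coef


def predict_linear(coef, features):
    return coef[0] + np.atleast_2d(features) @ coef[1:]


rng = np.random.default_rng(0)


def random_features(n):
    """Synthetic rows of x = [S, d_model, h, d_ffn]."""
    return np.column_stack([
        rng.integers(128, 2049, n), rng.integers(256, 2049, n),
        rng.integers(1, 33, n), rng.integers(512, 8193, n),
    ]).astype(float)


# Stage 1: one regressor per module type, fitted on synthetic "measurements".
true_mha = np.array([0.2, 2e-3, 1e-3, 5e-2, 0.0])   # made-up ground truth
true_ffn = np.array([0.1, 1e-3, 5e-4, 0.0, 4e-4])
x_mha, x_ffn = random_features(40), random_features(40)
regressors = {
    "MHA": fit_linear(x_mha, predict_linear(true_mha, x_mha)),
    "FFN": fit_linear(x_ffn, predict_linear(true_ffn, x_ffn)),
}

# Stage 2: L_tot = alpha_0 + sum_b alpha_b * f_tau(b)(x_b).
block_types = ["MHA", "FFN", "MHA", "FFN"]


def block_predictions(block_features):
    return np.array([predict_linear(regressors[t], x)[0]
                     for t, x in zip(block_types, block_features)])


archs = [random_features(len(block_types)) for _ in range(12)]
per_block = np.stack([block_predictions(a) for a in archs])   # (12, 4)
true_alpha = np.array([0.3, 1.1, 0.9, 1.1, 0.9])              # made-up
measured = true_alpha[0] + per_block @ true_alpha[1:]
alpha = fit_linear(per_block, measured)
predicted_total = alpha[0] + block_predictions(archs[0]) @ alpha[1:]
```

The stage-2 coefficients `alpha` let the total deviate from a plain sum of module latencies, which is how non-additive effects enter the model.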

### Key Benefits
- **Hardware-Aware**: Eliminates need for repeated on-device profiling
- **Efficient**: Fast prediction during pruning iterations
- **Accurate**: Captures variation across sequence length and model width
- **Scalable**: Works with different model sizes and configurations

### Usage Example
```python
from inference.latency import LatencyInferenceMetric, ModuleLatencyData

# Create latency metric
latency_metric = LatencyInferenceMetric(target_speedup=2.0)

# Add calibration data
module_data = ModuleLatencyData(
    module_type="MHA",
    sequence_length=2048,
    d_model=1024,
    h=16,
    d_ffn=0,
    latency_ms=5.2
)
latency_metric.add_calibration_data(module_data)

# Train the model
latency_metric.train_latency_model()

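# Predict latency for a pruned architecture.
# NOTE: `model_layers` and `layer_configs` are not defined in this snippet;
# build them for your model before calling this method.
predicted_latency = latency_metric.compute_pruned_inference(model_layers, layer_configs)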
```

## 📋 Requirements

- Python 3.10.0
- CUDA-compatible GPU (recommended)
- 16GB+ RAM (for larger models)
- 50GB+ disk space (for model cache and datasets)

## 🛠️ Installation

### Quick Setup (Recommended)

```bash
# Clone the repository
git clone <repository-url>
cd <repository-name>

# Note: This is an anonymized version for ICLR submission
# Replace paths and tokens as needed for your setup

# Run ONE of the two setup scripts.
# Option 1: quick setup
./quick_setup.sh
# Option 2: advanced setup (use instead of the quick setup)
# ./setup_halpe_env.sh

# Activate the environment
source activate_halpe.sh
```

### Manual Setup

```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"

# Create virtual environment
uv venv halpe_uv --python 3.10.0

# Activate environment
source halpe_uv/bin/activate

# Install dependencies (environments created by `uv venv` do not include pip by default,
# so install through uv)
uv pip install -r requirements_halpe.txt
```

### Alternative: Using Conda

```bash
# Create conda environment
conda create -n halpe python=3.10.0
conda activate halpe

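# Install PyTorch with CUDA support
# Note: the official `pytorch` conda channel stopped receiving new releases after PyTorch 2.5,
# so this command may not provide the 2.8.0 version shown in the badge above.
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia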

# Install other dependencies
pip install -r requirements_halpe.txt
```

## 🚀 Quick Start

### Basic Usage

1. **Configure your experiment** by editing `config.yaml`:
```yaml
# Example configuration
model_name_or_path: "TinyLlama/TinyLlama_v1.1"
dataset_name: "allenai/c4"
inference_speedup: 0.0
num_blocks_to_prune: 0.05
max_iterations: 120
```

2. **Run pruning and evaluation**:
```bash
# Activate environment
source activate_halpe.sh

# Configure your HuggingFace token (required for model access)
# Edit main.py and uncomment the login line with your token

# Run the main pipeline
python main.py config.yaml
```

### Advanced Usage

```bash
# Run with custom configuration
python main.py --config custom_config.yaml

# Run specific components
python -m prune.prune --help
python -m inference.inference --help
```

## 📁 Project Structure

```
ICLR_submission/
├── main.py                          # Main execution script
├── config.yaml                      # Default configuration
├── requirements_halpe.txt           # Python dependencies
├── activate_halpe.sh               # Environment activation script
├── quick_setup.sh                  # Quick setup script
├── setup_halpe_env.sh              # Detailed setup script
│
├── prune/                          # Core pruning algorithms
│   ├── halpe.py                    # Main HAP-E implementation
│   ├── hybrid_obs_pruner_certified.py  # Certified OBS pruner
│   ├── layer_prune.py              # Layer-specific pruning logic
│   ├── hessian_inverse.py          # Hessian computation utilities
│   └── utils.py                    # Pruning utilities
│
├── inference/                      # Inference and evaluation
│   ├── inference.py                # Model inference
│   ├── latency.py                  # Two-stage learned latency model
│   └── sparsity.py                 # Sparsity analysis
│
├── delay_model/                    # Delay modeling components
│   ├── delay_model.py              # Core delay model
│   ├── latency_estimator.py        # Latency estimation
│   └── look_up_table.py            # Lookup tables
│
├── utils/                          # Utility modules
│   ├── argument_parser.py          # Command-line argument parsing
│   ├── dataset_utils.py            # Dataset handling
│   ├── model_utils.py              # Model utilities
│   ├── pruning_utils.py            # Pruning configuration
│   └── logs.py                     # Logging utilities
│
├── test/                           # Test suite
│   ├── test_halpe.py               # HAP-E algorithm tests
│   ├── test_obs_hybrid_pruner_certified.py  # Certified pruner tests
│   └── test_obs_vs_certify.py     # Algorithm comparison tests
│
└── calibration_dataset/            # Calibration data
    └── hybrid_calib_256x2000.pt   # Pre-computed calibration dataset
```

## ⚙️ Configuration

The framework uses YAML configuration files. Key parameters:

### Model Configuration
```yaml
model_name_or_path: "TinyLlama/TinyLlama_v1.1"  # HuggingFace model
model_type: "causal_lm"                          # Model architecture
torch_dtype: "auto"                              # Data type
```

### Pruning Configuration
```yaml
inference_speedup: 0.0                           # Target speedup
num_blocks_to_prune: 0.05                       # Fraction of blocks to prune
num_candidate_blocks: 0.2                       # Fraction of candidate blocks
max_iterations: 120                             # Maximum pruning iterations
alpha: 1.0                                      # Sensitivity scaling factor
```

### Dataset Configuration
```yaml
dataset_name: "allenai/c4"                      # Calibration dataset
num_samples: 256                                # Number of calibration samples
max_seq_length: 2048                            # Maximum sequence length
batch_size: 4                                   # Batch size for calibration
```

## 📊 Evaluation

The framework includes comprehensive evaluation using standard benchmarks:

- **BoolQ**: Boolean question answering
- **PIQA**: Physical interaction question answering
- **HellaSwag**: Commonsense reasoning
- **WinoGrande**: Pronoun resolution
- **ARC**: AI2 reasoning challenge
- **OpenBookQA**: Open-book question answering

Results are automatically saved and logged for analysis.

## 🧪 Testing

Run the comprehensive test suite:

```bash
# Run all tests
pytest test/

# Run specific test modules
pytest test/test_halpe.py
pytest test/test_obs_hybrid_pruner_certified.py
pytest test/test_obs_vs_certify.py

# Run with verbose output
pytest -v test/
```

## 🔧 Troubleshooting

### Common Issues

1. **CUDA Out of Memory**:
   - Reduce `batch_size` in config
   - Enable `use_chunking: true`
   - Reduce `chunk_size`

2. **Model Loading Issues**:
   - Check HuggingFace token authentication
   - Verify model name and availability
   - Ensure sufficient disk space for cache

3. **Calibration Data Issues**:
   - Verify dataset accessibility
   - Check internet connection for dataset download
   - Ensure sufficient disk space

### Debug Mode

For reproducible runs that are easier to debug, fix the random seed and reduce the number of workers:
```yaml
# In config.yaml
seed: 1000
num_workers: 1  # Reduce for debugging
```


## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🙏 Acknowledgments

- HuggingFace for the transformers library
- PyTorch team for the deep learning framework
- The open-source community for various utilities and tools

---

**Note**: This is a research implementation of HAP-E. For production use, additional testing and optimization may be required.
