# SERPANT: Sequential E-value based Ranking with Pairwise ANalysis and Transitivity

<div align="center">

**A Statistical Framework for Large Language Model Ranking Inference**

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)

</div>

## 📋 Table of Contents

- [Introduction](#introduction)
- [Key Features](#key-features)
- [Installation](#installation)
- [Quick Start](#quick-start)
  - [1. Simulation Mode](#1-simulation-mode)
  - [2. Real Data Mode](#2-real-data-mode)
- [User Guide](#user-guide)
- [Configuration](#configuration)
- [Project Structure](#project-structure)
- [Examples](#examples)
- [FAQ](#faq)
- [Performance and Resource Requirements](#performance-and-resource-requirements)
- [Acknowledgments](#acknowledgments)

---

## 🎯 Introduction

SERPANT is a sequential statistical testing framework based on e-values, designed specifically for ranking inference of Large Language Models (LLMs). The framework efficiently infers partial order relationships between models through pairwise comparisons and transitivity propagation, while controlling the Family-Wise Error Rate (FWER).
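
To make the e-value idea concrete, here is a minimal sketch of a sequential test for a single pair of models. It is not SERPANT's actual e-value (see `core/e_value.py` for that); it assumes a simple betting construction in which `bet` is a fixed, hypothetical tuning parameter. Under the null hypothesis that model A wins with probability at most 0.5, the running wealth is a nonnegative supermartingale, so by Ville's inequality it reaches `1 / alpha` with probability at most `alpha`, no matter when sampling stops.

```python
import random


def sequential_pairwise_test(outcomes, alpha=0.1, bet=0.5):
    """Test H0: P(A beats B) <= 0.5 with a betting e-process.

    outcomes: iterable of 1 (A wins) or 0 (B wins).
    bet: fixed bet size in [0, 1); an illustrative choice, not a SERPANT default.
    Returns the 1-based step at which H0 is rejected, or None if never.
    """
    wealth = 1.0
    threshold = 1.0 / alpha
    for t, y in enumerate(outcomes, start=1):
        # Multiply by 1 + bet on a win and by 1 - bet on a loss.
        # Under H0 the expected multiplier is at most 1.
        wealth *= 1.0 + bet * (2 * y - 1)
        if wealth >= threshold:
            return t
    return None


# Usage: simulate a pair where A truly wins 70% of the time.
rng = random.Random(0)
wins = [1 if rng.random() < 0.7 else 0 for _ in range(500)]
print(sequential_pairwise_test(wins, alpha=0.1))
```

This covers one hypothesis only. Controlling FWER over all pairs at once needs an additional multiplicity adjustment, which the sketch does not attempt.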

### Main Use Cases

- 🏆 **LLM Ranking**: Compare multiple LLMs and establish partial order relationships
- 🔝 **Top-K Model Discovery**: Identify the top-K best performing models
- 📊 **Experimental Analysis**: Evaluate the impact of different algorithm parameters on FWER and Power
- 🧪 **Methodological Research**: Support covariate-assisted methods and various sampling strategies

---

## ✨ Key Features

### Algorithm Features
- ✅ **FWER Control**: Rigorous control of Family-Wise Error Rate
- ✅ **Sequential Sampling**: Support for `all_active`, `random_pair`, `tournament` sampling strategies
- ✅ **Transitivity Propagation**: Automatic application of transitivity rules to accelerate inference
- ✅ **Top-K Confidence Sets**: Construction of confidence sets for Top-K models
- ✅ **Covariate-Assisted**: Leverage covariate information to improve testing efficiency
- ✅ **Checkpoint Recovery**: Resume experiments from checkpoints after interruption
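
The transitivity feature can be illustrated with a short sketch. It assumes the established relationships are stored as a boolean matrix where `better[i][j]` means "model i is significantly better than model j"; the function name and the matrix layout are illustrative and may differ from what `core/transitivity.py` does. The closure is computed with Warshall's algorithm.

```python
def propagate_transitivity(better):
    """Return the transitive closure of a boolean 'i beats j' matrix.

    If i beats k and k beats j, then i beats j is added without
    spending any further comparisons on the pair (i, j).
    """
    m = len(better)
    closed = [row[:] for row in better]  # copy so the input is untouched
    for k in range(m):
        for i in range(m):
            if closed[i][k]:
                for j in range(m):
                    if closed[k][j]:
                        closed[i][j] = True
    return closed


# Usage: 0 beats 1 and 1 beats 2, so 0 beats 2 follows.
better = [
    [False, True, False],
    [False, False, True],
    [False, False, False],
]
closed = propagate_transitivity(better)
print(closed[0][2])
```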

### Mode Support
- 🎲 **Simulation Mode**: For algorithm research and method comparison
- 🚀 **Real Data Mode**: Support for real LLM comparison experiments
  - Hugging Face model loading
  - Multiple dataset support (MMLU, TriviaQA, custom datasets)
  - Flexible judging methods (GPT-4 judge, heuristic rules)

---

## 🔧 Installation

### Requirements

- Python 3.8+
- CUDA 11.0+ (for GPU acceleration, optional)

### Installation Steps

1. **Clone the repository**
2. **Install dependencies**
```bash
pip install -r requirements.txt
```

3. **Verify installation**
```bash
python quick_test.py
```

---

## 🚀 Quick Start

SERPANT provides two operational modes: **Simulation Mode** and **Real Data Mode**.

### 1. Simulation Mode

Simulation mode is used for algorithm research, parameter tuning, and method comparison.

#### 1.1 Basic Experiments

**Running a single SERPANT algorithm:**

```python
from core import serpant_algorithm
from simulation import generate_true_probs
import numpy as np

# Set parameters
m = 10          # Number of models
alpha = 0.1     # FWER level
max_t = 8000    # Maximum time steps

# Generate true probability matrix
np.random.seed(123)
true_probs_info = generate_true_probs(m, sd=1.0)

# Run algorithm
result = serpant_algorithm(
    m=m, 
    alpha=alpha, 
    true_probs=true_probs_info['probs'], 
    max_t=max_t,
    sampling_method="tournament",
    verbose=True
)

print(f"Final rejected hypotheses: {result['final_rejected'].sum()}")
```
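
The internals of `generate_true_probs` are not shown in this README. As a rough mental model only, the sketch below builds a pairwise win-probability matrix under a Bradley-Terry assumption: each model gets a latent score drawn from a normal distribution with standard deviation `sd`, and model i beats model j with probability given by the logistic function of the score difference. The function name `bradley_terry_probs` and its return format are hypothetical.

```python
import math
import random


def bradley_terry_probs(m, sd=1.0, seed=None):
    """Return (scores, probs) for m simulated models.

    probs[i][j] is the probability that model i beats model j.
    A larger sd spreads the scores out, which makes models easier to separate.
    """
    rng = random.Random(seed)
    scores = [rng.gauss(0.0, sd) for _ in range(m)]
    probs = [[0.5] * m for _ in range(m)]  # diagonal stays at 0.5
    for i in range(m):
        for j in range(m):
            if i != j:
                diff = scores[i] - scores[j]
                probs[i][j] = 1.0 / (1.0 + math.exp(-diff))
    return scores, probs


scores, probs = bradley_terry_probs(m=4, sd=1.0, seed=123)
print(probs[0][1] + probs[1][0])  # complementary probabilities sum to 1
```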

#### 1.2 Parallel Experiment Evaluation

**FWER and Power Evaluation:**

```bash
python main.py --mode simulation --experiment fwer_power
```

Or in Python:

```python
from main import experiment_fwer_power_comparison

# Run FWER and Power comparison experiment
results = experiment_fwer_power_comparison()
```

**Top-K FWER Evaluation:**

```python
from main import main

# Run Top-k FWER evaluation
main()
```

#### 1.3 Covariate-Assisted Experiments

```python
from main import experiment_original_vs_covariate_sd_x

# Compare original algorithm vs covariate-assisted algorithm
results = experiment_original_vs_covariate_sd_x()
```

### 2. Real Data Mode

Real data mode runs actual LLM ranking tasks: it loads models from Hugging Face and compares them pairwise.

#### 2.1 Running Locally

**Basic Usage:**

```bash
# Quick test (using default configuration)
python main.py --mode real

# Run with configuration file
python main.py --mode real --config real_config_example.yaml

# Resume from checkpoint
python main.py --mode real --config your_config.yaml --checkpoint checkpoint.csv
```

#### 2.2 Running on Google Colab

**Step 1: Mount Google Drive and prepare environment**

```python
# In Colab notebook
from google.colab import drive
drive.mount('/content/drive')

# Navigate to project directory
%cd /content/drive/MyDrive/your_project_path/serpant_python/

# Install dependencies
!pip install -r requirements.txt
```

**Step 2: Log in to Hugging Face (if using gated models)**

```python
from huggingface_hub import login
login()  # Will prompt for your HF token
```

**Step 3: Set API key (if using GPT-4 judge)**

```python
import os
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
```

**Step 4: Run experiments**

```python
# Run MMLU dataset experiment (Colab notebook cell)
!python main.py --mode real --config colab_config_mmlu.yaml

# Run text generation task experiment
!python main.py --mode real --config colab_config_template_generation.yaml
```

**Step 5: Resume from checkpoint (if interrupted)**

```python
# Continue from where you left off (Colab notebook cell)
!python main.py --mode real --config colab_config_mmlu.yaml \
  --checkpoint mmlu_results/interactions_20251227_072321.csv
```

#### 2.3 Configuration Examples

**MMLU Dataset Configuration (colab_config_mmlu.yaml):**

```yaml
# Algorithm parameters
alpha: 0.1
max_t: 5000
sampling_method: "random_pair"
top_k: 3  # Optional: discover Top-3 models

# Data source
questions:
  type: "huggingface_dataset"
  dataset_name: "cais/mmlu"
  dataset_config: "all"
  split: "test"
  format: "mmlu"
  num_samples: 100  # Use 100 questions

# Model list
models:
  - name: "Qwen/Qwen2.5-1.5B-Instruct"
    provider: "huggingface"
  - name: "meta-llama/Llama-3.2-1B-Instruct"
    provider: "huggingface"
  - name: "google/gemma-2-2b-it"
    provider: "huggingface"
  # ... add more models

# Judge configuration
judge:
  type: "correctness"  # For MMLU, judge based on ground truth

# Output configuration
output:
  dir: "mmlu_results"
  save_interactions: true
  save_dag: true

# System configuration
device: "cuda"  # or "cpu"
batch_size: 1
hf_cache_dir: "/content/drive/MyDrive/hf_models/"  # Colab cache directory
```

**Custom Questions Configuration:**

```yaml
# Algorithm parameters
alpha: 0.1
max_t: 1000
sampling_method: "tournament"

# Use custom question list
questions:
  type: "list"
  items:
    - "Explain what machine learning is."
    - "Implement quick sort in Python."
    - "Compare the differences between TCP and UDP."

# Model list
models:
  - name: "Qwen/Qwen2.5-1.5B-Instruct"
    provider: "huggingface"
  - name: "meta-llama/Llama-3.2-1B-Instruct"
    provider: "huggingface"

# Use GPT-4 judge
judge:
  type: "gpt4"
  model: "gpt-4"
  api_key_env: "OPENAI_API_KEY"

output:
  dir: "custom_results"
```

---

## 📖 User Guide

### Simulation Mode Experiments

In `main.py`, we provide multiple pre-defined experiment functions:

| Experiment Function | Description | Purpose |
|---------|------|------|
| `main()` | Top-k FWER evaluation | Evaluate FWER of Top-k confidence sets |
| `experiment_fwer_power_comparison()` | FWER and Power comparison | Compare different sd and sampling methods |
| `example_single_run()` | Single algorithm run | Quick algorithm testing |
| `experiment_with_covariates()` | Covariate method comparison | Compare 4 methods (original and covariate-assisted, each crossed with sampling method) |
| `experiment_covariate_sd_x_comparison()` | Covariate sd_x effect | Evaluate impact of covariate variance |
| `experiment_original_vs_covariate_sd_x()` | Original vs Covariate | Comprehensive comparison of two algorithms |
| `experiment_tournament_priority_modes()` | Priority mode comparison | Compare different priority calculation modes |
| `experiment_tournament_uncertainty_weights()` | Weight optimization | Optimize uncertainty weight parameters |

### Real Data Mode Use Cases

#### Scenario 1: Comparing Multiple LLMs

```bash
# 1. Prepare configuration file my_llm_comparison.yaml
# 2. Run experiment
python main.py --mode real --config my_llm_comparison.yaml
```

**Output:**
- `Partial order matrix`: Which models significantly outperform others
- `Partial order DAG`: Visualized partial order relationships
- `Interaction history CSV`: All comparison records, supports checkpoint recovery
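
A partial order matrix usually contains many edges that are implied by others, which clutters a drawing. The sketch below shows one common way to pick the edges worth drawing: keep only the "covering" edges (a transitive reduction). It assumes a transitively closed boolean matrix with a `False` diagonal; the function name is illustrative and the plotting code in `visualization/plots.py` may work differently.

```python
def covering_edges(closed):
    """Return edges (i, j) of a closed 'i beats j' matrix with no model in between.

    An edge i -> j is dropped when some k satisfies i -> k and k -> j,
    because the path through k already implies it.
    """
    m = len(closed)
    edges = []
    for i in range(m):
        for j in range(m):
            if not closed[i][j]:
                continue
            implied = any(closed[i][k] and closed[k][j] for k in range(m))
            if not implied:
                edges.append((i, j))
    return edges


# Usage: a fully ordered chain 0 > 1 > 2 needs only two drawn edges.
closed = [
    [False, True, True],
    [False, False, True],
    [False, False, False],
]
print(covering_edges(closed))
```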

#### Scenario 2: Discovering Top-K Best Models

```yaml
# Set in configuration file
top_k: 5  # Discover Top-5 models
sampling_method: "tournament"  # Recommended for Top-K
```
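
To see how a Top-K set can be read off from established pairwise results, consider this minimal sketch. It assumes a boolean matrix where `better[i][j]` means model i was found significantly better than model j, and it rules a model out of the Top-K as soon as at least K other models beat it. This is one simple construction; the one in `core/confidence_sets.py` may differ.

```python
def top_k_candidates(better, k):
    """Return indices of models that cannot yet be excluded from the Top-K.

    A model with k or more significantly better competitors cannot be
    among the k best, so it is dropped. Everything else stays a candidate.
    """
    m = len(better)
    candidates = []
    for j in range(m):
        beaten_by = sum(1 for i in range(m) if i != j and better[i][j])
        if beaten_by < k:
            candidates.append(j)
    return candidates


# Usage: only two results are known so far (0 beats 3, 1 beats 3).
better = [[False] * 4 for _ in range(4)]
better[0][3] = True
better[1][3] = True
print(top_k_candidates(better, k=2))
```

If every recorded relationship is correct, the true Top-K models are never excluded, so the returned set inherits the error guarantee of the underlying comparisons. The set shrinks toward exactly K models as more relationships are established.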

#### Scenario 3: Using Different Datasets

**Supported datasets:**
- ✅ MMLU (multiple choice)
- ✅ TriviaQA (Q&A)
- ✅ Any Hugging Face dataset
- ✅ Custom CSV files
- ✅ Custom question lists

**Dataset configuration examples:**

```yaml
# Use exactly one `questions` block per configuration file.

# MMLU
questions:
  type: "huggingface_dataset"
  dataset_name: "cais/mmlu"
  format: "mmlu"

# TriviaQA
# questions:
#   type: "huggingface_dataset"
#   dataset_name: "trivia_qa"
#   format: "triviaqa"

# Custom CSV
# questions:
#   type: "csv"
#   path: "my_questions.csv"
#   question_column: "question"
```

#### Scenario 4: Custom Judge Methods

```yaml
judge:
  # Choose one method; a YAML mapping cannot repeat the `type` key.

  # Method 1: Correctness-based (for datasets with ground truth)
  type: "correctness"

  # Method 2: GPT-4 judge (for open-ended questions)
  # type: "gpt4"
  # model: "gpt-4"
  # system_prompt: "You are a fair judge..."

  # Method 3: Heuristic rules (quick testing)
  # type: "heuristic"
```

---

## ⚙️ Configuration

### Complete Configuration Example

```yaml
# ========== Algorithm Parameters ==========
alpha: 0.1                    # FWER level
max_t: 5000                   # Maximum time steps
sampling_method: "random_pair" # Sampling method: all_active, random_pair, tournament
max_tournament_samples: 800   # Maximum samples for tournament
top_k: null                   # Top-k value, null means no Top-k computation

# ========== Data Source Configuration ==========
questions:
  # Option 1: Hugging Face dataset
  type: "huggingface_dataset"
  dataset_name: "cais/mmlu"
  dataset_config: "all"
  split: "test"
  format: "mmlu"              # mmlu, triviaqa, custom
  num_samples: 100            # Number of questions to use
  
  # Option 2: CSV file
  # type: "csv"
  # path: "questions.csv"
  # question_column: "question"
  
  # Option 3: Question list
  # type: "list"
  # items:
  #   - "Question 1"
  #   - "Question 2"

# ========== Model Configuration ==========
models:
  - name: "Qwen/Qwen2.5-1.5B-Instruct"
    provider: "huggingface"   # huggingface, openai, anthropic, stub
    device_map: "auto"
    load_in_4bit: false       # 4-bit quantization
    
  - name: "meta-llama/Llama-3.2-1B-Instruct"
    provider: "huggingface"
    device_map: "auto"

# ========== Judge Configuration ==========
judge:
  type: "correctness"         # correctness, gpt4, heuristic
  # type: "gpt4"
  # model: "gpt-4"
  # api_key_env: "OPENAI_API_KEY"

# ========== Output Configuration ==========
output:
  dir: "results"              # Output directory
  save_interactions: true     # Save interaction history (for checkpoint recovery)
  save_dag: true              # Save partial order DAG
  save_results: true          # Save final results

# ========== System Configuration ==========
device: "cuda"                # cuda, cpu
batch_size: 1
verbose: true
random_seed: 123
hf_cache_dir: "./hf_models/"  # Hugging Face model cache directory
```
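
Before launching a long run, it can help to sanity-check the configuration values. The sketch below is a hypothetical pre-flight check, independent of `real_world/config.py`. It operates on a plain dict, such as the one `yaml.safe_load` returns if PyYAML is installed. The allowed sampling methods are taken from the comments in the example above; the remaining rules are reasonable assumptions rather than documented constraints.

```python
VALID_SAMPLING = {"all_active", "random_pair", "tournament"}


def check_config(cfg):
    """Return a list of human-readable problems found in a config dict."""
    problems = []

    alpha = cfg.get("alpha")
    if not isinstance(alpha, (int, float)) or not 0 < alpha < 1:
        problems.append("alpha must be a number strictly between 0 and 1")

    max_t = cfg.get("max_t")
    if not isinstance(max_t, int) or max_t <= 0:
        problems.append("max_t must be a positive integer")

    if cfg.get("sampling_method") not in VALID_SAMPLING:
        problems.append("sampling_method must be one of " + ", ".join(sorted(VALID_SAMPLING)))

    models = cfg.get("models") or []
    if len(models) < 2:
        problems.append("at least two models are needed for a pairwise comparison")

    top_k = cfg.get("top_k")  # null in YAML becomes None
    if top_k is not None and not (isinstance(top_k, int) and 1 <= top_k <= len(models)):
        problems.append("top_k must be null or an integer between 1 and the number of models")

    return problems


cfg = {
    "alpha": 0.1,
    "max_t": 5000,
    "sampling_method": "random_pair",
    "top_k": None,
    "models": [{"name": "model-a"}, {"name": "model-b"}],
}
print(check_config(cfg))
```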

---

## 📁 Project Structure

```
serpant_python/
├── core/                     # Core algorithm implementation
│   ├── algorithm.py          # SERPANT main algorithm (with covariate-assisted)
│   ├── e_value.py           # E-value computation
│   ├── transitivity.py      # Transitivity propagation
│   └── confidence_sets.py   # Confidence set construction
│
├── simulation/              # Simulation experiment module
│   ├── data_generator.py   # Data generator
│   ├── parallel_runner.py  # Parallel experiment runner
│   └── evaluator.py        # Result evaluator
│
├── real_world/              # Real data mode
│   ├── config.py           # Configuration management
│   ├── environment.py      # Environment setup
│   ├── model_clients.py    # Model clients (HF, OpenAI, etc.)
│   ├── judge.py            # Judge
│   ├── questions.py        # Question data sources
│   └── runner.py           # Real experiment runner
│
├── visualization/           # Visualization module
│   └── plots.py            # Plotting functions
│
├── utils/                   # Utility functions
│   └── helpers.py          # Helper functions
│
├── main.py                  # Main execution script
├── quick_test.py           # Quick test
├── requirements.txt        # Dependencies
│
├── colab_config_*.yaml     # Colab configuration examples
├── real_config_example.yaml # Local run configuration example
│
└── docs/                    # Documentation
    ├── QUICK_REFERENCE.md
    ├── REAL_MODE_GUIDE.md
    └── COLAB_GUIDE.md
```

---

## 🧪 Examples

### Example 1: Evaluating Different Sampling Methods

```python
from simulation import compare_methods_and_sd_fwer_power
from visualization import plot_fwer_power_comparison_grid

# Run comparison experiment
results = compare_methods_and_sd_fwer_power(
    m=20,
    alpha=0.1,
    num_simulations=1000,
    max_t=8000,
    sd_values=[0.5, 1, 2],
    sampling_methods=["random_pair", "tournament"],
    max_tournament_samples=800,
    random_seed=123
)

# Plot comparison
fig = plot_fwer_power_comparison_grid(
    results,
    alpha=0.1,
    save_path="figures/methods_comparison.png"
)

print(results['summary_stats'])
```

### Example 2: Running MMLU Experiment on Colab

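```python
# Colab Notebook

# 1. Install dependencies
!pip install -r requirements.txt

# 2. Log in to Hugging Face (needed for gated models)
from huggingface_hub import login
login()  # Enter your HF token

# 3. Run experiment
!python main.py --mode real --config colab_config_mmlu.yaml

# 4. View results (read_csv does not expand wildcards, so resolve the path first)
import glob
import pandas as pd
interaction_files = sorted(glob.glob("mmlu_results/interactions_*.csv"))
results = pd.read_csv(interaction_files[-1])  # last file by timestamped name
print(results.head())

# 5. Visualize partial order DAG
from IPython.display import Image, display
dag_files = sorted(glob.glob("mmlu_results/partial_order_dag_*.png"))
display(Image(filename=dag_files[-1]))
```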

### Example 3: Custom Model Comparison

```python
from real_world.config import RealModeConfig, ModelConfig, JudgeConfig
from real_world import run_real_mode_experiment

# Create configuration
config = RealModeConfig(
    alpha=0.1,
    max_t=500,
    sampling_method="tournament",
    questions=[
        "Explain the basic principles of deep learning.",
        "Implement binary search in Python.",
        "Compare SQL and NoSQL databases."
    ],
    models=[
        ModelConfig(name="Qwen/Qwen2.5-1.5B-Instruct", provider="huggingface"),
        ModelConfig(name="meta-llama/Llama-3.2-1B-Instruct", provider="huggingface"),
        ModelConfig(name="google/gemma-2-2b-it", provider="huggingface"),
    ],
    judge=JudgeConfig(type="gpt4", model="gpt-4"),
    output={"dir": "my_comparison_results"}
)

# Run experiment
results = run_real_mode_experiment(config)
print(f"Experiment completed! Output directory: {results['output_dir']}")
```

---

## ❓ FAQ

### Q1: How to choose the right sampling method?

**Recommendations:**
- `all_active`: Suitable for a small number of models (m<10)
- `random_pair`: Suitable for general scenarios, balancing efficiency and accuracy
- `tournament`: Suitable for Top-k discovery and large-scale comparisons (m>10)
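
The cost difference behind these recommendations can be sketched in a few lines. The code below assumes, based only on the method names, that `all_active` compares every still-undecided pair at each step while `random_pair` compares a single undecided pair chosen uniformly at random. Both readings are assumptions, and `tournament` is left out because its priority rule is not described here.

```python
import random


def active_pairs(m, decided):
    """List the unordered pairs (i, j), i < j, that are still undecided."""
    return [(i, j) for i in range(m) for j in range(i + 1, m) if (i, j) not in decided]


def pairs_to_sample(m, decided, method, rng=random):
    """Return the pairs to compare at the next step under the assumed semantics."""
    active = active_pairs(m, decided)
    if not active:
        return []
    if method == "all_active":
        return active  # up to m * (m - 1) / 2 comparisons per step
    if method == "random_pair":
        return [rng.choice(active)]  # exactly one comparison per step
    raise ValueError(f"unsupported method in this sketch: {method}")


# Usage: with 10 models and nothing decided, all_active needs 45 comparisons per step.
print(len(pairs_to_sample(10, set(), "all_active")))
print(len(pairs_to_sample(10, set(), "random_pair")))
```

Under these assumptions the per-step cost of `all_active` grows quadratically in m, which is why it fits small model sets best.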

### Q2: How to resume after interruption?

Use the `--checkpoint` parameter:

```bash
python main.py --mode real --config your_config.yaml \
  --checkpoint results/interactions_20250119_123456.csv
```

The system will automatically resume from where it left off.
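
If you do not remember which checkpoint file is the newest, a small helper can find it. The sketch below assumes only the `interactions_<YYYYMMDD>_<HHMMSS>.csv` naming seen in the examples; the helper name `latest_checkpoint` is hypothetical and not part of SERPANT.

```python
import glob
import os


def latest_checkpoint(output_dir):
    """Return the path of the newest interactions_*.csv in output_dir, or None."""
    pattern = os.path.join(output_dir, "interactions_*.csv")
    # Timestamps written as YYYYMMDD_HHMMSS sort chronologically as plain text.
    candidates = sorted(glob.glob(pattern))
    return candidates[-1] if candidates else None


print(latest_checkpoint("results"))
```

The returned path can then be passed to `--checkpoint`.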

### Q3: What to do if running out of memory on Colab?

**Solutions:**
1. Reduce the number of models
2. Use 4-bit quantization for model loading
3. Reduce the number of simultaneously loaded models (sequential loading)
4. Upgrade to Colab Pro for more memory

```yaml
models:
  - name: "Qwen/Qwen2.5-1.5B-Instruct"
    provider: "huggingface"
    load_in_4bit: true  # Enable 4-bit quantization
```

### Q4: How to understand FWER and Power?

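- **FWER (Family-Wise Error Rate)**: Probability of making at least one false discovery, i.e. declaring at least one ordering between two models that is not true
  - Goal: keep FWER at or below the α level (e.g., 0.1)
  - An empirical FWER ≤ α across simulations indicates that the error control is working
  
- **Power**: Proportion of true relationships that the algorithm correctly discovers
  - Higher is better
  - Power increases with sample size and effect size (how far apart the models truly are)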
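
The two definitions above translate directly into code. The sketch below assumes each simulation run is summarized as a set of rejected `(i, j)` pairs meaning "model i declared better than model j", and that the true orderings are known, as they are in simulation mode. The function name and data layout are illustrative, not the interface of `simulation/evaluator.py`.

```python
def empirical_fwer_and_power(runs, true_pairs):
    """Estimate FWER and power from repeated simulation runs.

    runs: list of sets of declared (i, j) pairs, one set per run.
    true_pairs: set of (i, j) pairs where model i truly beats model j.
    """
    if not runs or not true_pairs:
        raise ValueError("need at least one run and one true relationship")
    # FWER: share of runs containing at least one false declaration.
    runs_with_error = sum(1 for declared in runs if declared - true_pairs)
    fwer = runs_with_error / len(runs)
    # Power: share of true relationships discovered, averaged over runs.
    discovered = sum(len(declared & true_pairs) for declared in runs)
    power = discovered / (len(runs) * len(true_pairs))
    return fwer, power


true_pairs = {(0, 1), (0, 2), (1, 2)}
runs = [
    {(0, 1), (0, 2)},          # 2 correct, no error
    {(0, 1), (2, 1)},          # 1 correct, 1 false declaration
    set(),                     # nothing declared
    {(0, 1), (0, 2), (1, 2)},  # everything found
]
print(empirical_fwer_and_power(runs, true_pairs))  # → (0.25, 0.5)
```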

### Q5: What judge methods are supported?

1. **Correctness**: Based on ground truth (suitable for datasets like MMLU)
2. **GPT-4**: Use GPT-4 for judging (suitable for open-ended questions)
3. **Heuristic**: Simple heuristic rules (quick testing)
4. **Custom**: Customizable judge functions

### Q6: How to add new models or datasets?

**Adding new models:**

Add a model entry to the config file:

```yaml
models:
  - name: "your-org/your-model"
    provider: "huggingface"
    device_map: "auto"
```

**Adding new datasets:**

```yaml
questions:
  type: "huggingface_dataset"
  dataset_name: "your-dataset"
  split: "test"
  format: "custom"  # Need to implement format parsing in questions.py
```

---

## 📊 Performance and Resource Requirements

### Simulation Mode

- **CPU**: Multi-core parallel support (automatically uses cpu_count()-1 cores)
- **Memory**: Depends on number of models m and simulation runs
  - m=10, 1000 simulations: ~2GB
  - m=50, 1000 simulations: ~8GB

### Real Data Mode

| Model Size | GPU Memory | Notes |
|---------|------------|------|
| 1-3B | 6-8GB | Suitable for Colab Free |
| 7B | 16GB | Requires Colab Pro or 4-bit quantization |
| 13B+ | 24GB+ | Recommended to use A100 or multi-GPU |

**Optimization Tips:**
- Set `load_in_4bit: true` in the model configuration to reduce memory usage
- Set `device_map: "auto"` for automatic device allocation
- Set batch size to 1 to save memory
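
The effect of quantization on the table above can be checked with back-of-the-envelope arithmetic. The sketch below estimates memory for the weights alone, assuming 16 bits per parameter for half precision and 4 bits for quantized loading. Activations, the KV cache, and quantization overhead are deliberately ignored, so real usage will be higher.

```python
def weight_memory_gb(num_params_billion, bits_per_param):
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params_billion * bytes_per_param


print(weight_memory_gb(7, 16))  # → 14.0
print(weight_memory_gb(7, 4))   # → 3.5
```

A 7B model therefore needs roughly 14 GB for half-precision weights, which matches the 16GB row once overhead is added, and about a quarter of that when loaded in 4-bit.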

---

## 🙏 Acknowledgments

Thanks to all contributors and supporters!

Special thanks to:
- Hugging Face for providing models and datasets
- OpenAI for API services
- All open-source community contributions

---

