# Conformal Prediction for LLM Query Routing

This repository contains implementations of conformal prediction frameworks for efficiently routing queries between small and large language models across multiple benchmarks.

## Overview

The project evaluates performance, cost, and safety tradeoffs using three main benchmarks:
- **MMLU**: Massive Multitask Language Understanding (multiple-choice questions)
- **TruthfulQA**: Evaluation of model truthfulness
- **PKU-SafeRLHF**: Assessment of safety-helpfulness balance

## Requirements

- Python 3.8+
- Third-party packages: numpy, matplotlib, tqdm
- Standard-library modules used (no installation needed): json, os, random, datetime
- OpenAI API key (set as environment variable `OPENAI_API_KEY`) for running new experiments
  - Note: Existing result directories can be analyzed without an API key
  - For PKU-SafeRLHF, an API key is required even when using existing result directories due to dataset loading requirements
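
Before starting a run that calls the API, it can help to confirm that the key is visible to Python. The snippet below is a minimal sketch, not part of the repository; the helper name `has_openai_key` is hypothetical, and only the variable name `OPENAI_API_KEY` comes from the requirements above.

```python
import os


def has_openai_key(env=None):
    """Return True if OPENAI_API_KEY is set to a non-empty value.

    `env` defaults to the process environment; passing a dict makes the
    check easy to exercise without touching real environment variables.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if __name__ == "__main__":
    if has_openai_key():
        print("OPENAI_API_KEY is set.")
    else:
        print("OPENAI_API_KEY is not set; new experiments will not be able to call the API.")
```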

## Running the Experiments

### MMLU Evaluation

```bash
python mmlu_main.py
```

This script evaluates the MMLU dataset using conformal risk control to route queries between small and large language models. Results will be saved in the `mmlu_results/` directory.
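
To illustrate the general idea, here is a minimal sketch of conformal risk control applied to routing. It is not the code in `mmlu_main.py`. It assumes each calibration query has a small-model confidence score in [0, 1], that the loss is 1 when the small model answers incorrectly and 0 otherwise, and that routed queries incur no loss. The function names `calibrate_threshold` and `route` are hypothetical.

```python
import numpy as np


def calibrate_threshold(confidence, small_correct, alpha, grid=None):
    """Return the smallest threshold whose CRC-adjusted risk is <= alpha.

    A query stays with the small model when confidence >= threshold and is
    routed to the large model otherwise. The loss is bounded by B = 1 and is
    non-increasing in the threshold, which is what the CRC bound requires.
    """
    confidence = np.asarray(confidence, dtype=float)
    small_correct = np.asarray(small_correct, dtype=bool)
    n = len(confidence)
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    for lam in np.sort(grid):  # ascending: prefer keeping more queries on the small model
        kept = confidence >= lam
        empirical_risk = np.mean(kept & ~small_correct)
        adjusted_risk = (n * empirical_risk + 1.0) / (n + 1)  # finite-sample correction, B = 1
        if adjusted_risk <= alpha:
            return float(lam)
    # No grid point satisfies the bound: route every query to the large model.
    return float("inf")


def route(confidence, threshold):
    """Boolean mask: True where the small model answers, False where the query is routed."""
    return np.asarray(confidence, dtype=float) >= threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    conf = rng.uniform(size=500)              # synthetic confidence scores
    correct = rng.uniform(size=500) < conf    # higher confidence, more often correct
    lam = calibrate_threshold(conf, correct, alpha=0.1)
    print("threshold:", lam, "fraction kept on small model:", route(conf, lam).mean())
```

Choosing the smallest admissible threshold keeps as many queries as possible on the cheaper model while the adjusted risk stays at or below `alpha`.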

### TruthfulQA Evaluation

Basic evaluation:
```bash
python truthfulqa_main.py
```

Variant using GPT-4.1-mini as the small model:
```bash
python truthfulqa_mini.py
```

Variant using direct confidence scores:
```bash
python truthfulqa_base_scores.py
```

Results will be saved in various subdirectories under `truthqa_results/`.

### PKU-SafeRLHF Evaluation

```bash
python pku_saferlhf_main.py
```

This script evaluates the PKU-SafeRLHF dataset with metrics for both safety and helpfulness. Results will be saved in the `saferlhf_results/` directory.

## Visualization

To generate plots for the experimental results:

For general plotting:
```bash
python plotting.py
```

For PKU-SafeRLHF specific plots:
```bash
python pku_saferlhf_plotting.py
```

For plots including unrestricted hybrid models:
```bash
python unrestricted_plotting.py
```

## Results

The experiments generate various JSON files with detailed results:
- Calibration data: `calibration_data_standard_trial_*.json`
- Data splits: `data_split_trial_*.json`
- Detailed results: `detailed_trial_*_results.json`
- Scored examples: `scored_examples_trial_*.json`
- Final results: `*_final_results_*_trials.json`
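
The files above can be inspected with a few lines of Python. The following is a rough sketch: the helper `load_trial_files` is hypothetical, and because the internal schema of the JSON files is not documented here, it only reports each file's top-level type and keys. It searches subdirectories as well, since some results are written below the top-level results directory.

```python
import glob
import json
import os


def load_trial_files(results_dir, pattern="detailed_trial_*_results.json"):
    """Return {relative path: parsed JSON} for files under results_dir matching pattern."""
    loaded = {}
    search = os.path.join(results_dir, "**", pattern)
    for path in sorted(glob.glob(search, recursive=True)):
        with open(path, "r", encoding="utf-8") as fh:
            loaded[os.path.relpath(path, results_dir)] = json.load(fh)
    return loaded


if __name__ == "__main__":
    # "mmlu_results" is one of the output directories named in this README.
    for name, payload in load_trial_files("mmlu_results").items():
        if isinstance(payload, dict):
            print(name, "-> dict with keys", sorted(payload))
        else:
            print(name, "->", type(payload).__name__)
```

Passing a different `pattern`, such as `scored_examples_trial_*.json`, selects one of the other file types in the list.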

Visualization outputs include:
- Cost vs. accuracy plots
- Enhanced performance comparisons
- Loss vs. accuracy tradeoffs
- Lambda vs. alpha relationships