# HLE - High-Level Evaluation Framework

This repository contains tools for evaluating Large Language Models (LLMs) using both zero-shot and Best-of-N (BoN) sampling approaches.

## Overview

The framework tests LLM performance on expert-level questions in Biology/Medicine, Mathematics, Computer Science/AI, and Chemistry. It has two main components:

1. **bon.py** - Implements Best-of-N sampling
2. **zs.py** - Zero-shot evaluation tool for baseline comparisons

## Directory Structure

```
hle/
├── README.md                 # This file
├── LICENSE                   # MIT License
├── HLE_appendix.pdf         # Appendix
├── code/                    # Implementation directory
│   ├── bon.py              # Best-of-N sampling implementation
│   ├── zs.py               # Zero-shot evaluation tool
│   ├── optillm.py          # Integration wrapper
│   └── optillm/            # Library
└── data/                    # Dataset directory
    ├── hle_100.json        # Test dataset with 100+ expert questions
    └── 66edc256d0ce7f9082f8d744.mp4 # Supporting media files
```

## Files and Components

### Core Implementation Files

#### bon.py
A Best-of-N sampling implementation that serves as a drop-in replacement for the optillm library. Located in both the root and the `code/` directory. Features:
- **Anthropic Claude models** - Full support for the Claude API
- **OpenAI models** - Including special handling for gpt-4o-search-preview
- **Google Gemini models** - Support for the Gemini API, including flash-thinking variants
- **Other models** - Integration with the optillm library for additional model support
- Automatic candidate generation and rating
- Rate limiting and retry logic with exponential backoff
- Result persistence to JSON for analysis
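
The candidate generation, rating, and retry behavior listed above can be illustrated with a minimal sketch. This is not the code in bon.py: `best_of_n`, `with_backoff`, and all of their parameters are hypothetical names chosen for illustration, and the generator and rater are plain callables standing in for real API calls.

```python
import random
import time
from typing import Callable, List, Tuple


def with_backoff(call: Callable[[], str],
                 max_attempts: int = 4,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry `call`, doubling the wait after each failure and adding a little jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))


def best_of_n(generate: Callable[[], str],
              rate: Callable[[str], float],
              n: int = 8) -> Tuple[str, List[Tuple[str, float]]]:
    """Draw n candidates, score each one, and return the top-scoring candidate."""
    if n < 1:
        raise ValueError("n must be at least 1")
    scored = []
    for _ in range(n):
        candidate = with_backoff(generate)  # each draw is retried independently
        scored.append((candidate, rate(candidate)))
    best = max(scored, key=lambda pair: pair[1])[0]
    return best, scored


if __name__ == "__main__":
    # Stub generator and rater (assumptions): real code would call a model here.
    answers = iter(["short", "a longer answer", "mid-size"])
    best, scored = best_of_n(lambda: next(answers), rate=len, n=3)
    print(best, scored)
```

Passing the generator and rater as callables keeps the selection logic independent of any particular provider, which is one plausible way to support several model families behind one interface.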

#### zs.py
Zero-shot evaluation tool for baseline performance metrics. Located in both the root and the `code/` directory. Features:
- Support for multiple LLM APIs (Gemini, Claude)
- JSON-based test case management
- Performance timing and error tracking
- Batch processing of test cases
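
As a rough illustration of batch processing with timing and error tracking, the sketch below runs a list of test cases through a single model call each. It is not taken from zs.py: `run_zero_shot`, the `ask` callable, and the `id` and `query` key names are assumptions made for the example.

```python
import time
from typing import Callable, Dict, List


def run_zero_shot(cases: List[Dict], ask: Callable[[str], str]) -> List[Dict]:
    """Send each query once, recording the response, any error, and elapsed seconds."""
    results = []
    for case in cases:
        record = {"id": case.get("id"), "response": None, "error": None}
        start = time.perf_counter()
        try:
            record["response"] = ask(case["query"])
        except Exception as exc:
            # Keep going: one failed case should not abort the whole batch.
            record["error"] = f"{type(exc).__name__}: {exc}"
        record["seconds"] = time.perf_counter() - start
        results.append(record)
    return results
```

Recording errors per case instead of raising lets a long run finish and be inspected afterwards.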

### Data Files

#### hle_100.json
Test dataset containing 100+ challenging expert-level questions across:
- Biology/Medicine
- Mathematics
- Computer Science/AI
- Chemistry

Each entry includes an ID, category, query, expected answer, and optional system prompt.
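
A quick structural check before a run can catch malformed entries early. The sketch below assumes the file is a JSON array of objects and that the fields are stored under the keys `id`, `category`, `query`, `expected_answer`, and `system_prompt`; these key names are guesses based on the field list above, so adjust them to match the actual file.

```python
import json
from typing import Dict, List

# Assumed key names (not confirmed by the dataset): edit to match hle_100.json.
REQUIRED_KEYS = ("id", "category", "query", "expected_answer")
OPTIONAL_KEYS = ("system_prompt",)


def missing_keys(entry: Dict) -> List[str]:
    """Return the required keys that are absent from one entry."""
    return [key for key in REQUIRED_KEYS if key not in entry]


def load_cases(text: str) -> List[Dict]:
    """Parse a JSON array of entries and fail fast on the first malformed one."""
    cases = json.loads(text)
    if not isinstance(cases, list):
        raise ValueError("expected a JSON array of entries")
    for index, entry in enumerate(cases):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index} is not a JSON object")
        missing = missing_keys(entry)
        if missing:
            raise ValueError(f"entry {index} is missing: {', '.join(missing)}")
    return cases
```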

### Supporting Files

#### code/optillm/
The original optillm library, used to support additional models that bon.py does not implement directly.

#### HLE_appendix.pdf
Research paper with comprehensive analysis of Best-of-N vs zero-shot approaches

#### LICENSE
MIT License for open-source usage

## Installation

```bash
# Install required dependencies
pip install anthropic google-generativeai requests
```

## Usage

### Using bon.py as an optillm replacement

```python
import os

from bon import best_of_n_sampling
from anthropic import Anthropic

# Read the key from the environment rather than hardcoding it
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
result = best_of_n_sampling(
    system_prompt="You are a helpful assistant",
    initial_query="Explain quantum computing",
    client=client,
    model="claude-3-5-sonnet-20240620",
    n=8
)
```

### Running zero-shot evaluation

```bash
python zs.py --model gemini-1.5-pro-002 --input_file hle_100.json --output_file results.json
```

## Environment Variables

Set the following environment variables for API access:
- `ANTHROPIC_API_KEY` - For Claude models
- `OPENAI_API_KEY` - For OpenAI models
- `GEMINI_API_KEY` - For Google Gemini models
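
To fail early when a key is missing, a small pre-flight check along these lines may help. The variable names come from the list above; the provider labels (`claude`, `openai`, `gemini`) and the helper `missing_api_keys` are hypothetical and only serve the example.

```python
import os
from typing import Iterable, List, Mapping

# Provider labels are illustrative; the variable names are the ones listed above.
KEY_FOR_PROVIDER = {
    "claude": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def missing_api_keys(providers: Iterable[str],
                     env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the variables that are unset or empty for the requested providers."""
    return [
        KEY_FOR_PROVIDER[name]
        for name in providers
        if not env.get(KEY_FOR_PROVIDER[name])
    ]


if __name__ == "__main__":
    missing = missing_api_keys(["claude", "gemini"])
    if missing:
        raise SystemExit(f"Set these variables first: {', '.join(missing)}")
```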

## Output

The BoN implementation saves detailed results to `output/bon_results/`, including:
- All generated candidates
- Individual ratings and explanations
- Final selection with justification
- Timing information
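
For later analysis, the saved results can be loaded back into Python. The sketch below assumes each run is stored as a separate `.json` file directly inside the results directory; the file naming and the internal structure of each file are not specified here, so the helper (`load_results`, a hypothetical name) only parses the files and returns them keyed by filename.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_results(directory: str = "output/bon_results") -> Dict[str, Any]:
    """Parse every .json file in the directory; return {} if the directory is absent."""
    root = Path(directory)
    if not root.is_dir():
        return {}
    results = {}
    for path in sorted(root.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            results[path.name] = json.load(handle)
    return results


if __name__ == "__main__":
    for name, payload in load_results().items():
        print(name, type(payload).__name__)
```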

## Appendix

For detailed methodology, experimental results, and analysis of Best-of-N sampling compared to zero-shot prompting, see the accompanying research paper: **[HLE_appendix.pdf](HLE_appendix.pdf)**

The appendix includes:
- Detailed experimental setup and methodology
- Performance comparisons across different model families
- Statistical analysis of BoN effectiveness
- Domain-specific performance breakdowns
- Cost-benefit analysis of different N values
- Case studies of challenging questions where BoN significantly outperforms zero-shot

## License

MIT License - See LICENSE file for details