# 🥇 HiPhO: High School Physics Olympiad Benchmark

This repository contains the source data and code for the ICML-2026 submission **"HiPhO: How Far Are (M)LLMs from Humans in the Latest High School Physics Olympiad Benchmark?"**.

## 🌐 Introduction

**HiPhO** (High School Physics Olympiad Benchmark) is the **first benchmark** specifically designed to evaluate the physical reasoning abilities of (M)LLMs on **real-world Physics Olympiads from 2024–2025**.

<div align="center">
  <img src="intro/HiPhO_overview.png" alt="hipho overview five rings" width="600"/>
</div>

### ✨ Key Features

1. **Up-to-date Coverage**: Includes 13 Olympiad exam papers from 2024–2025 across international and regional competitions.
2. **Mixed-modal Content**: Supports four modality types, ranging from text-only to diagram-based problems.
3. **Professional Evaluation**: Uses official marking schemes for answer-level and step-level grading.
4. **Human-level Comparison**: Maps model scores to medal levels (Gold/Silver/Bronze) and compares with human performance.

### 📊 Dataset Overview

<div align="center">
  <img src="intro/HiPhO_statistics.png" alt="framework and stats" width="700"/>
</div>

HiPhO contains:
- **13 Physics Olympiads**
- **360 Problems**
- Categorized across:
  - **5 Physics Fields**: Mechanics, Electromagnetism, Thermodynamics, Optics, Modern Physics
  - **4 Modality Types**: Text-Only, Text+Illustration Figure, Text+Variable Figure, Text+Data Figure
  - **6 Answer Types**: Expression, Numerical Value, Multiple Choice, Equation, Open-Ended, Inequality

Evaluation is conducted using:  
- **Answer-level and step-level scoring**, aligned with official marking schemes  
- **Exam score** as the evaluation metric  
- **Medal-based comparison**, using official thresholds for gold, silver, and bronze  
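
The medal-based comparison above can be sketched as a simple threshold lookup. This is a minimal illustration, not the repository's implementation: the function name `medal_for_score` and the cutoff values are hypothetical, and the sketch assumes a score that equals a cutoff earns that medal.

```python
from typing import Optional


def medal_for_score(score: float, thresholds: dict[str, float]) -> Optional[str]:
    """Return the highest medal whose cutoff the score reaches, else None."""
    # Walk the medals from the highest cutoff to the lowest.
    ranked = sorted(thresholds.items(), key=lambda item: item[1], reverse=True)
    for medal, cutoff in ranked:
        if score >= cutoff:  # assumption: meeting the cutoff is enough
            return medal
    return None


# Placeholder cutoffs for illustration only; real exams publish their own.
example_thresholds = {"Gold": 20.0, "Silver": 15.0, "Bronze": 10.0}
print(medal_for_score(17.5, example_thresholds))  # Silver
```

Applying the same lookup to a model's exam score and to the human cutoffs places both on one scale.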

---

## Evaluation

### Directory Structure

```
project-root/
├── code/
│   ├── eval.py                # Main evaluation script
│   └── evaluator.py           # Evaluator core module
├── data/                      # HiPhO dataset files
├── utils/
│   └── verifier.py            # Physics evaluation utility functions
└── README.md                  # This file
```

**Important Note**: Run the evaluation scripts from the `code/` directory so that relative paths resolve correctly.

### Environment Requirements

#### Basic Environment
- Python 3.10

#### VLMEvalKit Dependencies
The evaluation framework depends on evaluation tools provided by [VLMEvalKit](https://github.com/open-compass/VLMEvalKit).

**Installation Steps:**
```bash
# 1. Create virtual environment
python -m venv vlmeval
source vlmeval/bin/activate  

# 2. Clone and install VLMEvalKit
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .

# 3. Install other dependencies
pip install pylatexenc math_verify 
```

### Configuration File

Create a `.env` file in the root directory to set API keys and model parameters:

```bash
# API Configuration
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_API_BASE=http://your-api-server   # Endpoint URL (variable name as read by VLMEvalKit)

# Judge Model Parameters
JUDGE_TIMEOUT=1800          # Request timeout (seconds)
JUDGE_RETRY=10              # Number of retries
JUDGE_MAX_TOKENS=16384      # Maximum token count

# Verifier Model Configuration
VERIFIER_MODEL_NAME=gemini-2.5-flash
VERIFIER_API_KEY=your_verifier_api_key_here
VERIFIER_BASE_URL=http://your-api-server
VERIFIER_MAX_TOKENS=16384
VERIFIER_TEMPERATURE=0.1
```
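
Before launching a long evaluation, it can help to confirm that the file parses as expected. The sketch below is a hypothetical standalone checker, not part of the repository: `parse_env` and `missing_keys` are illustrative names, the parser assumes values never contain `#`, and treating every documented key as required is an assumption.

```python
DOCUMENTED_KEYS = (
    "OPENAI_API_KEY", "JUDGE_TIMEOUT", "JUDGE_RETRY", "JUDGE_MAX_TOKENS",
    "VERIFIER_MODEL_NAME", "VERIFIER_API_KEY", "VERIFIER_BASE_URL",
    "VERIFIER_MAX_TOKENS", "VERIFIER_TEMPERATURE",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop full-line and inline comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


def missing_keys(env: dict[str, str]) -> list[str]:
    """List documented keys that are absent or empty."""
    return [key for key in DOCUMENTED_KEYS if not env.get(key)]


sample = """\
# Judge Model Parameters
JUDGE_TIMEOUT=1800          # Request timeout (seconds)
VERIFIER_TEMPERATURE=0.1
"""
env = parse_env(sample)
print(int(env["JUDGE_TIMEOUT"]), float(env["VERIFIER_TEMPERATURE"]))
print(missing_keys(env))
```

To check a real file, pass its contents, for example `parse_env(open(".env", encoding="utf-8").read())`.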


## Quick Start

### 1. Preparation

```bash
# Activate virtual environment
source /path/to/vlmeval/bin/activate

# Enter code directory
cd /path/to/code

# Ensure .env file is configured (refer to Configuration File section above)
```

### 2. Basic Evaluation Commands

```bash
# Complete evaluation example (recommended)
python eval.py --dataset IPhO_2025 \
               --judge-model gemini-2.5-flash \
               --model-name Gemini-3-Pro \
               --nproc 10 \
               --multi-runs

# Evaluate all available datasets
python eval.py --judge-model gemini-2.5-flash

# Evaluate specific dataset (single run)
python eval.py --dataset IPhO_2025 --judge-model gemini-2.5-flash

# Coarse-grained evaluation only (without Judge model)
python eval.py --dataset IPhO_2025 --no-judge
```
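
To evaluate several datasets in sequence, the commands above can be assembled programmatically. The following is a rough sketch that uses only the flags shown in this README; `build_eval_command` is a hypothetical helper, not something the repository provides.

```python
import shlex
from typing import Optional


def build_eval_command(
    dataset: str,
    judge_model: Optional[str] = None,
    model_name: Optional[str] = None,
    nproc: Optional[int] = None,
    multi_runs: bool = False,
) -> list[str]:
    """Build an eval.py argument list from the documented flags."""
    cmd = ["python", "eval.py", "--dataset", dataset]
    if judge_model is None:
        cmd.append("--no-judge")  # coarse-grained evaluation only
    else:
        cmd += ["--judge-model", judge_model]
    if model_name is not None:
        cmd += ["--model-name", model_name]
    if nproc is not None:
        cmd += ["--nproc", str(nproc)]
    if multi_runs:
        cmd.append("--multi-runs")
    return cmd


cmd = build_eval_command(
    "IPhO_2025",
    judge_model="gemini-2.5-flash",
    model_name="Gemini-3-Pro",
    nproc=10,
    multi_runs=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, cwd="code", check=True)` so that the script runs from the `code/` directory.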

### 3. Common Parameter Descriptions

| Parameter | Description | Example |
|------|------|------|
| `--dataset` | Dataset to evaluate (all available datasets if omitted) | `--dataset IPhO_2025` |
| `--judge-model` | Name of the Judge model | `--judge-model gemini-2.5-flash` |
| `--model-name` | Name of the model being evaluated | `--model-name Gemini-3-Pro` |
| `--multi-runs` | Evaluate the results of multiple runs | `--multi-runs` |
| `--nproc` | Number of parallel processes | `--nproc 10` |
| `--no-judge` | Disable the Judge model (coarse-grained evaluation only) | `--no-judge` |
| `--dry-run` | Dry run mode | `--dry-run` |
| `--verbose` | Verbose output | `--verbose` |

## Data Format Requirements

### Inference Results Directory Structure

Inference results should be organized in the following format:

```
infer_results/
└── ModelName_DatasetName/
    ├── run_01/
    │   └── ModelName_DatasetName_results.json
    ├── run_02/
    │   └── ModelName_DatasetName_results.json
    └── ...
```
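
A quick way to confirm that a results tree matches this layout is to list the files it contains. This is a minimal sketch assuming the directory and file names follow the `ModelName_DatasetName` pattern literally; `find_run_files` is a hypothetical helper.

```python
from pathlib import Path


def find_run_files(root: str, model: str, dataset: str) -> list[Path]:
    """Return the result file of every run_* directory, sorted by path."""
    name = f"{model}_{dataset}"
    base = Path(root) / name
    # Matches e.g. infer_results/<name>/run_01/<name>_results.json
    return sorted(base.glob(f"run_*/{name}_results.json"))


for path in find_run_files("infer_results", "Gemini-3-Pro", "IPhO_2025"):
    print(path)
```

An empty result usually points to a misspelled model or dataset name, or to result files placed outside the `run_XX/` folders.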

### Inference Results JSON Format

Each result file should be a JSON array of records, where each record contains the following fields:

```json
[
    {
        "id": "IPhO_2025_1_A_1",
        "context": "Problem background...",
        "question": "Specific question...",
        "marking": [["Scoring criterion 1", "Scoring criterion 2"]],
        "answer": ["\\boxed{answer}"],
        "answer_type": ["Expression"],
        "unit": [null],
        "points": [0.2],
        "prediction": "Model's predicted response...",
        "modality": "text",
        "field": "Modern Physics",
        "source": "IPhO_2025"
    }
]
```
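
Records can be checked against this format before evaluation. The sketch below is illustrative only: `validate_record` is a hypothetical helper, and the rule that `marking`, `answer`, `answer_type`, `unit`, and `points` are parallel lists of equal length is an assumption inferred from the example above, not a documented requirement.

```python
REQUIRED_FIELDS = (
    "id", "context", "question", "marking", "answer", "answer_type",
    "unit", "points", "prediction", "modality", "field", "source",
)
# Assumption: these list fields hold one entry per sub-answer.
PARALLEL_FIELDS = ("marking", "answer", "answer_type", "unit", "points")


def validate_record(record: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the record looks fine."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    lengths = {
        name: len(record[name])
        for name in PARALLEL_FIELDS
        if isinstance(record.get(name), list)
    }
    if len(set(lengths.values())) > 1:
        problems.append(f"list fields differ in length: {lengths}")
    return problems


record = {
    "id": "IPhO_2025_1_A_1",
    "context": "Problem background...",
    "question": "Specific question...",
    "marking": [["Scoring criterion 1", "Scoring criterion 2"]],
    "answer": ["\\boxed{answer}"],
    "answer_type": ["Expression"],
    "unit": [None],
    "points": [0.2],
    "prediction": "Model's predicted response...",
    "modality": "text",
    "field": "Modern Physics",
    "source": "IPhO_2025",
}
print(validate_record(record))  # []
```

For a whole file, load it with `json.load` and run the check on every element of the array.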

### Evaluation Methods

The system supports two evaluation methods:

#### 1. Fine-grained Evaluation (requires Judge model)
- Step-by-step scoring based on marking criteria
- Supports partial credit assessment
- Requires setting up Judge model and API key

#### 2. Coarse-grained Evaluation (default)
- Exact matching based on final answers
- Uses multiple mathematical verification methods and models for evaluation

The final score is the higher of the scores produced by the two evaluation methods.
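
As a rough sketch of how the two scores might combine, the snippet below takes the larger value for each question and sums the results into an exam score. The function names are hypothetical, and both the per-question combination and the fallback to the coarse-grained score when no Judge model ran are assumptions.

```python
from typing import Optional


def final_score(fine: Optional[float], coarse: float) -> float:
    """Return the higher score; use the coarse score alone if no judge ran."""
    return coarse if fine is None else max(fine, coarse)


def exam_score(pairs: list[tuple[Optional[float], float]]) -> float:
    """Sum the final scores over all questions (assumed aggregation)."""
    return sum(final_score(fine, coarse) for fine, coarse in pairs)


# (fine-grained, coarse-grained) points per question; made-up values
question_scores = [(0.2, 0.0), (0.1, 0.3), (None, 0.5)]
print(exam_score(question_scores))
```

Taking the maximum means a correct final answer keeps full credit even when the step-level grading is stricter, and partial credit survives when the final answer is wrong.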

### Output Results

After evaluation completes, the following files are written to the specified output directory:

- `{dataset}_score.json`: Summary evaluation results
- `{dataset}_detailed_results.json`: Detailed evaluation results
- `{dataset}_detailed.xlsx`: Detailed results in Excel format

For multi-run evaluation, the following additional files are generated:
- `{dataset}_multi_run_statistics.json`: Multiple runs statistics
- `{dataset}_question_statistics.xlsx`: Question-level statistics
- `{dataset}_run_summary.xlsx`: Run summary statistics
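
To confirm that a run produced everything listed above, the expected file names can be generated from the dataset name and checked on disk. This is a minimal sketch; `expected_outputs` and `missing_outputs` are hypothetical helpers, and the `"output"` directory name is a placeholder.

```python
from pathlib import Path

SINGLE_RUN_FILES = (
    "{dataset}_score.json",
    "{dataset}_detailed_results.json",
    "{dataset}_detailed.xlsx",
)
MULTI_RUN_FILES = (
    "{dataset}_multi_run_statistics.json",
    "{dataset}_question_statistics.xlsx",
    "{dataset}_run_summary.xlsx",
)


def expected_outputs(output_dir: str, dataset: str, multi_runs: bool = False) -> list[Path]:
    """Build the paths of the files documented for a dataset."""
    templates = SINGLE_RUN_FILES + (MULTI_RUN_FILES if multi_runs else ())
    return [Path(output_dir) / template.format(dataset=dataset) for template in templates]


def missing_outputs(output_dir: str, dataset: str, multi_runs: bool = False) -> list[Path]:
    """Return the documented files that do not exist on disk."""
    return [p for p in expected_outputs(output_dir, dataset, multi_runs) if not p.exists()]


for path in missing_outputs("output", "IPhO_2025", multi_runs=True):
    print(f"not found: {path}")
```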