# MATH-BEYOND: A Benchmark for RL to Expand Beyond the Base Model

This repository contains the code and evaluation scripts for the MATH-Beyond (MATH-B) benchmark, designed to test whether reinforcement learning (RL) methods can push mathematical reasoning beyond what base models achieve even with large sampling budgets.

## Dataset

The main benchmark dataset is provided as `union_dataset_comprehensive.parquet`, containing **181 carefully selected mathematical problems** that are challenging for current open-source models. Each problem includes:

- **Problem statement and solution**: High-school-level mathematics drawn from the DAPO-Math-17K and DeepScaleR datasets
- **GPT responses**: Reference solutions from the GPT-5-mini and o4-mini-high models
- **Difficulty ratings**: Professional difficulty assessments
- **Model performance flags**: True/False indicators showing whether each evaluated model solves the problem at pass@1024

### Evaluated Models

The benchmark is constructed using two sets of models to ensure comprehensive coverage:

**Base Models**: This set is used to define the most challenging subset of our benchmark. It includes: Qwen2.5-1.5B, Qwen2.5-7B, Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen3-4B-Base, Qwen3-8B-Base, DeepSeek-R1-Distill-Qwen2.5-1.5B, DeepSeek-R1-Distill-Qwen2.5-7B, OLMo-7B, OLMo-2-7B, and Llama-3.1-8B.

**Supplementary Models**: This group is combined with the base models to define the full benchmark. It includes: Qwen2.5-1.5B-Instruct, Qwen2.5-7B-Instruct, Qwen2.5-Math-1.5B-Instruct, Qwen2.5-Math-7B-Instruct, Qwen3-4B, Qwen3-8B, DeepScaleR-1.5B, Nemotron-Research-Reasoning-Qwen-1.5B (v1 and v2), and Skywork-OR1-7B.

The problems in this benchmark were specifically selected because they remain challenging even when these models are given large sampling budgets (pass@1024), making them ideal for testing RL methods that aim to discover new reasoning capabilities rather than just sharpening existing ones.

## Overview

MATH-B addresses a critical limitation in the current RL fine-tuning landscape: existing benchmarks like MATH-500 and AIME 2024 can be largely solved by base models with sufficient sampling (e.g., pass@1024). Our benchmark is deliberately constructed to defeat common open-source models of up to 8B parameters even under large sampling budgets, requiring RL methods that learn genuinely new reasoning capabilities.

## Key Features

- **Challenging Problems**: Selected from subsets of the DAPO-Math-17K and DeepScaleR datasets
- **High School Math**: Problems remain topically equivalent to standard high-school mathematics
- **RL-Focused**: Designed to require exploration-driven approaches rather than just sharpening existing solution modes
- **Comprehensive Evaluation**: Tools for evaluating pass@k metrics with large sampling budgets
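
For readers unfamiliar with pass@k, the sketch below shows the widely used unbiased estimator, which computes the probability that at least one of `k` samples drawn from `n` generated samples (of which `c` are correct) solves the problem. Whether the scripts in this repository use this exact estimator is an assumption; when `k` equals `n` (e.g., pass@1024 with 1024 samples), it reduces to checking whether any sample is correct.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 - (number of all-incorrect subsets) / (number of all subsets)
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A problem stays in the "unsolved" regime targeted by MATH-B only when `c` is 0 for the full budget, in which case the estimate is 0 for every `k`.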

## Repository Structure

```
├── data/                           # Additional dataset files (optional)
├── outputs/                        # Generated outputs and results
├── prompts/                        # Prompt templates for classification
│   ├── prompt_domain_classification.txt
│   └── prompt_difficulty_classification.txt
├── classify_domain_difficulty.py   # Domain and difficulty classification script
├── classify_domain_difficulty.sh   # Shell wrapper for classification
├── fetch_responses.py              # GPT response fetching script
├── fetch_responses_final_data_o4_mini_high.sh  # Shell wrapper for response fetching
├── evaluate_progressive_batched.py # Progressive evaluation for large datasets
├── evaluate_progressive_batched.sh # Shell wrapper for evaluation
├── union_dataset_comprehensive.parquet  # Main MATH-B dataset (181 problems)
├── requirements.txt                # Python dependencies
├── .gitignore                      # Git ignore rules
└── README.md                       # This file
```

## Quick Start

### Prerequisites

1. Python 3.8+
2. Virtual environment setup:
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

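3. Install dependencies, either from the bundled `requirements.txt` or by naming the packages directly:
```bash
pip install -r requirements.txt
# or
pip install pandas numpy tqdm requests tenacity datasets transformers torch vllm
```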

4. Set up your API keys:
```bash
export OPENROUTER_API_KEY="your_api_key_here"
# or
export OPENAI_API_KEY="your_api_key_here"
```
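
As a minimal sketch of how a script might pick up one of these keys, the helper below checks both variables and reports which one it found. The function name `resolve_api_key` and the precedence (OpenRouter first) are assumptions for illustration, not the behavior of the repository's scripts.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_api_key(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (variable_name, key) for the first non-empty API key found."""
    env = os.environ if env is None else env
    # Assumed precedence: OpenRouter first, then OpenAI.
    for name in ("OPENROUTER_API_KEY", "OPENAI_API_KEY"):
        value = env.get(name, "").strip()
        if value:
            return name, value
    raise RuntimeError("Set OPENROUTER_API_KEY or OPENAI_API_KEY before running.")
```

Failing early with a clear message avoids discovering a missing key only after a long job has started.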

### Dataset Preparation

The main MATH-B benchmark dataset is provided as `union_dataset_comprehensive.parquet` in the root directory. This file contains 181 challenging mathematical problems with the following structure:

- `prompt`: The problem statement (may be a nested structure)
- `reward_model`: Contains ground truth answers
- `extra_info`: Additional metadata including difficulty ratings
- Model performance columns: True/False flags indicating whether each evaluated model solves the problem at pass@1024

For additional datasets, place your files in the `data/` directory following the same Parquet format.
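
The following sketch shows one way to summarize the model performance flags once the file is loaded into row dictionaries (for example via `pandas.read_parquet("union_dataset_comprehensive.parquet").to_dict("records")`). The column names `model_a` and `model_b` are hypothetical placeholders; substitute the actual flag columns found in the file.

```python
from typing import Dict, List, Mapping, Sequence


def unsolved_per_model(
    rows: Sequence[Mapping[str, object]], model_columns: Sequence[str]
) -> Dict[str, int]:
    """Count, for each model column, the problems whose solve flag is False."""
    return {col: sum(1 for row in rows if not row[col]) for col in model_columns}


def unsolved_by_all(
    rows: Sequence[Mapping[str, object]], model_columns: Sequence[str]
) -> List[int]:
    """Return the row indices of problems that none of the listed models solved."""
    return [
        i for i, row in enumerate(rows)
        if not any(row[col] for col in model_columns)
    ]


# Toy rows with hypothetical flag columns, standing in for the real dataset.
rows = [
    {"prompt": "p0", "model_a": False, "model_b": False},
    {"prompt": "p1", "model_a": True, "model_b": False},
    {"prompt": "p2", "model_a": False, "model_b": True},
]
counts = unsolved_per_model(rows, ["model_a", "model_b"])
hardest = unsolved_by_all(rows, ["model_a", "model_b"])
```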

### Running Evaluations

#### 1. Progressive Evaluation (Recommended)

For large-scale evaluation that efficiently finds challenging problems:

```bash
# Edit the script to set your model path and parameters
./evaluate_progressive_batched.sh <job_id>
```

Key parameters to configure:
- `MODEL_PATH`: HuggingFace model ID (e.g., "meta-llama/Llama-3.1-8B")
- `DATASET_PATH`: Already configured to use `union_dataset_comprehensive.parquet`
- `TARGET_SAMPLES`: Maximum samples to generate (default: 1024)
- `BATCH_SIZE`: Problems processed simultaneously (default: 64)

#### 2. GPT Response Collection

To collect GPT model responses for comparison:

```bash
# Configure the script with your model and dataset
./fetch_responses_final_data_o4_mini_high.sh
```

#### 3. Problem Classification

To classify problems by domain and difficulty:

```bash
# Run domain/difficulty classification
./classify_domain_difficulty.sh
```

## Configuration

### Model Settings

Edit the shell scripts to configure:
- **Model Path**: HuggingFace model identifier or local path
- **Generation Parameters**: Temperature (0.6), top_p (0.95), max_tokens
- **Hardware**: Tensor parallelism size, GPU memory utilization
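
To make these settings concrete, the sketch below collects them into two keyword dictionaries that could be passed to vLLM as `LLM(**engine_kwargs)` and `SamplingParams(**sampling_kwargs)`. The helper name `build_vllm_kwargs` is hypothetical, and the values for `max_tokens`, `gpu_memory_utilization`, and `n` are placeholders; only the temperature and top_p values come from the settings listed above.

```python
from typing import Any, Dict, Tuple


def build_vllm_kwargs(
    model_path: str,
    tensor_parallel_size: int = 1,
    gpu_memory_utilization: float = 0.9,  # placeholder value
    max_tokens: int = 4096,               # placeholder value
    n: int = 64,                          # samples per prompt in one call
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (engine_kwargs, sampling_kwargs) for a vLLM run."""
    engine_kwargs = {
        "model": model_path,
        "tensor_parallel_size": tensor_parallel_size,
        "gpu_memory_utilization": gpu_memory_utilization,
    }
    sampling_kwargs = {
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": max_tokens,
        "n": n,
    }
    return engine_kwargs, sampling_kwargs


engine_kwargs, sampling_kwargs = build_vllm_kwargs(
    "meta-llama/Llama-3.1-8B", tensor_parallel_size=2
)
```

Keeping the settings in plain dictionaries makes it easy to log the exact configuration alongside each run.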

### Evaluation Parameters

- **Progressive Sampling**: Start with 64 samples, increment by 64, up to 1024
- **Batch Processing**: Process 64 problems at once for memory efficiency
- **Grading**: Parallel grading with configurable worker count
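
The progressive schedule above can be sketched as a loop that keeps drawing samples only for problems that remain unsolved. This is an illustrative outline, not the code in `evaluate_progressive_batched.py`; `sample_and_grade` is a hypothetical callable that draws the requested number of fresh samples for a problem and returns whether any of them is correct.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def progressive_sample(
    problem_ids: Sequence[str],
    sample_and_grade: Callable[[str, int], bool],
    start: int = 64,
    step: int = 64,
    target: int = 1024,
) -> Tuple[List[str], Dict[str, int]]:
    """Return (ids still unsolved at the target budget, samples used per id)."""
    remaining = list(problem_ids)
    samples_used = {pid: 0 for pid in remaining}
    drawn = 0  # samples drawn so far for each still-unsolved problem
    while remaining and drawn < target:
        batch = min(start if drawn == 0 else step, target - drawn)
        still_unsolved = []
        for pid in remaining:
            samples_used[pid] += batch
            if not sample_and_grade(pid, batch):
                still_unsolved.append(pid)
        remaining = still_unsolved
        drawn += batch
    return remaining, samples_used


# Toy grader: "easy" is solved at once, "medium" on its third round, "hard" never.
calls: Dict[str, int] = {}


def fake_grader(pid: str, n: int) -> bool:
    calls[pid] = calls.get(pid, 0) + 1
    return pid == "easy" or (pid == "medium" and calls[pid] == 3)


unsolved, used = progressive_sample(["easy", "medium", "hard"], fake_grader)
```

Solved problems drop out early, so the full 1024-sample budget is spent only on problems that stay unsolved.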

### API Configuration

For GPT model evaluation:
- **Concurrency**: Number of parallel API requests
- **Fallbacks**: Enable model/provider fallbacks on OpenRouter
- **Rate Limiting**: Built-in retry logic with exponential backoff
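
As an illustration of retry with exponential backoff, the sketch below uses only the standard library; the repository lists `tenacity` among its dependencies, which offers comparable behavior through decorators. The helper name `call_with_backoff` and its default delays are assumptions, not the values used by the scripts.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with delays of 1s, 2s, 4s, ... (capped)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ValueError("max_attempts must be at least 1")


# Demo: a call that fails twice before succeeding; delays are recorded, not slept.
delays = []
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = call_with_backoff(flaky, sleep=delays.append)
```

In practice, adding random jitter to each delay helps avoid many workers retrying at the same instant.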

## Output Formats

### Progressive Evaluation Results

```
evaluations/
└── <experiment_name>/
    ├── challenging_problems_job_*.parquet  # Per-job challenging problems
    ├── challenging_problems_all.parquet    # Consolidated results
    └── summary.json                        # Summary statistics
```

### GPT Response Collection

```
outputs/
├── responses_<model>_dataset_seed_<seed>_n<samples>.jsonl
└── classify_<model>_<test>_seed_<seed>.jsonl
```
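
The JSONL outputs can be read back one record per line. The record fields are not documented here, so the sketch below makes no assumption about them and simply yields each parsed object; the helper name `iter_jsonl` is hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterator, Union


def iter_jsonl(path: Union[str, Path]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open("r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc


# Demo on a temporary file standing in for a real output file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text('{"id": 1}\n\n{"id": 2}\n', encoding="utf-8")
    records = list(iter_jsonl(demo_path))
```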

## Key Scripts

### `evaluate_progressive_batched.py`

The core evaluation script that:
- Loads models using vLLM for efficient inference
- Implements progressive sampling (start small, increase for hard problems)
- Processes data in batches to handle large datasets
- Grades responses using mathematical verification
- Saves the challenging problems, i.e., those with pass@k = 0 (no correct sample within the sampling budget)

### `fetch_responses.py`

Collects GPT model responses via the OpenRouter API:
- Supports multiple responses per question
- Handles various prompt formats
- Includes retry logic and error handling
- Saves results in JSONL format

### `classify_domain_difficulty.py`

Classifies mathematical problems:
- Uses GPT models for domain classification
- Assigns difficulty ratings based on competition standards
- Supports batch processing with resume capability
- Outputs structured classification data

## Advanced Usage

### Custom Datasets

To use your own dataset:
1. Convert to Parquet format with required columns
2. Update dataset paths in shell scripts
3. Adjust any dataset-specific processing in Python scripts
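
Before running an evaluation on a custom file, it can help to check for the expected columns. The sketch below assumes the required columns are the three documented for the main dataset (`prompt`, `reward_model`, `extra_info`); whether the scripts need all of them is an assumption, and `missing_columns` is a hypothetical helper.

```python
from typing import Iterable, List, Sequence

# Assumed to match the columns documented for the main dataset.
REQUIRED_COLUMNS = ("prompt", "reward_model", "extra_info")


def missing_columns(
    columns: Iterable[str], required: Sequence[str] = REQUIRED_COLUMNS
) -> List[str]:
    """Return the required column names that are absent, in their listed order."""
    present = set(columns)
    return [name for name in required if name not in present]


# With pandas, `columns` would typically be `df.columns`.
problems = missing_columns(["prompt", "extra_info", "source"])
```

An empty result means the file has every expected column.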

### Model Adaptation

To evaluate different models:
1. Update `MODEL_PATH` in evaluation scripts
2. Adjust tensor parallelism based on model size
3. Configure memory settings for your hardware

### Distributed Evaluation

For large-scale evaluation:
1. Set `NUM_JOBS` and `JOB_ID` in evaluation scripts
2. Run multiple jobs in parallel
3. Use aggregation mode to combine results
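
One simple way to split problems across jobs is strided sharding, sketched below. It assumes `JOB_ID` is zero-based and that each job takes every `NUM_JOBS`-th problem; the actual partitioning used by the evaluation scripts may differ, and `shard_indices` is a hypothetical helper.

```python
from typing import List


def shard_indices(num_problems: int, num_jobs: int, job_id: int) -> List[int]:
    """Return the problem indices handled by one job under strided sharding."""
    if num_jobs < 1:
        raise ValueError("num_jobs must be at least 1")
    if not 0 <= job_id < num_jobs:
        raise ValueError("job_id must satisfy 0 <= job_id < num_jobs")
    return list(range(job_id, num_problems, num_jobs))


# Example: the 181 benchmark problems split across 4 jobs.
shards = [shard_indices(181, 4, job_id) for job_id in range(4)]
```

Every index lands in exactly one shard, so concatenating the per-job results covers the full dataset without overlap.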

## Troubleshooting

### Common Issues

1. **CUDA Out of Memory**: Reduce batch size or increase tensor parallelism
2. **API Rate Limits**: Reduce worker count or add delays
3. **Dataset Format**: Ensure Parquet files have required columns
4. **Path Issues**: Use absolute paths or ensure working directory is correct

### Debug Mode

Enable debug logging by setting the `EVAL_DEBUG` environment variable:
```bash
export EVAL_DEBUG=1
```
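
As a rough sketch of how such a flag is typically consumed, the snippet below maps `EVAL_DEBUG` to a logging level. The accepted values and the helper names are assumptions; the repository's scripts may interpret the variable differently.

```python
import logging
import os
from typing import Mapping, Optional


def debug_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when EVAL_DEBUG is set to a truthy value (assumed: 1/true/yes)."""
    env = os.environ if env is None else env
    return env.get("EVAL_DEBUG", "").strip().lower() in {"1", "true", "yes"}


def configure_logging(env: Optional[Mapping[str, str]] = None) -> int:
    """Configure the root logger and return the chosen level."""
    level = logging.DEBUG if debug_enabled(env) else logging.INFO
    logging.basicConfig(level=level, format="%(asctime)s %(levelname)s %(message)s")
    return level
```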

## Citation

If you use MATH-Beyond in your research, please cite:

```bibtex
@article{mathbeyond2024,
  title={MATH-BEYOND: A Benchmark for RL to Expand Beyond the Base Model},
  author={Anonymous},
  year={2024}
}
```

## License

This project is released under the MIT License. See the `LICENSE` file for details.

## Contributing

We welcome contributions to improve MATH-Beyond. Please:
1. Fork the repository
2. Create a feature branch
3. Submit a pull request with a detailed description

## Support

For questions and issues, please open a GitHub issue with:
- Detailed problem description
- Steps to reproduce
- System configuration
- Error messages (if any)