# IMProofBench

An AI benchmark for research-level mathematical reasoning.

Website: [https://improofbench.math.ethz.ch/](https://improofbench.math.ethz.ch/)

## Overview

IMProofBench tests the ability of AI systems to solve problems drawn from real-world mathematical research by generating long-form proofs that maintain rigor and avoid hallucinations. Unlike existing benchmarks that focus on numerical answers or formal verification, IMProofBench evaluates mathematical arguments written in natural language.

## Key Features

- **Research-level problems**: PhD level and above, covering pure mathematics, applied mathematics, theoretical CS, and mathematical physics
- **Human grading**: Human mathematicians score proof quality
- **Unique-answer subquestions**: Enable fast automated partial evaluation
- **Private problem set**: The majority of problems are kept confidential to prevent training-data contamination

## Architecture

IMProofBench consists of two main components:

1. **Django Web Application**: A full-featured web platform for:
   - Creating and managing benchmark problems with LaTeX + Markdown support
   - Peer review workflow for question quality assurance
   - Blind grading interface for evaluating model responses
   - User management with role-based permissions (contributors, reviewers, admins)

2. **Model Evaluation Framework**: Based on [Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai):
   - Multi-turn prompting with tool access
   - Docker-sandboxed execution environment
   - Support for multiple AI providers (OpenAI, Anthropic, Google, xAI, Groq)
   - Mathematical computation tools (SageMath, Python, web search)

## Technology Stack

- **Backend**: Django 5.2, django-allauth, django-crispy-forms
- **Frontend**: Bootstrap 5, MathJax for LaTeX rendering
- **Evaluation**: Inspect AI framework, Docker sandboxing
- **Database**: SQLite

## Installation

### Prerequisites

- Python 3.10+ (the minimum version supported by Django 5.2)
- Docker (required for model evaluation sandboxing)
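
Both prerequisites can be checked up front with a small standard-library script. This is a minimal sketch and not part of the repository: the function name `check_prerequisites` and its parameters are hypothetical, and the minimum Python version is passed in explicitly so it can be matched to whatever the project requires.

```python
import shutil
import sys


def check_prerequisites(min_python, required_tools=("docker",)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info[:2] < tuple(min_python):
        found = ".".join(str(part) for part in sys.version_info[:2])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ required, found {found}")
    for tool in required_tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems


if __name__ == "__main__":
    # Django 5.2 supports Python 3.10 and newer, hence the floor used here.
    for problem in check_prerequisites(min_python=(3, 10)):
        print("WARNING:", problem)
```

The check only confirms that a `docker` executable is on `PATH`; it does not verify that the Docker daemon is running.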

### Setup

1. **Clone the Repository**
   ```bash
   git clone <url>
   cd improofbench
   ```

2. **Create and Activate Virtual Environment**
   ```bash
   python3 -m venv venv
   source venv/bin/activate # On Windows: venv\Scripts\activate
   ```

3. **Install Dependencies**
   ```bash
   pip install -r requirements.txt
   ```

4. **Configure Environment**
   ```bash
   cp web/.env.example web/.env
   # Edit web/.env with your settings (SECRET_KEY, API keys, etc.)
   ```

5. **Set Up Database**
   ```bash
   cp public_questions.db questions.db
   cd web
   python manage.py migrate
   ```

6. **Run Development Server**
   ```bash
   python manage.py runserver
   ```

7. **Access the Application**
   - Open your browser to `http://127.0.0.1:8000/`
   - Log in with the sample admin account:
     - Email: `admin@example.com`
     - Password: `Pr00fB3nch!2025`

### API Keys (Optional)

The AI testing and model evaluation features need provider API keys:
- `OPENAI_API_KEY`: set in `web/.env`; used by the quick AI testing feature
- Additional provider keys: stored in the database and managed via the Django admin
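
To confirm that `web/.env` actually defines the key before starting the server, a short script along the following lines may help. It is a sketch under the assumption that the file uses plain `KEY=VALUE` lines; the helper names `parse_env_text` and `missing_keys` are hypothetical, and the project itself may load the file differently.

```python
from pathlib import Path


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one pair of matching quotes.
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def missing_keys(env_values, required=("OPENAI_API_KEY",)):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env_values.get(key)]


if __name__ == "__main__":
    env_path = Path("web/.env")  # location used in the setup steps
    if env_path.exists():
        print("Missing keys:", missing_keys(parse_env_text(env_path.read_text())))
    else:
        print(f"{env_path} not found; copy web/.env.example first")
```

The script reports only whether a key is present and non-empty; it never prints the key's value.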

## Problem Format

Each benchmark problem contains:
- **Main question**: Requires a complete mathematical proof
- **Subquestions**: Short-answer questions with unique solutions for automated grading
- **Metadata**: Difficulty ratings, mathematical area, author information
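
The three parts listed above could be modeled roughly as follows. This is an illustrative sketch only: the class and field names (`Problem`, `Subquestion`, `expected_answer`) are hypothetical and do not reflect the actual Django models, and the whitespace-insensitive exact match is an assumed grading rule chosen for simplicity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Subquestion:
    prompt: str
    expected_answer: str  # the unique answer used for automated grading


@dataclass
class Problem:
    main_question: str  # requires a complete proof, graded by humans
    subquestions: List[Subquestion] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


def normalize(answer):
    """Lowercase and remove all whitespace so '1 / 2' and '1/2' compare equal."""
    return "".join(answer.split()).lower()


def grade_subquestions(problem, model_answers):
    """Return the fraction of subquestions answered correctly (0.0 if there are none).

    Answers are matched to subquestions by position; missing answers count as wrong.
    """
    if not problem.subquestions:
        return 0.0
    correct = sum(
        normalize(given) == normalize(sub.expected_answer)
        for sub, given in zip(problem.subquestions, model_answers)
    )
    return correct / len(problem.subquestions)


if __name__ == "__main__":
    # Toy problem for illustration; not taken from the benchmark.
    toy = Problem(
        main_question="Prove that the sum of two even integers is even.",
        subquestions=[
            Subquestion("What is 2 + 4?", "6"),
            Subquestion("What is 1 divided by 2, as a fraction?", "1/2"),
        ],
        metadata={"area": "elementary number theory"},
    )
    print(grade_subquestions(toy, ["6", "1 / 2"]))
```

Only the subquestions are scored here; the main question has no automated score because its proof is evaluated by human graders.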

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on contributing to IMProofBench.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contact
