# Puzzle Benchmark System

A research-oriented benchmark system for evaluating language models on domain-specific puzzle-solving tasks.

## Overview

This repository contains a puzzle benchmark system designed to evaluate the reasoning and domain knowledge capabilities of large language models (LLMs) across various academic and professional domains. The system presents models with domain-specific "puzzle" questions that require both factual knowledge and logical reasoning to solve.

## Project Structure

```
zipdata/
├── code/                    # Source code
│   ├── core/               # Core system components
│   │   ├── puzzle_system.py    # Main system controller
│   │   ├── data_manager.py     # Data loading and management
│   │   ├── game_session.py     # Individual test session logic
│   │   ├── api_client.py       # API client for LLM interactions
│   │   ├── logger_manager.py   # Logging and result management
│   │   └── response_parser.py  # Response parsing utilities
│   ├── prompts/            # Domain-specific prompts
│   ├── config.py           # Configuration settings
│   └── demo_single_test.py # Single puzzle demo script
└── data/                   # Test data and results
    ├── domains/            # Domain-specific puzzle datasets
    └── logs/              # Test results and session logs
```

## Features

- **Multi-domain Support**: Tests across various academic domains (biology, computer science, mathematics, etc.)
- **Configurable API Integration**: Supports multiple LLM providers (OpenAI, Anthropic, etc.)
- **Detailed Logging**: Comprehensive session logging and result tracking
- **Extensible Design**: Easy to add new domains and customize evaluation criteria

## Quick Start

### Prerequisites

- Python 3.8+
- Required packages: `asyncio`, `aiohttp`, `pandas`, `openpyxl`

### Installation

1. Clone this repository:
```bash
git clone <repository-url>
cd puzzle-benchmark
```

2. Install dependencies:
```bash
pip install asyncio aiohttp pandas openpyxl
```

3. Configure your API settings in `code/config.py`:
```python
API_CONFIGS = {
    "openai_compatible": {
        "base_url": "https://api.openai.com/v1/chat/completions",
        "api_key": "your-api-key-here",  # Replace with your actual API key
        "models": ["gpt-4o", "claude-opus-4"],
        # ... other settings
    }
}
```

### Running a Demo

To run a single puzzle test:

```bash
cd zipdata
python code/demo_single_test.py
```

This will:
1. Load available domains and puzzles
2. Select the first available puzzle
3. Run a complete test session (up to 15 rounds)
4. Display results and save logs

### Configuration

Key configuration options in `code/config.py`:

- **Models**: Configure QA and evaluation models
- **API Settings**: Set up API endpoints and authentication
- **Game Settings**: Adjust maximum rounds and other parameters
- **Paths**: Configure data and log directories

## API Integration

The system supports multiple LLM providers through a unified API client. Configure your preferred provider in the `API_CONFIGS` section:

1. **OpenAI Compatible APIs**: GPT-4, Claude, Gemini
2. **Custom Endpoints**: Any OpenAI-compatible API
3. **Multiple Providers**: Switch between different APIs for different models

## Data Format

Puzzle data is stored in Excel files with the following structure:

| Name | Description |
|------|-------------|
| Puzzle_1 | Description of the puzzle/question |
| Puzzle_2 | Another puzzle description |

Each domain should have its own directory under `data/domains/` with corresponding Excel files.

## Research Applications

This benchmark system is designed for:

- **Model Evaluation**: Compare reasoning capabilities across different LLMs
- **Domain Analysis**: Assess model performance in specific knowledge areas
- **Prompt Engineering**: Test different prompting strategies
- **Academic Research**: Provide standardized evaluation metrics

## Limitations

- Requires API access to language models (costs may apply)
- Single-threaded execution (concurrent features removed for simplicity)
- Limited to text-based puzzles (no multi-modal support)

## Contributing

This is a research-oriented project. For academic use and modifications:

1. Ensure proper attribution in publications
2. Follow ethical guidelines for AI research
3. Respect API usage limits and costs

## License

This project is released for academic and research purposes. Please cite appropriately if used in published research.

## Contact

For questions about this benchmark system or research collaborations, please refer to the associated academic publication.

---

**Note**: This system is provided as-is for research purposes. Users are responsible for API costs and ensuring compliance with relevant terms of service for language model providers.
