# EvaLearn Evaluation Framework

This repository contains a framework for evaluating language models on sequential question-answering tasks. It tests how well a model handles a sequence of related questions and tracks performance across question types and domains.

## Overview

The EvaLearn evaluation framework consists of:

1. A sequential evaluation tool (`evaluate.py`) that processes sequences of questions
2. A dataset of problem definitions (`EvaLearn_Problem.json`)
3. A dataset of sequence definitions (`EvaLearn_Sequence.json`)

## Getting Started

### Prerequisites

- Python 3.7+
- OpenAI API key or other LLM API access

### Installation

```bash
git clone https://github.com/yourusername/EvaLearn.git
cd EvaLearn
pip install -r requirements.txt
```

## Usage

The main entry point is the `sequentialEval` function in `evaluate.py`. You can run the script from the command line:

```bash
python EvaLearn/Evaluate/evaluate.py --input EvaLearn/Dataset/EvaLearn_Problem.json \
                                    --seq EvaLearn/Dataset/EvaLearn_Sequence.json \
                                    --output results.json \
                                    --workers 4 \
                                    --client-api-key YOUR_CLIENT_API_KEY \
                                    --judge-api-key YOUR_JUDGE_API_KEY
```

### Command Line Arguments

- `--input`: Path to the problem JSON file (required)
- `--seq`: Path to the sequence JSON file (required)
- `--output`: Path to save the evaluation results (required)
- `--workers`: Number of worker threads for parallel processing (optional)
- `--no-check-empty`: Skip checking for empty responses (optional)
- `--judge-api-key`: API key for the judge model (optional)
- `--client-api-key`: API key for the client model (optional)
- `--judge-model`: Model to use for judging (default: "gpt-4o-2024-11-20")
- `--client-model`: Model to use for client responses (default: "gpt-4o-2024-11-20")
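
To make the flags and their defaults concrete, here is a minimal `argparse` sketch that mirrors the list above. It is an illustration only, not the parser in `evaluate.py`: the function name `build_parser`, the help strings, and the placeholder file names are assumptions, and the real script may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the code in evaluate.py)."""
    parser = argparse.ArgumentParser(description="Sequential evaluation (illustrative sketch)")
    parser.add_argument("--input", required=True, help="Path to the problem JSON file")
    parser.add_argument("--seq", required=True, help="Path to the sequence JSON file")
    parser.add_argument("--output", required=True, help="Path to save the evaluation results")
    parser.add_argument("--workers", type=int, default=None, help="Number of worker threads")
    parser.add_argument("--no-check-empty", action="store_true", help="Skip the empty-response check")
    parser.add_argument("--judge-api-key", default=None, help="API key for the judge model")
    parser.add_argument("--client-api-key", default=None, help="API key for the client model")
    parser.add_argument("--judge-model", default="gpt-4o-2024-11-20")
    parser.add_argument("--client-model", default="gpt-4o-2024-11-20")
    return parser


# Parse a sample command line (placeholder file names) instead of sys.argv
args = build_parser().parse_args(
    ["--input", "problems.json", "--seq", "sequences.json",
     "--output", "results.json", "--workers", "4"]
)

# argparse converts dashes to underscores, so --no-check-empty becomes args.no_check_empty
check_empty = not args.no_check_empty
```

In this sketch, omitting `--no-check-empty` leaves `check_empty` set to `True`, which matches the `check_empty=True` default shown in the `sequentialEval` signature.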

### Using as a Library

You can also import and use the functions directly in your Python code:

```python
from EvaLearn.Evaluate.evaluate import sequentialEval, load_evaluation_data, select_sequences_for_evaluation

# Load data
sequences, problems_dict = load_evaluation_data(
    "EvaLearn/Dataset/EvaLearn_Sequence.json",
    "EvaLearn/Dataset/EvaLearn_Problem.json"
)

# Select specific sequences
selected_sequences = select_sequences_for_evaluation(
    sequences, 
    num_sequences=5,  # Randomly select 5 sequences
    sequence_types=["Logical Reasoning"]  # Only select sequences of this type
)

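# Run evaluation. sequentialEval takes file paths rather than loaded data,
# so selected_sequences above is not passed to it.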
sequentialEval(
    input_json_path="EvaLearn/Dataset/EvaLearn_Problem.json",
    seq_json_path="EvaLearn/Dataset/EvaLearn_Sequence.json",
    output_json_path="results.json",
    worker_nums=4,
    client_api_key="YOUR_CLIENT_API_KEY",
    judge_api_key="YOUR_JUDGE_API_KEY"
)
```

## Data Format

### Problem JSON Format

Each problem in `EvaLearn_Problem.json` has the following structure:

```json
{
  "id": 1,
  "type": "Logical Reasoning",
  "source": "LogicGame-crypto_puzzle",
  "level": 1,
  "prompt": ["The question text that will be presented to the model"],
  "rubric": "Evaluation criteria for judging the model's response",
  "canonical_answer": "The expected correct answer"
}
```

- `id`: Unique identifier for the problem
- `type`: Category of the problem (e.g., "Logical Reasoning", "Mathematical Reasoning")
- `source`: Origin of the problem
- `level`: Difficulty level
- `prompt`: The question text (can be a string or an array of strings)
- `rubric`: Criteria used by the judge model to evaluate responses
- `canonical_answer`: The expected correct answer
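
Because `prompt` may be either a string or an array of strings, code that consumes problem records will likely want a single shape. The helper below is a minimal sketch of one way to do that; `normalize_prompt` is a hypothetical name and is not part of the framework.

```python
from typing import List, Union


def normalize_prompt(prompt: Union[str, List[str]]) -> List[str]:
    """Return the prompt as a list of strings (hypothetical helper, not in evaluate.py)."""
    if isinstance(prompt, str):
        # A bare string becomes a one-element list
        return [prompt]
    if isinstance(prompt, list) and all(isinstance(part, str) for part in prompt):
        # Copy so callers cannot mutate the original problem record
        return list(prompt)
    raise TypeError("prompt must be a string or a list of strings")


single = normalize_prompt("What is 2 + 2?")
multi = normalize_prompt(["Read the passage.", "Extract the dates."])
```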

### Sequence JSON Format

Each sequence in `EvaLearn_Sequence.json` has the following structure:

```json
{
  "sequence_id": 1,
  "type": "Extraction",
  "question_ids": [252, 258, 297, 263, 245, 273, 241]
}
```

- `sequence_id`: Unique identifier for the sequence
- `type`: Category of the sequence (e.g., "Extraction", "Logical Reasoning")
- `question_ids`: Ordered list of problem IDs that form the sequence
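
The link between the two files is the problem `id`: each entry in `question_ids` refers to a problem record. The sketch below shows how a sequence could be resolved into its ordered problems. The toy records and the `resolve_sequence` helper are illustrative assumptions, not data or code from the repository.

```python
from typing import Any, Dict, List

# Toy records for illustration only; real problems carry more fields
toy_problems = [
    {"id": 258, "type": "Extraction", "prompt": ["Second question"]},
    {"id": 252, "type": "Extraction", "prompt": ["First question"]},
]
toy_sequence = {"sequence_id": 1, "type": "Extraction", "question_ids": [252, 258]}


def resolve_sequence(sequence: Dict[str, Any],
                     problems_by_id: Dict[int, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the problems of a sequence in the order given by question_ids."""
    return [problems_by_id[qid] for qid in sequence["question_ids"]]


# Index problems by id so each lookup is a dictionary access
problems_by_id = {problem["id"]: problem for problem in toy_problems}
ordered = resolve_sequence(toy_sequence, problems_by_id)
ordered_ids = [problem["id"] for problem in ordered]
```

The order comes from `question_ids`, not from the position of the problems in the problem file.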

## Key Functions

### `sequentialEval`

The main evaluation function that processes sequences of questions.

```python
sequentialEval(
    input_json_path,
    seq_json_path,
    output_json_path,
    worker_nums=None,
    check_empty=True,
    judge_api_key=None,
    client_api_key=None,
    judge_model="gpt-4o-2024-11-20",
    client_model="gpt-4o-2024-11-20"
)
```

### `load_evaluation_data`

Loads and validates sequence and problem data.

```python
load_evaluation_data(sequence_path, problem_path)
```
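
As a rough picture of what loading and validating could involve, the sketch below reads both files, indexes problems by `id`, and checks that every referenced problem exists. It assumes each file holds a top-level JSON list, which the format section suggests but does not state. `load_data_sketch` is a hypothetical stand-in, and the actual checks in `load_evaluation_data` may differ.

```python
import json
import os
import tempfile


def load_data_sketch(sequence_path, problem_path):
    """Hypothetical stand-in for load_evaluation_data; assumes both files hold a JSON list."""
    with open(sequence_path, encoding="utf-8") as f:
        sequences = json.load(f)
    with open(problem_path, encoding="utf-8") as f:
        problems = json.load(f)

    # Index problems by id for fast lookup
    problems_dict = {problem["id"]: problem for problem in problems}

    # Reject sequences that reference problems missing from the problem file
    for seq in sequences:
        missing = [qid for qid in seq["question_ids"] if qid not in problems_dict]
        if missing:
            raise ValueError(
                f"Sequence {seq['sequence_id']} references unknown problem IDs: {missing}"
            )
    return sequences, problems_dict


# Write two tiny toy files and load them back
with tempfile.TemporaryDirectory() as tmp:
    seq_path = os.path.join(tmp, "sequences.json")
    prob_path = os.path.join(tmp, "problems.json")
    with open(seq_path, "w", encoding="utf-8") as f:
        json.dump([{"sequence_id": 1, "type": "Extraction", "question_ids": [1, 2]}], f)
    with open(prob_path, "w", encoding="utf-8") as f:
        json.dump([{"id": 1}, {"id": 2}], f)
    sequences, problems_dict = load_data_sketch(seq_path, prob_path)
```

The return order `(sequences, problems_dict)` follows the library usage example in this README.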

### `select_sequences_for_evaluation`

Selects sequences for evaluation based on criteria.

```python
select_sequences_for_evaluation(
    sequences, 
    num_sequences=None, 
    sequence_ids=None, 
    sequence_types=None
)
```
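
The three optional parameters suggest filtering by ID, filtering by type, and random sampling. The sketch below shows one plausible way these could combine. The order in which the filters apply, the function name `select_sequences_sketch`, and the extra `seed` parameter are all assumptions made for illustration; consult `evaluate.py` for the real behavior.

```python
import random


def select_sequences_sketch(sequences, num_sequences=None, sequence_ids=None,
                            sequence_types=None, seed=None):
    """Illustrative selection logic; not the implementation in evaluate.py."""
    selected = list(sequences)
    if sequence_ids is not None:
        wanted_ids = set(sequence_ids)
        selected = [s for s in selected if s["sequence_id"] in wanted_ids]
    if sequence_types is not None:
        wanted_types = set(sequence_types)
        selected = [s for s in selected if s["type"] in wanted_types]
    # Sample only when fewer sequences are requested than remain after filtering
    if num_sequences is not None and num_sequences < len(selected):
        selected = random.Random(seed).sample(selected, num_sequences)
    return selected


# Toy sequences for illustration only
toy_sequences = [
    {"sequence_id": 1, "type": "Extraction", "question_ids": [252, 258]},
    {"sequence_id": 2, "type": "Logical Reasoning", "question_ids": [1, 2]},
    {"sequence_id": 3, "type": "Logical Reasoning", "question_ids": [3, 4]},
]

logical = select_sequences_sketch(toy_sequences, sequence_types=["Logical Reasoning"])
one = select_sequences_sketch(toy_sequences, num_sequences=1,
                              sequence_types=["Logical Reasoning"], seed=0)
```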

### `evaluate_sequence`

Evaluates a complete sequence of questions.

```python
evaluate_sequence(
    sequence, 
    problems_dict, 
    annotator, 
    output_dir, 
    save_results=True
)
```


## License

This project is licensed under the MIT License - see the LICENSE file for details.
