# Evaluation Guide

This guide provides step-by-step instructions for evaluating model-generated code patches against SWE-bench Atlas.



## 0. Prerequisites

Before you begin, ensure you have the following installed:
* Python (3.10 or newer)
* `git`
* [Docker](https://docs.docker.com/engine/install/). Ensure the Docker daemon is running before you start the evaluation.

> **Linux Docker Setup**
> If you are on Linux, we highly recommend following the [post-installation steps](https://docs.docker.com/engine/install/linux-postinstall/) to manage Docker as a non-root user.
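
The checks above can be scripted before a long run. The following is a minimal sketch, not part of the harness: the function name `check_prerequisites` and the result keys are illustrative assumptions, and the Docker daemon check assumes that `docker info` exits non-zero when the daemon is unreachable.

```python
import shutil
import subprocess
import sys


def check_prerequisites() -> dict[str, bool]:
    """Return a mapping of prerequisite name -> whether it appears satisfied.

    Hypothetical helper for illustration; it is not shipped with the harness.
    """
    results = {
        # The guide asks for Python 3.10 or newer.
        "python>=3.10": sys.version_info >= (3, 10),
        # `git` and the `docker` CLI must be discoverable on PATH.
        "git": shutil.which("git") is not None,
        "docker_cli": shutil.which("docker") is not None,
        "docker_daemon": False,
    }
    if results["docker_cli"]:
        try:
            # Assumption: a zero exit code from `docker info` means the
            # daemon is running and reachable by the current user.
            proc = subprocess.run(
                ["docker", "info"],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=30,
                check=False,
            )
            results["docker_daemon"] = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            results["docker_daemon"] = False
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

If `docker_cli` is reported as available but `docker_daemon` is not, the daemon is probably stopped or the current user lacks permission to reach it.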

## 1. Environment Setup
First, download the delivery folder, change into the `SWE-Bench-Atlas` folder, and set up a local Python virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

## 2. Data Preparation

Evaluation requires two files in `.jsonl` format: the testbed dataset and your model's predictions.

#### A. Testbed Dataset (`--dataset_name`)
This file contains the SWE-bench Atlas dataset: `dataset/swe_bench_atlas_eval.jsonl`

Pass the path to this file as the `--dataset_name` argument (see the commands in section 3a).

#### B. Model Predictions (`--predictions_path`)
This is the file you create, containing the patches generated by your model. Each line must be a single JSON object with the following structure:

* `instance_id` (string): A unique identifier in the format `repo_owner__repo_name-pull_request_number`. This must match an `instance_id` from the testbed dataset.
* `model_name_or_path` (string): An identifier for your model (e.g., "gpt-4-turbo").
* `model_patch` (string): The full diff/patch content generated by the model.

For testing purposes, the `dataset/` folder includes a sample `predictions.jsonl` that reuses the `gold_patch` of each instance.

> **Prediction File Format Example:**
> ```json
> {"instance_id": "sympy__sympy-20590", "model_name_or_path": "gpt-4", "model_patch": "diff --git a/sympy/core/sympify.py b/sympy/core/sympify.py\nindex 6a73a83..fb90e1a 100644\n--- a/sympy/core/sympify.py\n+++ b/sympy/core/sympify.py\n@@ -508,7 +508,7 @@ def sympify(a, locals=None, convert_xor=True, strict=False, rational=False,\n         converter[type(a)],\n         (SympifyError,\n          OverflowError,\n-         ValueError)):\n+         ValueError, AttributeError)):\n     return a\n"}
> {"instance_id": "another__repo-12345", "model_name_or_path": "gpt-4", "model_patch": "..."}
> ```
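
A predictions file in this format can be produced with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `write_predictions` and the in-memory `patches` mapping (instance ID to patch text) are hypothetical, and only the three field names come from the format described above.

```python
import json
from pathlib import Path


def write_predictions(patches: dict[str, str], model_name: str, out_path: str) -> int:
    """Write one JSON object per line with the three required fields.

    `patches` maps instance_id -> diff text (an illustrative input shape).
    Returns the number of records written.
    """
    path = Path(out_path)
    with path.open("w", encoding="utf-8") as fh:
        for instance_id, patch in patches.items():
            record = {
                "instance_id": instance_id,
                "model_name_or_path": model_name,
                "model_patch": patch,
            }
            # json.dumps escapes the newlines inside the patch, so each
            # record stays on a single line as JSONL requires.
            fh.write(json.dumps(record) + "\n")
    return len(patches)
```

Using `json.dumps` rather than string formatting avoids hand-escaping quotes and newlines in the diff text.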


## 3a. Run the Evaluation

The main evaluation script is `swebench.harness.run_evaluation`. Run it from the root of the `SWE-Bench-Atlas` folder.

### Evaluating Your Model

To evaluate your own model, point `--predictions_path` to your predictions file:
```bash
python -m swebench.harness.run_evaluation \
    --dataset_name <path/to/your/swe_bench_atlas_eval.jsonl> \
    --predictions_path <path/to/your/predictions.jsonl> \
    --namespace "" \
    --run_id <run_id> \
    --instance_ids <instance_id> \
    --atlas_eval
```

### Command Breakdown
| Argument | Description |
| :--- | :--- |
| `--dataset_name` | Path to the testbed `.jsonl` file. |
| `--predictions_path` | Path to your model-generated predictions `.jsonl` file. |
| `--run_id` | A unique name for your evaluation run (e.g., `gpt-4-turbo-run-1`). This name will be used for the output log directory. |
| `--namespace` | The Docker Hub namespace for the environment images. Defaults to `swe-bench`. |
| `--max_workers` | The number of parallel processes to use. Defaults to the number of CPU cores. |
| `--cache_level` | Level of caching for Docker images. Defaults to `env` (cache base and environment images). |
| `--clean` | Whether to clean up resources after evaluation. Defaults to true. |
| `--instance_ids` | Specific instances to evaluate (comma-separated). |
| `--timeout` | Maximum time (seconds) for evaluating each instance. |

For a complete list of arguments, run:
```bash
python -m swebench.harness.run_evaluation --help
```
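
When runs are launched from a script, the same command can be assembled as an argument list. The sketch below is illustrative only: `build_eval_command` is a hypothetical helper, it uses only flags that appear in this guide, and it joins instance IDs with commas because the table above describes them as comma-separated.

```python
import sys
from typing import Optional, Sequence


def build_eval_command(
    dataset_path: str,
    predictions_path: str,
    run_id: str,
    instance_ids: Optional[Sequence[str]] = None,
    max_workers: Optional[int] = None,
) -> list[str]:
    """Build the argv list for the evaluation command shown in this guide."""
    cmd = [
        sys.executable, "-m", "swebench.harness.run_evaluation",
        "--dataset_name", dataset_path,
        "--predictions_path", predictions_path,
        # Mirrors the example command, which passes an empty namespace.
        "--namespace", "",
        "--run_id", run_id,
        "--atlas_eval",
    ]
    if instance_ids:
        # Assumption: IDs are passed as one comma-separated value.
        cmd += ["--instance_ids", ",".join(instance_ids)]
    if max_workers is not None:
        cmd += ["--max_workers", str(max_workers)]
    return cmd
```

Passing the returned list to `subprocess.run(cmd, check=True)` from the repository root would start the run. Building a list instead of a shell string avoids quoting problems with paths that contain spaces.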

### Cache Levels

The `--cache_level` parameter controls how Docker images are cached between runs:

| Level | Description | Storage Impact | Speed |
|-------|-------------|----------------|-------|
| `none` | No caching | Minimal (~120GB during run) | Slowest |
| `base` | Cache only base image | Minimal (~120GB during run) | Slow |
| `env` (default) | Cache base and environment images | Moderate (~100GB) | Moderate |
| `instance` | Cache all images | High (~2,000GB) | Fastest |

Most users should use the default `env` level, which provides a good balance between speed and storage usage.

## 3b. Building and Persisting Images Only

In some cases, you may want to build the Docker environment images for the dataset **without running the evaluation**.

By default, `run_evaluation` cleans up images after execution. To persist them, use the `prepare_images` utility:

```bash
python -m swebench.harness.prepare_images \
    --dataset_name <path/to/your/swe_bench_atlas_eval.jsonl> \
    --tag prebuilt_v1
```

### Command Breakdown

| Argument         | Description                                                                                                                         |
| :--------------- | :---------------------------------------------------------------------------------------------------------------------------------- |
| `--dataset_name` | Path to the testbed `.jsonl` dataset file.                                                                                          |
| `--tag`          | A custom tag to assign to the built Docker images. Use this to differentiate between builds (e.g., `prebuilt_v1`).                  |

After `prepare_images` finishes, the built images remain available locally (they are **not** deleted automatically). Subsequent evaluation runs automatically reuse the already built images.

## 4. Understanding the Output Directory

All evaluation artifacts are stored in the `logs/` directory, organized by the unique `--run_id` that you provide for each evaluation. The final, aggregated report is written as a `.json` file at the root of the repository.

### Directory Structure

The `logs/` directory contains two main sub-directories: one for the Docker image build process and one for the evaluation runs themselves.

```
logs/
├── build_images/
│   └── instances/
│       └── {docker_env_instance_id}/
│           ├── Dockerfile
│           ├── build_image.log
│           └── setup_repo.sh
└── run_evaluation/
    └── {run_id}/
        └── {instance_id}/
            ├── report.json
            ├── run_instance.log
            ├── test_output_after.log
            ├── patch.diff
            └── eval.sh
```

### File Explanations

#### Environment Build Logs (`logs/build_images/...`)

This directory contains the files related to building the specific Docker environment for a given task. You should inspect these files if an instance fails very early with a Docker-related error.

| File | Purpose & How to Use It |
| :--- | :--- |
| `Dockerfile` | This is the exact Dockerfile generated by the harness to create the testing environment. Review this file to see which base image was used and what dependencies were installed. |
| `build_image.log` | Contains the complete log from the `docker build` command. **Look here first for environment setup failures**, such as a failed `apt-get install` or a Docker daemon error. |
| `setup_repo.sh` | An auxiliary script that is copied into the Docker image. It handles cloning the repository and checking out the correct commit. |

#### Evaluation Run Logs (`logs/run_evaluation/{run_id}/{instance_id}/`)

This is the most important directory for debugging. For each instance in your run, a folder is created containing a detailed breakdown of the evaluation process.

| File | Purpose & How to Use It |
| :--- | :--- |
| `run_instance.log` | **The master log for the instance.** This is the first file you should check for any failure. It contains high-level logs of the entire process: applying the patch, running the tests, and reporting the results. |
| `report.json` | A machine-readable summary of the final outcome for this single instance, including whether the task was resolved and other key metrics. |
| `test_output_after.log` | **The raw, unfiltered output from the test command** (e.g., `pytest`, `mvn test`). If `run_instance.log` shows that the tests ran but failed, this file will contain the specific error messages, stack traces, and test failures. |
| `patch.diff` | The exact patch generated by your model that was applied to the code before running the tests. Use this to verify that the patch was parsed correctly from your predictions file. |
| `eval.sh` | The shell script generated from the test specification that is executed inside the Docker container. This file shows the precise command used to run the tests. |
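
To find instances whose run stopped early, it can help to list which of these files are missing from each instance folder. The following is a rough sketch based only on the directory layout shown above; `missing_artifacts` and `EXPECTED_FILES` are hypothetical names, and a missing file is a hint for where to look rather than a definitive failure signal.

```python
from pathlib import Path

# File names taken from the directory structure described in this guide.
EXPECTED_FILES = (
    "report.json",
    "run_instance.log",
    "test_output_after.log",
    "patch.diff",
    "eval.sh",
)


def missing_artifacts(run_id: str, logs_root: str = "logs") -> dict[str, list[str]]:
    """Map each instance folder to the expected files it does not contain.

    Instances with every expected file present are omitted from the result.
    """
    run_dir = Path(logs_root) / "run_evaluation" / run_id
    missing: dict[str, list[str]] = {}
    if not run_dir.is_dir():
        return missing
    for instance_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
        absent = [name for name in EXPECTED_FILES if not (instance_dir / name).is_file()]
        if absent:
            missing[instance_dir.name] = absent
    return missing
```

For an instance reported here, `run_instance.log` (if present) is the natural first file to read.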


### Key Metrics
* **Total Instances**: Total number of problems in the testbed dataset.
* **Instances Submitted**: Number of instances for which your file provided a prediction.
* **Instances Completed**: Number of instances that ran to completion without crashing or timing out.
* **Instances Resolved**: The number of instances where the model's patch successfully passed the test suite.
* **Resolution Rate**: The percentage of *completed* instances that were successfully resolved (calculated as `Resolved ÷ Completed × 100%`).
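
As a worked example of the Resolution Rate formula: if 40 instances completed and 10 of them were resolved, the rate is 10 ÷ 40 × 100% = 25%. The helper below is only a sketch of that arithmetic; the function name and the choice to return `0.0` when nothing completed are assumptions, not harness behavior.

```python
def resolution_rate(resolved: int, completed: int) -> float:
    """Return Resolved / Completed * 100, as defined in Key Metrics."""
    if resolved < 0 or completed < 0 or resolved > completed:
        raise ValueError("expected 0 <= resolved <= completed")
    if completed == 0:
        # Assumption: report 0.0 rather than dividing by zero.
        return 0.0
    return resolved / completed * 100.0


print(resolution_rate(10, 40))  # prints 25.0
```

Note that the denominator is the number of completed instances, not the number submitted, so crashed or timed-out instances do not lower the rate.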

## 5. Troubleshooting

If you encounter issues during evaluation, follow these steps:

### General Troubleshooting Steps

1.  **Ensure Docker is Running**: The most common issue is the Docker daemon not being active or accessible.
2.  **Verify Prediction File**: Double-check that your predictions file is a valid `.jsonl` file (one complete, valid JSON object per line). Online JSONL validators can help.
3.  **Examine Logs**: The most detailed error information can be found within the run-specific log files inside `logs/run_evaluation/<your_run_id>/`.
4.  **Debug with a Single Worker**: If the script is crashing, running with a single worker provides clearer, sequential logs that make it easier to pinpoint the error. Add `--max_workers 1` to your run command.
5.  **Manage Disk Space**: Evaluation can consume significant disk space. Periodically run `docker system prune` to clear unused Docker images and containers, or use the `--cache_level=base` flag to minimize the storage used for Docker images.
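
Step 2 can be done locally instead of with an online validator. The sketch below checks each line against the three required string fields described in section 2; `validate_predictions` is a hypothetical helper, and it does not verify that the `instance_id` values exist in the testbed dataset.

```python
import json

REQUIRED_FIELDS = ("instance_id", "model_name_or_path", "model_patch")


def validate_predictions(path: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems: list[str] = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                problems.append(f"line {lineno}: empty line")
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {lineno}: expected a JSON object")
                continue
            for field in REQUIRED_FIELDS:
                # Each required field must be present and hold a string.
                if not isinstance(record.get(field), str):
                    problems.append(f"line {lineno}: field '{field}' missing or not a string")
    return problems
```

Running this before an evaluation catches malformed lines early, before any Docker images are built.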
