# Harnesses
This repository contains the task instance validation and evaluation harnesses for SWE Bench.

The following diagram fully captures the procedure we used to create the SWE Bench evaluation benchmark and fine-tuning dataset. The corresponding folders and files for each step of preprocessing are labeled accordingly.

<img src="../assets/construction-ref.png">

This README mainly discusses how the harnesses work. For information on the other steps, see the READMEs in the corresponding folders.

## Evaluation Harness
The evaluation harness (`instances_eval.py`) applies each model-generated patch and runs the tests to determine whether the patch is well formed and how many tests pass or fail. It can be invoked via the `run_instances_eval.sh` script with the following arguments:

```
python instances_eval.py \
    <predictions_path> \
    <instances_dir> \
    --path_conda \
    --testbed \
    --temp_dir \
    --timeout \
    --verbose \
    --num_workers
```

In general, the `instances_eval.py` script takes a set of patch predictions and the arguments above, then attempts to apply and test each patch at the associated base commit within the corresponding repo. One log file is generated per prediction. This log file is then compared against the log file of the original gold patch (generated by `instances_check.py`) to evaluate the output. More details are included in the `eval/` directory.
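
The log comparison can be pictured with a minimal sketch. It is not the harness's actual implementation: it assumes each log has already been parsed into a dictionary mapping test names to a status string, and the function name `compare_to_gold` and the `"PASSED"` status value are hypothetical, chosen only for illustration.

```python
def compare_to_gold(gold_status, pred_status):
    """Compare a prediction's test results against the gold patch's results.

    Both arguments are hypothetical, pre-parsed views of the log files:
    dictionaries mapping a test name to a status string such as "PASSED".
    """
    # Tests that the gold patch makes pass are the ones a prediction should match.
    expected = [name for name, status in gold_status.items() if status == "PASSED"]

    # A test missing from the prediction's log is counted as not passing.
    passed = [name for name in expected if pred_status.get(name) == "PASSED"]
    failed = [name for name in expected if name not in passed]

    return {
        "passed": passed,
        "failed": failed,
        # Assumption: a prediction matches the gold patch only if every
        # expected test passes and there was at least one expected test.
        "resolved": bool(expected) and not failed,
    }


if __name__ == "__main__":
    gold = {"test_a": "PASSED", "test_b": "PASSED", "test_c": "FAILED"}
    pred = {"test_a": "PASSED", "test_b": "FAILED"}
    print(compare_to_gold(gold, pred))
```

Treating a missing test as a failure is a design choice of this sketch: a patch that fails to apply typically produces no test results at all, and it should not be scored as a match.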

## Validation Harness
The validation harness (`instances_check.py`) checks whether a task instance can be used for evaluation. It can be invoked via the `run_instances_check.sh` script with the following arguments:

```
python instances_check.py \
    <instances_path> \
    <log_dir> \
    --path_conda \
    --testbed \
    --temp_dir \
    --timeout \
    --verbose \
    --num_workers
```

In general, the `instances_check.py` script takes this set of arguments and generates logs for each given task instance, recording an attempt to run and test the task instance at its base commit within the repo. To develop SWE Bench, we use these logs to keep only the tasks whose logs reflect the following:
* The repo must be **installed** successfully at the task instance's base commit
* The `test_patch` and `patch` objects must be **applied** successfully
* The `test_cmd` must **run** successfully
* The pre-`patch` and post-`patch` outputs of running `test_cmd` must be different
* The post-`patch` run should resolve at least one issue
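
The criteria above can be sketched as a single filter function. This is an illustrative sketch only: the `ValidationResult` class, its field names, and the `"PASSED"` status value are hypothetical and not part of `instances_check.py`. It also assumes that "resolving an issue" can be approximated by at least one test moving from not passing before the `patch` to passing after it.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    """Hypothetical summary of one task instance's validation logs."""
    installed: bool          # repo installed at the base commit
    patches_applied: bool    # both test_patch and patch applied cleanly
    tests_ran: bool          # test_cmd ran successfully
    # Test name -> status string, before and after applying the patch.
    pre_patch_status: dict = field(default_factory=dict)
    post_patch_status: dict = field(default_factory=dict)


def is_usable(result):
    """Return True if the instance meets all of the listed criteria."""
    # Installation, patch application, and the test run must all succeed.
    if not (result.installed and result.patches_applied and result.tests_ran):
        return False

    # The pre-patch and post-patch test outputs must differ.
    if result.pre_patch_status == result.post_patch_status:
        return False

    # Assumption: at least one test goes from not passing to passing.
    return any(
        status == "PASSED" and result.pre_patch_status.get(name) != "PASSED"
        for name, status in result.post_patch_status.items()
    )


if __name__ == "__main__":
    example = ValidationResult(
        installed=True,
        patches_applied=True,
        tests_ran=True,
        pre_patch_status={"test_fix": "FAILED", "test_old": "PASSED"},
        post_patch_status={"test_fix": "PASSED", "test_old": "PASSED"},
    )
    print(is_usable(example))
```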

