# SWE-Perf Evaluation

This repository provides scripts to evaluate patch predictions on the **SWE-Perf** benchmark.

## 1. Evaluation Workflow
The evaluation consists of two steps:
- `run_evaluation` – Runs each prediction instance individually to measure runtime and collect logs.
- `check_evaluation` – Processes the logs from step 1 and aggregates performance metrics into a CSV.

Run the evaluation with the following commands:

```bash
python -m run_evaluation \
    --dataset_name SWE-Perf/SWE-Perf \
    --split test \
    --predictions_path <path_to_predictions> \
    --max_workers <num_workers> \
    --run_id <run_id>

python -m check_evaluation \
    --dataset_dir SWE-Perf/SWE-Perf \
    --split test \
    --log_root <path_to_log> \
    --output_path <csv_output_path>
```

## 2. Arguments

### For `run_evaluation`

* **`--dataset_name`**: Dataset name (`SWE-Perf/SWE-Perf`).
* **`--predictions_path`**: Path to the model prediction file (see format below).
* **`--max_workers`**: Number of parallel workers for evaluation.
  **Note:** Each worker requires **5 CPU nodes**, so choose a value satisfying (see the sketch after this list):

  ```
  max_workers < total_CPU_nodes / 5
  ```
* **`--run_id`**: Identifier for this evaluation run.
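If you are unsure what to pass for `--max_workers`, a minimal sketch along these lines derives a conservative value from the locally visible CPU count. The divisor of 5 follows from the note above; the variable names are illustrative and not part of the evaluation scripts:

```python
import os

# Assumption: the "total_CPU_nodes" from the note above corresponds to the CPUs
# visible on this machine; adjust if your scheduler exposes a different count.
total_cpus = os.cpu_count() or 1

# The note asks for max_workers < total_CPU_nodes / 5, so floor-divide and
# back off by one when the division is exact, keeping at least one worker.
max_workers = total_cpus // 5
if total_cpus % 5 == 0:
    max_workers -= 1
max_workers = max(1, max_workers)

print(f"Suggested --max_workers: {max_workers}")
```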

### For `check_evaluation`

* **`--dataset_dir`**: Dataset name (`SWE-Perf/SWE-Perf`).
* **`--log_root`**: Path to the evaluation logs.
  This should be the log path output by the first command, typically:

  ```
  ../datasets/logs/run_evaluation/<run_id>/<model_name_or_path>
  ```
* **`--output_path`**: Output CSV file path containing the aggregated results.
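Once `check_evaluation` has written the CSV, you can inspect the aggregated results with a short snippet like the one below. The file name is whatever you passed as `--output_path`; the column names depend on the metrics the script emits, so none are assumed here:

```python
import pandas as pd

# Assumption: "results.csv" is the file passed as --output_path above.
results = pd.read_csv("results.csv")

# Show the columns the aggregation produced and the first few rows.
print(results.columns.tolist())
print(results.head())
```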


## 3. Prediction File Format

The input file (`<path_to_predictions>`) must be a **JSONL** file, where each line represents one prediction instance.

Each instance must include at least the following fields:

* **`instance_id`** *(string)*: Unique identifier of the instance.
* **`model_name_or_path`** *(string)*: Model name or path.
* **`model_patch`** *(string)*: The patch generated by the model (in unified diff format).

**Example** (a single JSONL line, wrapped here for readability):

```json
{"instance_id": "astropy__astropy-16065",
 "model_name_or_path": "deepseek-ai/DeepSeek-V3",
 "model_patch": "diff --git a/astropy/utils/diff.py b/astropy/utils/diff.py\nindex 0c77e3e..6cbc0ae 100644\n--- a/astropy/utils/diff.py\n+++ b/astropy/utils/diff.py\n@@ -194,12 +194,20 @@ def where_not_allclose(a, b, rtol=1e-5, atol=1e-8):\n     \"\"\"\n     # Create fixed mask arrays to handle INF and NaN; currently INF and NaN\n     # are handled as equivalent\n-    if not np.all(np.isfinite(a)):\n-        a = np.ma.fix_invalid(a).data\n-    if not np.all(np.isfinite(b)):\n-        b = np.ma.fix_invalid(b).data\n-\n     if atol == 0.0 and rtol == 0.0:\n-        # Use a faster comparison for the most simple (and common) case\n+        # Fast path for exact comparison\n         return np.where(a != b)\n-    return np.where(np.abs(a - b) > (atol + rtol * np.abs(b)))\n+    \n+    # Only fix invalid values if needed\n+    a_fixed = a\n+    b_fixed = b\n+    if not np.all(np.isfinite(a)):\n+        a_fixed = np.nan_to_num(a, copy=False)\n+    if not np.all(np.isfinite(b)):\n+        b_fixed = np.nan_to_num(b, copy=False)\n+    \n+    # Compute the comparison more efficiently\n+    diff = np.subtract(a_fixed, b_fixed, dtype=np.float64)\n+    abs_diff = np.abs(diff, out=diff)  # Reuse memory\n+    rhs = np.add(atol, np.multiply(rtol, np.abs(b_fixed), dtype=np.float64))\n+    return np.where(abs_diff > rhs)\n"}
```
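As a sanity check before launching the evaluation, a minimal sketch like the following writes predictions in the expected JSONL layout. The instance ID, model name, and patch contents are placeholders; only the three field names come from the format above:

```python
import json

# Hypothetical prediction entries; only the field names are required by the format.
predictions = [
    {
        "instance_id": "astropy__astropy-16065",
        "model_name_or_path": "deepseek-ai/DeepSeek-V3",
        "model_patch": "diff --git a/... (unified diff text)",
    },
]

# Write one JSON object per line, as a JSONL predictions file requires.
with open("predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")
```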

## 4. Example Commands

A ready-to-run example shell script is provided in [run_evaluation.sh](/evaluation/run_evaluation.sh).

You can use it as a template for your own evaluation runs.

