---
dataset_info:
  features:
  - name: repo
    dtype: string
  - name: commit
    dtype: string
  - name: path
    dtype: string
  - name: lang
    dtype: string
  - name: license
    dtype: string
  - name: message
    dtype: string
  - name: old_code
    dtype: string
  - name: new_code
    dtype: string
  - name: diff_string
    dtype: string
  - name: n_added
    dtype: int64
  - name: n_removed
    dtype: int64
  - name: n_hunks
    dtype: int64
  - name: change_kind
    dtype: string
  - name: header_changed_udiff
    dtype: string
  - name: line_prefix_udiff
    dtype: string
  - name: search_and_replace
    dtype: string
  splits:
  - name: test
    num_bytes: 5581156
    num_examples: 1000
  download_size: 2595562
  dataset_size: 5581156
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
---

# diff-xyz

diff-xyz contains 1,000 real-world code edits sampled and filtered from
the [CommitPackFT](https://huggingface.co/datasets/bigcode/commitpackft) dataset. Each example provides three
components: the original file contents (`old_code`), the modified contents (`new_code`), and the unified diff between
them (`diff_string`).

This format enables evaluation of LLM capabilities on three code editing tasks:

- **Apply**: Given old code and diff, generate new code
- **Anti-Apply**: Given new code and diff, generate old code
- **Diff-Generation**: Given old and new code, generate the unified diff
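
As a rough illustration of how the three tasks map onto the record fields, the sketch below splits one record into model inputs and the expected target. The helper name `make_task` and the task keys (`apply`, `anti_apply`, `diff_generation`) are hypothetical and not part of the dataset; prompt wording and scoring are left out.

```python
def make_task(record, task):
    """Return (inputs, target) for one of the three editing tasks.

    `make_task` and the task keys are illustrative names, not a dataset API.
    """
    old = record["old_code"]
    new = record["new_code"]
    diff = record["diff_string"]
    if task == "apply":
        # Old code + diff -> new code
        return {"old_code": old, "diff_string": diff}, new
    if task == "anti_apply":
        # New code + diff -> old code
        return {"new_code": new, "diff_string": diff}, old
    if task == "diff_generation":
        # Old code + new code -> unified diff
        return {"old_code": old, "new_code": new}, diff
    raise ValueError(f"unknown task: {task}")
```

Keeping the inputs as a dictionary leaves the prompt template up to the evaluation harness.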

## How to Read the Dataset

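```python
from datasets import load_from_disk

# `load_from_disk` expects a dataset previously written with `save_to_disk`;
# the path below is assumed to point at such a local copy.
ds = load_from_disk("diff_xyz.arrow")
```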

## How It's Built

Examples were selected from CommitPackFT using the following filters and sampling rules:

- **Scope**: Single-file changes only, excluding binary files, vendor directories, and generated code
- **Quality**: Removed trivial changes (whitespace-only) and likely test files
- **Size**: Required 40+ lines in at least one version of the code (old or new); excluded very large files (1000+ lines)
- **Sampling**:
    - Target 50/50 split between single-hunk and multi-hunk edits (hunks counted via `@@` markers in the unified diff).
    - Within each hunk group, stratify by change size (lines added + removed) using the 40th/80th percentiles of that
      group, targeting ≈40% small, 40% medium, 20% large.
    - Cap examples to ≤5 per repository per language.
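
The stratification and capping rules above can be sketched as follows. This is a minimal illustration, not the pipeline that built the dataset: the nearest-rank percentile convention, the "first seen wins" capping order, and the function names are all assumptions.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile (assumed convention; the card does not name one)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def size_bucket_labels(records):
    """Label each record small/medium/large within its hunk group."""
    by_group = defaultdict(list)
    for i, rec in enumerate(records):
        # Two groups: single-hunk (True) and multi-hunk (False)
        by_group[rec["n_hunks"] == 1].append(i)

    labels = [None] * len(records)
    for indices in by_group.values():
        sizes = [records[i]["n_added"] + records[i]["n_removed"] for i in indices]
        p40, p80 = percentile(sizes, 40), percentile(sizes, 80)
        for i, size in zip(indices, sizes):
            if size <= p40:
                labels[i] = "small"
            elif size <= p80:
                labels[i] = "medium"
            else:
                labels[i] = "large"
    return labels


def cap_per_repo_lang(records, cap=5):
    """Keep at most `cap` records per (repo, lang) pair, in input order."""
    seen = defaultdict(int)
    kept = []
    for rec in records:
        key = (rec["repo"], rec["lang"])
        if seen[key] < cap:
            seen[key] += 1
            kept.append(rec)
    return kept
```

Because the thresholds are computed per hunk group, the same change size can land in different buckets for single-hunk and multi-hunk edits.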

## Dataset Statistics (v1.0.0)

- **Total examples:** 1,000
- **Languages:** Python (200), JavaScript (200), Java (200), Kotlin (200), Rust (200)

**Hunks:**

- **1 hunk:** 500 (50.0%)
- **2+ hunks:** 500 (50.0%)

**Change size (added + removed), stratified within hunk groups:**

- **Small** (≤40th pct): 424 (42.4%)
- **Medium** (40th–80th): 388 (38.8%)
- **Large** (>80th): 188 (18.8%)

**Change type:**

- **Mixed** (additions + deletions): 815 (81.5%)
- **Add-only:** 163 (16.3%)
- **Delete-only:** 22 (2.2%)

**Repository diversity:** 891 unique repositories.

## Schema

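Each record contains the fields below. The dataset metadata also declares three additional string fields
(`header_changed_udiff`, `line_prefix_udiff`, `search_and_replace`) that this table does not describe.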

| Field         | Type   | Description                                                |
|---------------|--------|------------------------------------------------------------|
| `repo`        | string | Repository identifier (e.g., "owner/name")                 |
| `commit`      | string | Commit SHA that produced the change                        |
| `path`        | string | File path relative to repo root                            |
| `lang`        | string | One of: `python`, `javascript`, `java`, `kotlin`, `rust`   |
| `license`     | string | SPDX license identifier of the repo                        |
| `message`     | string | Original commit message                                    |
| `old_code`    | string | Complete file contents before change                       |
| `new_code`    | string | Complete file contents after change                        |
| `diff_string` | string | Unified diff from `old_code` → `new_code` (1-line context) |
| `n_added`     | int    | # of `+` lines (excluding headers)                         |
| `n_removed`   | int    | # of `-` lines (excluding headers)                         |
| `n_hunks`     | int    | # of `@@` (hunk) sections                                  |
| `change_kind` | string | `add_only`, `del_only`, or `mixed`                         |
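
The derived fields (`n_added`, `n_removed`, `n_hunks`, `change_kind`) can be recomputed from `diff_string`, which is a handy sanity check. The sketch below assumes a single-file unified diff whose file headers appear only before the first `@@` line; `diff_stats` is a hypothetical helper name.

```python
def diff_stats(diff_string):
    """Recompute the derived count fields from a single-file unified diff."""
    n_added = n_removed = n_hunks = 0
    in_hunk = False
    for line in diff_string.split("\n"):
        if line.startswith("@@"):
            n_hunks += 1
            in_hunk = True
        elif not in_hunk:
            continue  # skip the ---/+++ file headers before the first hunk
        elif line.startswith("+"):
            n_added += 1
        elif line.startswith("-"):
            n_removed += 1

    if n_added and n_removed:
        change_kind = "mixed"
    elif n_added:
        change_kind = "add_only"
    else:
        change_kind = "del_only"
    return {
        "n_added": n_added,
        "n_removed": n_removed,
        "n_hunks": n_hunks,
        "change_kind": change_kind,
    }


example_diff = (
    "--- a/src/utils.py\n+++ b/src/utils.py\n@@ -1,2 +1,4 @@\n"
    " def process(data):\n+    if data is None:\n+        return None\n"
    "     return data.strip()\n"
)
stats = diff_stats(example_diff)
# stats == {"n_added": 2, "n_removed": 0, "n_hunks": 1, "change_kind": "add_only"}
```

Tracking whether the first hunk has started avoids miscounting the `---`/`+++` headers as a removed and an added line.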

## Example

```json
{
  "repo": "example/project",
  "commit": "abc123...",
  "path": "src/utils.py",
  "lang": "python",
  "license": "mit",
  "message": "Fix null pointer exception in utility function",
  "old_code": "def process(data):\n    return data.strip()\n",
  "new_code": "def process(data):\n    if data is None:\n        return None\n    return data.strip()\n",
  "diff_string": "--- a/src/utils.py\n+++ b/src/utils.py\n@@ -1,2 +1,4 @@\n def process(data):\n+    if data is None:\n+        return None\n     return data.strip()\n",
  "n_added": 2,
  "n_removed": 0,
  "n_hunks": 1,
  "change_kind": "add_only"
}
```
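
For the Diff-Generation task, a reference diff with one line of context can be produced with Python's standard `difflib`. The card does not say which tool generated `diff_string`, so treat this as a sketch that reproduces the example record above rather than a guaranteed byte-for-byte match on every row (for instance, files without a trailing newline may be rendered differently). The helper name `make_unified_diff` is hypothetical.

```python
import difflib


def make_unified_diff(old_code, new_code, path, context=1):
    """Build a unified diff with `context` lines of context around each change."""
    diff_lines = difflib.unified_diff(
        old_code.splitlines(keepends=True),
        new_code.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        n=context,
    )
    return "".join(diff_lines)


old_code = "def process(data):\n    return data.strip()\n"
new_code = (
    "def process(data):\n    if data is None:\n        return None\n"
    "    return data.strip()\n"
)
generated = make_unified_diff(old_code, new_code, "src/utils.py")
print(generated)
```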
