# V2 Qualitative Side-by-Side Review

**Run:** `full_gpt2_small_stronger_20260430_200556`  
**Variants:** V2 stronger ablation ladder (4 variants)  
**Validation examples:** 20  
**Checkpoint epochs reviewed:** initial (epoch_005) + final (epoch_020)

The scores below are automatic proxies. The primary purpose of this
document is manual review of the generated-prose column.
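
The scorer used for this run is not shown in this document. As a rough guide to what the BLEU-1 and ROUGE-L proxies measure, here is a minimal sketch of one common formulation: clipped unigram precision with a brevity penalty, and an LCS-based F1, both over lowercased whitespace tokens. The function names and the tokenization are assumptions, so the numbers it produces may not match the tables.

```python
import math
from collections import Counter


def bleu1(reference, candidate):
    """Clipped unigram precision times a brevity penalty (whitespace tokens)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Credit each candidate token at most as often as it occurs in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    precision = overlap / len(cand)
    # Penalize candidates that are shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1.0 - len(ref) / len(cand))
    return bp * precision


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(reference, candidate):
    """F1 of LCS-based precision and recall (whitespace tokens)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Both metrics reward token overlap only, so a fluent explanation of the wrong function can outscore a garbled explanation of the right one. This is why the prose column still needs a manual read.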

## Aggregate Metrics (final checkpoint, automatic)

| Variant | BLEU-1 | ROUGE-L | Time claim (fraction) | Space claim (fraction) | Correct claim (fraction) |
|---|---|---|---|---|---|
| `No Claim→Claim Attn` | 0.1792 | 0.2015 | 1.0 | 1.0 | 1.0 |
| `Claims from Expl Only` | 0.183 | 0.1976 | 0.8 | 0.6 | 1.0 |
| `Surface Bottleneck` | 0.148 | 0.1674 | 1.0 | 1.0 | 1.0 |
| `Surface + No Expl LM` | 0.0014 | 0.0012 | 0.0 | 0.15 | 0.15 |
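
The claim columns above appear to be per-sample ✓ counts at the final checkpoint divided by the 20 validation examples. For example, 3 ✓ out of 20 would give 0.15. The run's aggregation code is not shown in this document, so the sketch below is an assumption. It uses hypothetical per-sample boolean flags named `time_ok`, `space_ok`, and `correct_ok`.

```python
def claim_rates(rows):
    """Fraction of samples whose flag is True, per claim type.

    `rows` is a list of dicts with hypothetical boolean keys
    `time_ok`, `space_ok`, and `correct_ok` (one dict per validation sample).
    """
    if not rows:
        return {}
    keys = ("time_ok", "space_ok", "correct_ok")
    return {k: sum(bool(r[k]) for r in rows) / len(rows) for k in keys}


# Illustrative input only: 20 samples, of which 3 pass the space and correct checks.
rows = [
    {"time_ok": False, "space_ok": i < 3, "correct_ok": i < 3}
    for i in range(20)
]
rates = claim_rates(rows)
print(rates)
```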

## Per-Sample Generations

Each sample shows the reference explanation, the mismatched training explanation,
the source function, and each variant's generated prose at the initial and final checkpoints.
Columns: Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose
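
When reading the prose column by hand, it can help to pull out the Big-O terms a generation states and list those that do not appear in the sample's ground truth. The sketch below is a reviewer aid under that assumption. It is not the logic behind the ✓ columns, which this document does not show, and the helper names are hypothetical.

```python
import re

# Matches terms such as O(1), O(n), O(n^2); nested parentheses are not handled.
BIG_O = re.compile(r"O\([^()]+\)")


def big_o_terms(prose):
    """Return every Big-O term stated in the prose, in order of appearance."""
    return BIG_O.findall(prose)


def unsupported_terms(prose, ground_truth):
    """Return Big-O terms in the prose that are absent from the ground truth."""
    allowed = set(ground_truth)
    return [term for term in big_o_terms(prose) if term not in allowed]


# Prose copied from a Sample 1 row; ground truth from the same sample.
prose = "Computesn! it er atively. O(n) time, O(1) space."
flagged = unsupported_terms(prose, ["O(n^2)", "O(1)"])
print(flagged)
```

Generations that phrase complexity in words ("constant time and space") carry no `O(...)` term, so the helper returns an empty list for them and they still need a manual read.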

### Sample 1: `check_all_pairs_equal`

Ground truth: time=`O(n^2)`, space=`O(1)`, correct=`1`

**Reference explanation:** Checks if all elements are equal via pairwise comparison. O(n^2) time, O(1) space.

**Mismatched training explanation:** Searches a list sequentially for a target value. O(n) time, O(1) space.

```python
def check_all_pairs_equal(lst):
    for i in range(len(lst)):
        for j in range(i + 1, len(lst)):
            if lst[i] != lst[j]:
                return False
    return True

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.2308 | 0.2857 | ✓ | ✓ | ✓ | Computesn! it er atively. O(n) time, O(1) space. |
| `No Claim→Claim Attn` | final (ep20) | 0.2308 | 0.2857 | ✓ | ✓ | ✓ | Computesn! it er atively. O(n) time, O(1) space. |
| `Claims from Expl Only` | initial (ep5) | 0.2308 | 0.2857 | ✗ | ✓ | ✓ | Computesn! it er atively. O(n) time, O(1) space. |
| `Claims from Expl Only` | final (ep20) | 0.2308 | 0.2857 | ✓ | ✗ | ✓ | Computesn! it er atively. O(n) time, O(1) space. |
| `Surface Bottleneck` | initial (ep5) | 0.2 | 0.2143 | ✓ | ✓ | ✓ | Checkswheth er an in g er isssed in gtememem in ean. O(n) time, O(1) space. |
| `Surface Bottleneck` | final (ep20) | 0.2308 | 0.2857 | ✓ | ✓ | ✓ | Computesn! it er atively. O(n) time, O(1) space. |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( i ( i ( ( return True <sep> <sep> ( ( ( ( ( ( <sep> ( <sep> ( ( ( ( ( ( ( <sep> ( ( <sep> ( ( ( ( ( <sep> ( ( ( ( ( ( ( ( ( <sep> ( ( ( ( ( ( ( (,,,,, ( ( ( ( ( ( ( ):,, ( <sep> ( ( ( ):, |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( i ( ( ( ( return True <sep> ( ( ( ( ( ( x, s <sep> ( ( ( ( ( s ( ss ( ( ss ( ( ( <sep> ( ( ( ( ssssssssssss ( ( ( ( ( ( ssssssssss ( ( ( ( ( (,, |

### Sample 2: `generate_all_pairs`

Ground truth: time=`O(n^2)`, space=`O(n^2)`, correct=`1`

**Reference explanation:** Generates all ordered pairs. O(n^2) time and O(n^2) space.

**Mismatched training explanation:** Builds a frequency map in O(n) time and O(n) space.

```python
def generate_all_pairs(lst):
    pairs = []
    for i in range(len(lst)):
        for j in range(len(lst)):
            pairs.append((lst[i], lst[j]))
    return pairs

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.0714 | 0.087 | ✓ | ✓ | ✓ | Checkswheth er a list iss ce nd in gord er. O(n) time, O(1) space. |
| `No Claim→Claim Attn` | final (ep20) | 0.25 | 0.2857 | ✓ | ✓ | ✓ | Re turnsuniqueelementspre se rv in gord er. O(n) time and O(n) space. |
| `Claims from Expl Only` | initial (ep5) | 0.0714 | 0.087 | ✗ | ✗ | ✓ | Checkswheth er a list iss ce nd in gord er. O(n) time, O(1) space. |
| `Claims from Expl Only` | final (ep20) | 0.25 | 0.2857 | ✓ | ✗ | ✓ | Re turnsuniqueelementspre se rv in gord er. O(n) time and O(n) space. |
| `Surface Bottleneck` | initial (ep5) | 0.0625 | 0.08 | ✓ | ✓ | ✓ | Checkswheth er a list iss ce nd in gorted in gord er. O(n) time, O(1) space. |
| `Surface Bottleneck` | final (ep20) | 0.1765 | 0.2308 | ✓ | ✓ | ✓ | Re turnsthesign of anumb er ( - 1, 0, or 1 ) in O(1) time and space. |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✗ | ✗ | ✗ | <sep> <sep> <sep> <sep> <sep> <sep> <sep> <sep> <sep> pp <sep> <sep> ( <sep> ( <sep> ( ( ( ( <sep> ( ( <sep> ( <sep> ( ( ( <sep> ( ( ( ( <sep> <sep> ( ( <sep> ( <sep> ( ( ( ( ( ( ( <sep> <sep> ( ( <sep> ( ( <sep> ( ( ( ( <sep> ( ( ( <sep> <sep> pppppp <sep> ( <sep> ( <sep> ( <sep> |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✗ | ✗ | sssssssssssssssssssssssssssssssssssssssssssisippssipsssipsipssisipsipssssssipsip |

### Sample 3: `selection_sort`

Ground truth: time=`O(n^2)`, space=`O(1)`, correct=`1`

**Reference explanation:** Selection sort: selects the minimum in O(n^2) nested passes, O(1) space.

**Mismatched training explanation:** Detects duplicates with nested loops. O(n^2) time, O(1) space.

```python
def selection_sort(lst):
    for i in range(len(lst)):
        min_idx = i
        for j in range(i + 1, len(lst)):
            if lst[j] < lst[min_idx]:
                min_idx = j
        lst[i], lst[min_idx] = lst[min_idx], lst[i]
    return lst

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.0 | 0.0 | ✓ | ✓ | ✓ | Computesthesum of a list withass. Tity O(n), sp a ce O(1). |
| `No Claim→Claim Attn` | final (ep20) | 0.1765 | 0.2143 | ✓ | ✓ | ✓ | Re turnsthesign of anumb er ( - 1, 0, or 1 ) in O(1) time and space. |
| `Claims from Expl Only` | initial (ep5) | 0.1818 | 0.2 | ✗ | ✗ | ✓ | Computesthevalue of anumb er in constant time and space. |
| `Claims from Expl Only` | final (ep20) | 0.0667 | 0.0769 | ✗ | ✓ | ✓ | Computesthesum of a list withas in glepass. Timecom pl exity O(n), sp a ce O(1). |
| `Surface Bottleneck` | initial (ep5) | 0.0909 | 0.1176 | ✓ | ✓ | ✓ | Computesprefixs. O(n) time and O(n) space. |
| `Surface Bottleneck` | final (ep20) | 0.0667 | 0.0769 | ✓ | ✓ | ✓ | Computesthesum of a list withas in glepass. Timecom pl exity O(n), sp a ce O(1). |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( ( ( ( ( x ], ( ( ( ( ( ( x, ls t <sep> ( ( x return r ( ( <sep> ( ( x <sep> ( ( ( <sep> ( ( ( <sep> ( <sep> ( ( ( x, y <sep> ( ( <sep> ( ( ( ( <sep> ( <sep> ( <sep> ( ( ( ( ( ( ( return, ( ( ( ( ( ( ( |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( ( ( ( ( x <sep> ( ( ( ( ( ( ( return, ls t <sep> ( ( ( ( ( ( ( ( ( ( x, <sep> ( ( ( ( ( ( <sep> ( ( ( ):,, ( <sep> ( ( ( <sep> ( ( ( ( <sep> ( ): return, ( ( <sep> ( ( ( ( return, ( ( ( ( ( ( ( |

### Sample 4: `selection_sort`

Ground truth: time=`O(n^2)`, space=`O(1)`, correct=`1`

**Reference explanation:** Selection sort: selects the minimum in O(n^2) nested passes, O(1) space.

**Mismatched training explanation:** Detects duplicates with nested loops. O(n^2) time, O(1) space.

```python
def selection_sort(lst):
    for i in range(len(lst)):
        min_idx = i
        for j in range(i + 1, len(lst)):
            if lst[j] < lst[min_idx]:
                min_idx = j
        lst[i], lst[min_idx] = lst[min_idx], lst[i]
    return lst

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.0 | 0.0 | ✓ | ✓ | ✓ | Computesthesum of a list withass. Tity O(n), sp a ce O(1). |
| `No Claim→Claim Attn` | final (ep20) | 0.1765 | 0.2143 | ✓ | ✓ | ✓ | Re turnsthesign of anumb er ( - 1, 0, or 1 ) in O(1) time and space. |
| `Claims from Expl Only` | initial (ep5) | 0.1818 | 0.2 | ✗ | ✗ | ✓ | Computesthevalue of anumb er in constant time and space. |
| `Claims from Expl Only` | final (ep20) | 0.0667 | 0.0769 | ✗ | ✓ | ✓ | Computesthesum of a list withas in glepass. Timecom pl exity O(n), sp a ce O(1). |
| `Surface Bottleneck` | initial (ep5) | 0.0909 | 0.1176 | ✓ | ✓ | ✓ | Computesprefixs. O(n) time and O(n) space. |
| `Surface Bottleneck` | final (ep20) | 0.0667 | 0.0769 | ✓ | ✓ | ✓ | Computesthesum of a list withas in glepass. Timecom pl exity O(n), sp a ce O(1). |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( ( ( ( ( x ], ( ( ( ( ( ( x, ls t <sep> ( ( x return r ( ( <sep> ( ( x <sep> ( ( ( <sep> ( ( ( <sep> ( <sep> ( ( ( x, y <sep> ( ( <sep> ( ( ( ( <sep> ( <sep> ( <sep> ( ( ( ( ( ( ( return, ( ( ( ( ( ( ( |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( ( ( ( ( x <sep> ( ( ( ( ( ( ( return, ls t <sep> ( ( ( ( ( ( ( ( ( ( x, <sep> ( ( ( ( ( ( <sep> ( ( ( ):,, ( <sep> ( ( ( <sep> ( ( ( ( <sep> ( ): return, ( ( <sep> ( ( ( ( return, ( ( ( ( ( ( ( |

### Sample 5: `matrix_multiply_buggy`

Ground truth: time=`O(n^2)`, space=`O(1)`, correct=`0`

**Reference explanation:** Attempts 2x2 matrix multiplication but is buggy (missing accumulation). O(n^2) time, O(1) auxiliary space.

**Mismatched training explanation:** Builds a frequency map in O(n) time and O(n) space.

```python
def matrix_multiply_2x2(A, B):
    C = [[0, 0], [0, 0]]
    for i in range(2):
        for j in range(2):
            C[i][j] = A[i][0] * B[0][j]  # bug: missing second term
    return C

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.0714 | 0.0909 | ✓ | ✓ | ✓ | Computesorixsum of aray. O(n) time and O(n) space. |
| `No Claim→Claim Attn` | final (ep20) | 0.0714 | 0.1 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Claims from Expl Only` | initial (ep5) | 0.0714 | 0.0909 | ✗ | ✓ | ✗ | Computesorixsum of array. O(n) time and O(n) space. |
| `Claims from Expl Only` | final (ep20) | 0.0714 | 0.1 | ✓ | ✗ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Surface Bottleneck` | initial (ep5) | 0.0714 | 0.1 | ✓ | ✓ | ✓ | Computesprray. O(n) time and O(n) space. |
| `Surface Bottleneck` | final (ep20) | 0.0714 | 0.1 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✗ | ✓ | ✓ | </claim> |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✓ | ✓ | </claim> |

### Sample 6: `matrix_multiply_buggy`

Ground truth: time=`O(n^2)`, space=`O(1)`, correct=`0`

**Reference explanation:** Attempts 2x2 matrix multiplication but is buggy (missing accumulation). O(n^2) time, O(1) auxiliary space.

**Mismatched training explanation:** Attempts to swap two values but contains a logic bug, returning incorrect results.

```python
def matrix_multiply_2x2(A, B):
    C = [[0, 0], [0, 0]]
    for i in range(2):
        for j in range(2):
            C[i][j] = A[i][0] * B[0][j]  # bug: missing second term
    return C

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.0714 | 0.0909 | ✓ | ✓ | ✓ | Computesorixsum of aray. O(n) time and O(n) space. |
| `No Claim→Claim Attn` | final (ep20) | 0.0714 | 0.1 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Claims from Expl Only` | initial (ep5) | 0.0714 | 0.0909 | ✗ | ✓ | ✗ | Computesorixsum of array. O(n) time and O(n) space. |
| `Claims from Expl Only` | final (ep20) | 0.0714 | 0.1 | ✓ | ✗ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Surface Bottleneck` | initial (ep5) | 0.0714 | 0.1 | ✓ | ✓ | ✓ | Computesprray. O(n) time and O(n) space. |
| `Surface Bottleneck` | final (ep20) | 0.0714 | 0.1 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✗ | ✓ | ✓ | </claim> |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✓ | ✓ | </claim> |

### Sample 7: `is_sorted`

Ground truth: time=`O(n)`, space=`O(1)`, correct=`1`

**Reference explanation:** Checks whether a list is sorted in ascending order. O(n) time, O(1) space.

**Mismatched training explanation:** Returns unique elements preserving order. O(n) time and O(n) space.

```python
def is_sorted(lst):
    for i in range(len(lst) - 1):
        if lst[i] > lst[i + 1]:
            return False
    return True

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.2308 | 0.2105 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `No Claim→Claim Attn` | final (ep20) | 0.2308 | 0.2105 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Claims from Expl Only` | initial (ep5) | 0.2308 | 0.2105 | ✓ | ✗ | ✗ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Claims from Expl Only` | final (ep20) | 0.2308 | 0.2105 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Surface Bottleneck` | initial (ep5) | 0.2308 | 0.2105 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Surface Bottleneck` | final (ep20) | 0.2308 | 0.2105 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( ( ( ( ( ( <sep>, <sep> ( ( ( ( ( ( ( ( ( <sep> ( ( ( ( ( ( <sep> ( ( ( ( ( ( ( ( ( ( ( ( (,,,, ( ( ( ( ( ( ( ( ( ( ( ( ( <sep>, ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( ( ( ( ( ( (, return True <sep> ( ( ( ( ( ( ( <sep> ( ( ( ( ( (, ( ( ( ):,, ( <sep> ( ( ( ( ( ): r ( ( ( ( ): return r ( pp ( ( ( ( s <sep> <sep> ( ( ( ( ( ( ppppsspppsss |

### Sample 8: `insertion_sort`

Ground truth: time=`O(n^2)`, space=`O(1)`, correct=`1`

**Reference explanation:** Insertion sort: O(n^2) worst-case time, O(1) space.

**Mismatched training explanation:** Searches a list sequentially for a target value. O(n) time, O(1) space.

```python
def insertion_sort(lst):
    for i in range(1, len(lst)):
        key = lst[i]
        j = i - 1
        while j >= 0 and lst[j] > key:
            lst[j + 1] = lst[j]
            j -= 1
        lst[j + 1] = key
    return lst

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.125 | 0.1333 | ✓ | ✓ | ✓ | Conv er tsCe ls iustoFahrenheit time and space. |
| `No Claim→Claim Attn` | final (ep20) | 0.1176 | 0.1667 | ✓ | ✓ | ✓ | Re turnsthesign of anumb er ( - 1, 0, or 1 ) in O(1) time and space. |
| `Claims from Expl Only` | initial (ep5) | 0.1429 | 0.1538 | ✗ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Claims from Expl Only` | final (ep20) | 0.375 | 0.4 | ✓ | ✗ | ✓ | Computesn! it er atively. O(n) time, O(1) space. |
| `Surface Bottleneck` | initial (ep5) | 0.125 | 0.1333 | ✓ | ✓ | ✓ | Conv er tsCe ls iustoFahrenheit time and space. |
| `Surface Bottleneck` | final (ep20) | 0.1176 | 0.1667 | ✓ | ✓ | ✓ | Re turnsthesign of anumb er ( - 1, 0, or 1 ) in O(1) time and space. |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( ( ( ( ( ( 1, ls t <sep> ( ( ( ( ( ( ( ( ( ( ( ( ( return rey, ( ( ( ( ( x, y, y ): return rey ): return rey <sep> ( ( ( ( ( ( ( ( ( ( ( ( ( x, y, <sep> ( ( ( ( ( ( <sep> ): |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( <sep> ( ( ( ( ( ( ( ( ( <sep> <sep> ( ( ( ( ( ( ( ( ):,,, ( ( x, s ):, ( ( ( x, s, y ):, su ( ( ):, kessu ( <sep> ( <sep> ): r ( ( ss |

### Sample 9: `selection_sort`

Ground truth: time=`O(n^2)`, space=`O(1)`, correct=`1`

**Reference explanation:** Selection sort: selects the minimum in O(n^2) nested passes, O(1) space.

**Mismatched training explanation:** Insertion sort: O(n^2) worst-case time, O(1) space.

```python
def selection_sort(lst):
    for i in range(len(lst)):
        min_idx = i
        for j in range(i + 1, len(lst)):
            if lst[j] < lst[min_idx]:
                min_idx = j
        lst[i], lst[min_idx] = lst[min_idx], lst[i]
    return lst

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.0 | 0.0 | ✓ | ✓ | ✓ | Computesthesum of a list withass. Tity O(n), sp a ce O(1). |
| `No Claim→Claim Attn` | final (ep20) | 0.1765 | 0.2143 | ✓ | ✓ | ✓ | Re turnsthesign of anumb er ( - 1, 0, or 1 ) in O(1) time and space. |
| `Claims from Expl Only` | initial (ep5) | 0.1818 | 0.2 | ✗ | ✗ | ✓ | Computesthevalue of anumb er in constant time and space. |
| `Claims from Expl Only` | final (ep20) | 0.0667 | 0.0769 | ✗ | ✓ | ✓ | Computesthesum of a list withas in glepass. Timecom pl exity O(n), sp a ce O(1). |
| `Surface Bottleneck` | initial (ep5) | 0.0909 | 0.1176 | ✓ | ✓ | ✓ | Computesprefixs. O(n) time and O(n) space. |
| `Surface Bottleneck` | final (ep20) | 0.0667 | 0.0769 | ✓ | ✓ | ✓ | Computesthesum of a list withas in glepass. Timecom pl exity O(n), sp a ce O(1). |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( ( ( ( ( x ], ( ( ( ( ( ( x, ls t <sep> ( ( x return r ( ( <sep> ( ( x <sep> ( ( ( <sep> ( ( ( <sep> ( <sep> ( ( ( x, y <sep> ( ( <sep> ( ( ( ( <sep> ( <sep> ( <sep> ( ( ( ( ( ( ( return, ( ( ( ( ( ( ( |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( ( ( ( ( x <sep> ( ( ( ( ( ( ( return, ls t <sep> ( ( ( ( ( ( ( ( ( ( x, <sep> ( ( ( ( ( ( <sep> ( ( ( ):,, ( <sep> ( ( ( <sep> ( ( ( ( <sep> ( ): return, ( ( <sep> ( ( ( ( return, ( ( ( ( ( ( ( |

### Sample 10: `matrix_multiply_buggy`

Ground truth: time=`O(n^2)`, space=`O(1)`, correct=`0`

**Reference explanation:** Attempts 2x2 matrix multiplication but is buggy (missing accumulation). O(n^2) time, O(1) auxiliary space.

**Mismatched training explanation:** Attempts to reverse a list in-place but has a bug: it swaps an element with itself.

```python
def matrix_multiply_2x2(A, B):
    C = [[0, 0], [0, 0]]
    for i in range(2):
        for j in range(2):
            C[i][j] = A[i][0] * B[0][j]  # bug: missing second term
    return C

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.0714 | 0.0909 | ✓ | ✓ | ✓ | Computesorixsum of aray. O(n) time and O(n) space. |
| `No Claim→Claim Attn` | final (ep20) | 0.0714 | 0.1 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Claims from Expl Only` | initial (ep5) | 0.0714 | 0.0909 | ✗ | ✓ | ✗ | Computesorixsum of array. O(n) time and O(n) space. |
| `Claims from Expl Only` | final (ep20) | 0.0714 | 0.1 | ✓ | ✗ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Surface Bottleneck` | initial (ep5) | 0.0714 | 0.1 | ✓ | ✓ | ✓ | Computesprray. O(n) time and O(n) space. |
| `Surface Bottleneck` | final (ep20) | 0.0714 | 0.1 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✗ | ✓ | ✓ | </claim> |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✓ | ✓ | </claim> |

### Sample 11: `sign`

Ground truth: time=`O(1)`, space=`O(1)`, correct=`1`

**Reference explanation:** Returns the sign of a number (-1, 0, or 1) in O(1) time and space.

**Mismatched training explanation:** Checks if all elements are equal via pairwise comparison. O(n^2) time, O(1) space.

```python
def sign(x):
    if x > 0: return 1
    if x < 0: return -1
    return 0

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.2 | 0.2857 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `No Claim→Claim Attn` | final (ep20) | 0.3529 | 0.375 | ✓ | ✓ | ✓ | Re turnsthelarg er of twonumb er sus in gacondit io nalexpress io n. O(1) time and space. |
| `Claims from Expl Only` | initial (ep5) | 0.2 | 0.2857 | ✗ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Claims from Expl Only` | final (ep20) | 0.1333 | 0.1739 | ✓ | ✓ | ✓ | Computesn! it er atively. O(n) time, O(1) space. |
| `Surface Bottleneck` | initial (ep5) | 0.2 | 0.2857 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Surface Bottleneck` | final (ep20) | 0.3529 | 0.375 | ✓ | ✓ | ✓ | Re turnsthelarg er of twonumb er sus in gacondit io nalexpress io n. O(1) time and space. |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✓ | ✗ | ✗ | ( ( ( 1 ( ( ( ( ( ( ): ( ( x, <sep> ( ( ( ( ): ): ( ( ( ( ( ( ( ( ( ( ): (, ( ( ): ( ): ( ( ( ( ( ( ( time_complexity=O(1), ( ( ( (, ( ): ( ( ( ): ( ( time_complexity=O(1) g ( ( ( time_complexity=O(1) return, ( ( ( ( ( ( ( ( ( ( |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( x <sep> ( ( ( ( ( ( ): ( ): ( ): ( x,, 0 <sep> ( ( ( ( ( ( ( ( ): ( ): ( ( ( ): ): ( ( ( ( ( ( ):, ( ( ( ): ( ): ( [i] ( ( ( g ( [i] *, return r ( g ( p <sep> <sep> ( ( ( ( ( ( ( ( |

### Sample 12: `string_contains`

Ground truth: time=`O(n)`, space=`O(1)`, correct=`1`

**Reference explanation:** Checks if a string contains a character via linear scan. O(n) time, O(1) space.

**Mismatched training explanation:** Reverses a string character by character. O(n) time, O(n) space due to string accumulation.

```python
def string_contains(s, ch):
    for c in s:
        if c == ch:
            return True
    return False

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.2143 | 0.2 | ✓ | ✓ | ✗ | Computesprefixsumarray. O(n) time and O(n) space. |
| `No Claim→Claim Attn` | final (ep20) | 0.1176 | 0.129 | ✓ | ✓ | ✓ | Re turnsthelarg er of twonumb er sus in gacondit io nalexpress io n. O(1) time and space. |
| `Claims from Expl Only` | initial (ep5) | 0.2143 | 0.2 | ✓ | ✓ | ✗ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Claims from Expl Only` | final (ep20) | 0.1176 | 0.129 | ✓ | ✓ | ✓ | Re turnsthelarg er of twonumb er sus in gacondit io nalexpress io n. O(1) time and space. |
| `Surface Bottleneck` | initial (ep5) | 0.2143 | 0.2 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Surface Bottleneck` | final (ep20) | 0.1176 | 0.129 | ✓ | ✓ | ✓ | Re turnsthelarg er of twonumb er sus in gacondit io nalexpress io n. O(1) time and space. |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( ( <sep> ( ( ( ( <sep> ( ( ( ( ( (, ( ( ( <sep> ( ( ( ( ( ( <sep> ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( (, ( ( (, ( ( ):,, ( ( ( ( ( ( ( ( ( ):, ( ( ( ( ( ( ( ( ( ( ( ( p |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( ( <sep> ( ( ( ( ( <sep> ( ( ( ( ( ( ( ( ( <sep> ( ( ( ( ( ( ( (, ( ( ):,, ( ( ( ): return r ( <sep> ( ( ( ( ( ( ( ( ( ( ( ): return r ( ( ( ( ( ( sss ): r ( sssssspps |

### Sample 13: `all_pairs_sum`

Ground truth: time=`O(n^2)`, space=`O(1)`, correct=`1`

**Reference explanation:** Sums all pairs from the list. Double nested loop gives O(n^2) time, O(1) space.

**Mismatched training explanation:** Checks whether a list is sorted in ascending order. O(n) time, O(1) space.

```python
def all_pairs_sum(lst):
    total = 0
    for i in range(len(lst)):
        for j in range(len(lst)):
            total += lst[i] + lst[j]
    return total

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.2143 | 0.2727 | ✓ | ✓ | ✓ | Computesn! it er atively. O(n) time, O(1) space. |
| `No Claim→Claim Attn` | final (ep20) | 0.1176 | 0.129 | ✓ | ✓ | ✓ | Re turnsthesign of anumb er ( - 1, 0, or 1 ) in O(1) time and space. |
| `Claims from Expl Only` | initial (ep5) | 0.0714 | 0.1 | ✗ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Claims from Expl Only` | final (ep20) | 0.0714 | 0.1 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Surface Bottleneck` | initial (ep5) | 0.0714 | 0.08 | ✓ | ✓ | ✓ | Computesn! it er al / ars. O(n) time and O(n) space. |
| `Surface Bottleneck` | final (ep20) | 0.1176 | 0.129 | ✓ | ✓ | ✓ | Re turnsthesign of anumb er ( - 1, 0, or 1 ) in O(1) time and space. |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( ( ( ( ( 1, ls t ( ( <sep> ( ( ( ( ( return re <sep> ( <sep> ( ( ( ( ( ( ( ( <sep> <sep> ( ( ( ( ( ( <sep> ( ( ( ( ( ( ( ( ( <sep> ( ( ( ( ( ( ( ( <sep>,, ( <sep> ( ( ( ( ( ( ( <sep> ( ( ( <sep> ( |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( ( ( ( ( x + = + ( ( return res <sep> ( ( ( ( ( ( ( <sep> ( ( ( ( ( ( ssssssss ( ssss ( ( ( ( sssssssssssssssssssssssss ( ( ( ( |

### Sample 14: `max_of_two`

Ground truth: time=`O(1)`, space=`O(1)`, correct=`1`

**Reference explanation:** Returns the larger of two numbers using a conditional expression. O(1) time and space.

**Mismatched training explanation:** Naive polynomial/array convolution. O(n^2) time, O(n) output space.

```python
def max_of_two(a, b):
    return a if a >= b else b

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.2143 | 0.2727 | ✓ | ✓ | ✗ | Computesthefixsumare of an. Tity O(n) time, O(1) space. |
| `No Claim→Claim Attn` | final (ep20) | 0.2857 | 0.3478 | ✓ | ✓ | ✓ | Computestheabsolutevalue of anumb er in constant time and space. |
| `Claims from Expl Only` | initial (ep5) | 0.2143 | 0.3 | ✗ | ✓ | ✓ | Computesprefixsumaray. O(n) time and O(n) space. |
| `Claims from Expl Only` | final (ep20) | 0.2857 | 0.3478 | ✓ | ✓ | ✓ | Computestheabsolutevalue of anumb er in constant time and space. |
| `Surface Bottleneck` | initial (ep5) | 0.2143 | 0.3 | ✓ | ✓ | ✓ | Computesprefixsumaray. O(n) time and O(n) space. |
| `Surface Bottleneck` | final (ep20) | 0.1429 | 0.1818 | ✓ | ✓ | ✓ | Computesn! it er atively. O(n) time, O(1) space. |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( (, b <sep> ( ( ( ( a, b ): return, ( ( ( ( ( ( ( ( ( x, y, ( <sep> ( ( ( ( ( ( ):, ( ( ( ( ( ( ):, y, ( ( ( ( ( ( ( ): return, y, y, ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( |
| `Surface + No Expl LM` | final (ep20) | 0.0278 | 0.0233 | ✗ | ✗ | ✗ | ( ( a ( ( a lse b ( b ( b <sep> ( ( ( ( ( a, ( ): return, ( ( ( a, y ):, <sep> ( ( ( ( ( ( <sep> ( ( ( ( ( ( ( ):, ( ( ):,, ( ): return ( ( ): return, ( ( ( ( ( ( <sep> <sep> <sep> ( ( ( ( ( ( ( ( ( |

### Sample 15: `string_reverse`

Ground truth: time=`O(n)`, space=`O(n)`, correct=`1`

**Reference explanation:** Reverses a string character by character. O(n) time, O(n) space due to string accumulation.

**Mismatched training explanation:** Insertion sort: O(n^2) worst-case time, O(1) space.

```python
def string_reverse(s):
    result = ''
    for ch in s:
        result = ch + result
    return result

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.1429 | 0.2 | ✓ | ✗ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `No Claim→Claim Attn` | final (ep20) | 0.1429 | 0.2 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Claims from Expl Only` | initial (ep5) | 0.1429 | 0.2 | ✓ | ✗ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Claims from Expl Only` | final (ep20) | 0.1429 | 0.2 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Surface Bottleneck` | initial (ep5) | 0.1429 | 0.2 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Surface Bottleneck` | final (ep20) | 0.1429 | 0.2 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✗ | ✗ | ✗ | pppppppppppppppppppppppppres <sep> resppppppppppppprppfrrpppprefrpfrespfpfrespfrespfrf |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ppppppppppppppppppppppppppppppppppppppppppppppppppppppprppfpppfrpppfrpfpfrpppfrr |

### Sample 16: `clamp`

Ground truth: time=`O(1)`, space=`O(1)`, correct=`1`

**Reference explanation:** Clamps a value between lo and hi bounds in constant time and space.

**Mismatched training explanation:** Flattens one level of nesting in a list of lists. O(n) time and O(n) space.

```python
def clamp(x, lo, hi):
    return max(lo, min(x, hi))

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.1333 | 0.1429 | ✓ | ✓ | ✗ | Computesarithmeticmean. The bu ilt - in s O(n); ov er al O(n) time, O(1) space. |
| `No Claim→Claim Attn` | final (ep20) | 0.2353 | 0.2667 | ✓ | ✓ | ✓ | Re turnsthesign of anumb er ( - 1, 0, or 1 ) in O(1) time and space. |
| `Claims from Expl Only` | initial (ep5) | 0.1333 | 0.1429 | ✗ | ✓ | ✗ | Computesarithmeticmean. The bu ilt - in sumis O(n); ov er all O(n) time, O(1) space. |
| `Claims from Expl Only` | final (ep20) | 0.5385 | 0.5 | ✓ | ✗ | ✓ | Re turnsthefirstelement of a list in constant time and constant space. |
| `Surface Bottleneck` | initial (ep5) | 0.1333 | 0.1429 | ✓ | ✓ | ✓ | Computesarithmeticmean. The bu ilt - in s O(n); ov er al O(n) time, O(1) space. |
| `Surface Bottleneck` | final (ep20) | 0.2353 | 0.2667 | ✓ | ✓ | ✓ | Re turnsthesign of anumb er ( - 1, 0, or 1 ) in O(1) time and space. |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✓ | ✗ | ✗ | ( ( ( <sep> ( ( ( ( ( (, ( ( ( ):, ( (, ( ( (, ( ( ( ( ):, ( ( ( ( (, ( ( ):, N, ( ( ( time_complexity=O(1), ( ( ( return re ( ( ( ( ( ( ( ( ( ( ( ( time_complexity=O(1) return, ( ( ( ( ( g ( ( ( time_complexity=O(1) return, time_complexity=O(1) |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ): (, <sep> ( ( ( ( ( ):, ( ( ( (, ( (,, ( ( ( ):, N return ( ( ):,, <sep> ( ( ( ( ( ( ( ( ):, ( ( ( ): return r ( ( ( ( ):, ( ( ( ( ( g ( ): r ( g ( g ( k ( ( ( k ( gp |

### Sample 17: `compute_mean`

Ground truth: time=`O(n)`, space=`O(1)`, correct=`1`

**Reference explanation:** Computes arithmetic mean. The built-in sum is O(n); overall O(n) time, O(1) space.

**Mismatched training explanation:** Naive polynomial/array convolution. O(n^2) time, O(n) output space.

```python
def compute_mean(lst):
    if not lst:
        return 0.0
    return sum(lst) / len(lst)

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.0769 | 0.0909 | ✓ | ✓ | ✓ | Computesthevalue of anumb er in constant time and space. |
| `No Claim→Claim Attn` | final (ep20) | 0.2308 | 0.16 | ✓ | ✓ | ✓ | Re turnsuniqueelementspre se rv in gord er. O(n) time and O(n) space. |
| `Claims from Expl Only` | initial (ep5) | 0.0769 | 0.0909 | ✓ | ✓ | ✗ | Computesthevalue of anumb er in constant time and space. |
| `Claims from Expl Only` | final (ep20) | 0.2308 | 0.16 | ✓ | ✗ | ✓ | Re turnsuniqueelementspre se rv in gord er. O(n) time and O(n) space. |
| `Surface Bottleneck` | initial (ep5) | 0.3077 | 0.381 | ✓ | ✓ | ✓ | Computesn! it er avely. O(n) time, O(1) space. |
| `Surface Bottleneck` | final (ep20) | 0.2308 | 0.16 | ✓ | ✓ | ✓ | Re turnsuniqueelementspre se rv in gord er. O(n) time and O(n) space. |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( <sep> ( ( ( ( <sep> ( x <sep> ( ( ( <sep> ( ( ( ( ( ( <sep> ( ( ( <sep> ( ( ( ( <sep> ( ( ( (, ( <sep> ( ( ( <sep> ( ( ( ( ( ( ( ( <sep> ( ( p ( p ( ( ( ( ( ( ( pp ( ( ( ( p ( ( ( (,,,, |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( x <sep> ( ( ( ( x, ( ( ( <sep> ( ( ( ( ( ( <sep> ( ( ( ( ( ( ( <sep> ( ( ( ): p, ( <sep> ( ( ( <sep> ( ( ( pp ( ( ( ppppppps ( ( ( ( ppsspppppsssspps |

### Sample 18: `prefix_sums`

Ground truth: time=`O(n)`, space=`O(n)`, correct=`1`

**Reference explanation:** Computes prefix sum array. O(n) time and O(n) space.

**Mismatched training explanation:** Converts Celsius to Fahrenheit in constant time and space.

```python
def prefix_sums(lst):
    ps = [0] * (len(lst) + 1)
    for i, x in enumerate(lst):
        ps[i + 1] = ps[i] + x
    return ps

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.3333 | 0.3333 | ✓ | ✗ | ✓ | Computesoluthevalue of anumb er in const time and space. |
| `No Claim→Claim Attn` | final (ep20) | 0.2222 | 0.2353 | ✓ | ✓ | ✓ | Computesn! it er atively. O(n) time, O(1) space. |
| `Claims from Expl Only` | initial (ep5) | 0.3333 | 0.3333 | ✓ | ✗ | ✓ | Computesthevalue of anumb er in constant time and space. |
| `Claims from Expl Only` | final (ep20) | 0.3333 | 0.3333 | ✓ | ✓ | ✓ | Computestheabsolutevalue of anumb er in constant time and space. |
| `Surface Bottleneck` | initial (ep5) | 0.2222 | 0.2353 | ✓ | ✓ | ✓ | Computesn! it er atively. O(n) time, O(1) space. |
| `Surface Bottleneck` | final (ep20) | 0.2222 | 0.2353 | ✓ | ✓ | ✓ | Computesn! it er atively. O(n) time, O(1) space. |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ps <sep> pps <sep> ppppppppppps <sep> ppppppppppppppppppppppps <sep> ppppppppppfpppppfrpppfrpfrppfrrppfr |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✗ | ✗ | pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp |

### Sample 19: `string_contains`

Ground truth: time=`O(n)`, space=`O(1)`, correct=`1`

**Reference explanation:** Checks if a string contains a character via linear scan. O(n) time, O(1) space.

**Mismatched training explanation:** Detects duplicates with nested loops. O(n^2) time, O(1) space.

```python
def string_contains(s, ch):
    for c in s:
        if c == ch:
            return True
    return False

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.2143 | 0.2 | ✓ | ✓ | ✗ | Computesprefixsumarray. O(n) time and O(n) space. |
| `No Claim→Claim Attn` | final (ep20) | 0.1176 | 0.129 | ✓ | ✓ | ✓ | Re turnsthelarg er of twonumb er sus in gacondit io nalexpress io n. O(1) time and space. |
| `Claims from Expl Only` | initial (ep5) | 0.2143 | 0.2 | ✓ | ✓ | ✗ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Claims from Expl Only` | final (ep20) | 0.1176 | 0.129 | ✓ | ✓ | ✓ | Re turnsthelarg er of twonumb er sus in gacondit io nalexpress io n. O(1) time and space. |
| `Surface Bottleneck` | initial (ep5) | 0.2143 | 0.2 | ✓ | ✓ | ✓ | Computesprefixsumarray. O(n) time and O(n) space. |
| `Surface Bottleneck` | final (ep20) | 0.1176 | 0.129 | ✓ | ✓ | ✓ | Re turnsthelarg er of twonumb er sus in gacondit io nalexpress io n. O(1) time and space. |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( ( <sep> ( ( ( ( <sep> ( ( ( ( ( (, ( ( ( <sep> ( ( ( ( ( ( <sep> ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( (, ( ( (, ( ( ):,, ( ( ( ( ( ( ( ( ( ):, ( ( ( ( ( ( ( ( ( ( ( ( p |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ( ( ( ( ( ( <sep> ( ( ( ( ( <sep> ( ( ( ( ( ( ( ( ( <sep> ( ( ( ( ( ( ( (, ( ( ):,, ( ( ( ): return r ( <sep> ( ( ( ( ( ( ( ( ( ( ( ): return r ( ( ( ( ( ( sss ): r ( sssssspps |

### Sample 20: `naive_convolution`

Ground truth: time=`O(n^2)`, space=`O(n)`, correct=`1`

**Reference explanation:** Naive polynomial/array convolution. O(n^2) time, O(n) output space.

**Mismatched training explanation:** Computes prefix sum array. O(n) time and O(n) space.

```python
def naive_convolution(a, b):
    n = len(a) + len(b) - 1
    result = [0] * n
    for i in range(len(a)):
        for j in range(len(b)):
            result[i + j] += a[i] * b[j]
    return result

```

| Variant | Ckpt | BLEU-1 | ROUGE-L | Time✓ | Space✓ | Correct✓ | Generated prose |
|---|---|---|---|---|---|---|---|
| `No Claim→Claim Attn` | initial (ep5) | 0.375 | 0.2857 | ✓ | ✓ | ✓ | Computesprefixsumaray. O(n) time and O(n) space. |
| `No Claim→Claim Attn` | final (ep20) | 0.1875 | 0.1667 | ✓ | ✓ | ✓ | Checkswheth er a list issorted in as ce nd in gord er. O(n) time, O(1) space. |
| `Claims from Expl Only` | initial (ep5) | 0.375 | 0.2857 | ✗ | ✗ | ✓ | Computespreficon. O(n) time and O(n) space. |
| `Claims from Expl Only` | final (ep20) | 0.1875 | 0.1667 | ✗ | ✓ | ✓ | Checkswheth er a list issorted in as ce nd in gord er. O(n) time, O(1) space. |
| `Surface Bottleneck` | initial (ep5) | 0.375 | 0.25 | ✓ | ✗ | ✓ | Computesprvalue of ay. O(n) time and O(n) space. |
| `Surface Bottleneck` | final (ep20) | 0.1111 | 0.1176 | ✓ | ✓ | ✓ | Computestheabsolutevalue of anumb er in constant time and space. |
| `Surface + No Expl LM` | initial (ep5) | 0.0 | 0.0 | ✗ | ✓ | ✓ | pppppppsu </claim> |
| `Surface + No Expl LM` | final (ep20) | 0.0 | 0.0 | ✗ | ✗ | ✗ | ssssssssssssssssssssssssssssssssssssssspsssspsstsipsipssssssipsipssipssipsipsips |

## Manual Scoring Guide

For each sample, score each variant's **final-checkpoint prose** on the criteria below (the last one also requires reading the initial-checkpoint prose):

| Criterion | Scale | Notes |
|---|---|---|
| Fluency | 0–2 | 0=incoherent, 1=partial, 2=fluent |
| Factual accuracy | 0–2 | Does the prose match the ground-truth complexity and correctness? |
| Claim alignment | 0–1 | Do emitted `<claim>` tags match ground truth? |
| Initial→Final improvement | 0–1 | Did the prose improve from the initial to the final checkpoint? |
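
When scoring factual accuracy, it can help to pull the complexity bounds stated in the generated prose and set them next to the ground truth. The helper below is a minimal sketch for that purpose only. The function name `stated_complexity` is hypothetical, the patterns cover just the phrasings seen in this review ("O(n) time, O(1) space", "O(1) time and space", "constant time and space"), and it is not how the automatic Time✓/Space✓ columns were computed.

```python
import re

# Matches a big-O term such as O(n), O(1), O(n^2), O(n log n).
_BIG_O = r"O\([^)]*\)"


def stated_complexity(prose):
    """Return (time, space) as written in the prose, or None where absent.

    Hypothetical reviewer aid: covers only the phrasings seen in this report.
    """
    # Treat "constant time" wording as O(1).
    text = prose.replace("constant", "O(1)")
    # "O(1) time and space" states a single bound for both resources.
    both = re.search(rf"({_BIG_O}) time and space", text)
    if both:
        return both.group(1), both.group(1)
    time = re.search(rf"({_BIG_O}) time", text)
    # Allow one qualifier word, as in "O(n) output space".
    space = re.search(rf"({_BIG_O})(?: \w+)? space", text)
    return (
        time.group(1) if time else None,
        space.group(1) if space else None,
    )
```

Applied to the final-checkpoint prose of `Surface Bottleneck` on Sample 17, this returns `('O(n)', 'O(n)')`, while the ground truth is time=`O(n)`, space=`O(1)`. The prose therefore misstates space even though the Space✓ column shows ✓.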

Aggregate per-variant totals and compare across the V2 ablation ladder.
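Key question: does `surface_bottleneck_consistency` (`Surface Bottleneck` in the tables) or
`claims_from_explanation_only` (`Claims from Expl Only`) show better factual accuracy than
`no_claim_to_claim_attention` (`No Claim→Claim Attn`, the variant closest to the V1 baseline)?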
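
The per-variant aggregation described above can be sketched as follows. This assumes manual scores are recorded as one dict per (sample, variant); the key names (`variant`, `fluency`, `factual_accuracy`, `claim_alignment`, `improvement`) are hypothetical and not taken from the files in this directory. The range limits come from the scoring table.

```python
from collections import defaultdict

# Maximum score per criterion, taken from the scoring table.
# The key names are illustrative assumptions.
MAX_SCORE = {
    "fluency": 2,
    "factual_accuracy": 2,
    "claim_alignment": 1,
    "improvement": 1,
}


def variant_totals(rows):
    """Sum manual scores per variant, rejecting out-of-range values."""
    totals = defaultdict(int)
    for row in rows:
        for criterion, limit in MAX_SCORE.items():
            score = row[criterion]
            if not 0 <= score <= limit:
                raise ValueError(f"{criterion}={score} outside 0..{limit}")
            totals[row["variant"]] += score
    return dict(totals)


# Placeholder scores for illustration only, not real review results.
example = [
    {"variant": "A", "fluency": 1, "factual_accuracy": 0,
     "claim_alignment": 1, "improvement": 0},
    {"variant": "A", "fluency": 2, "factual_accuracy": 1,
     "claim_alignment": 1, "improvement": 1},
]
print(variant_totals(example))
```

With 20 validation examples and a maximum of 6 points per sample, each variant's total falls between 0 and 120.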

Files in this directory:
- `qualitative_side_by_side_v2.csv` — machine-readable, one row per (variant, checkpoint, example)
- `qualitative_side_by_side_v2.json` — same data as JSON with manifest header
- `qualitative_side_by_side_v2.pplx.md` — this document
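
Since manual scoring targets the final checkpoint, a reviewer may want to pull only those rows from the CSV. The sketch below uses the standard-library `csv` module. The column name `checkpoint` and the label prefix `final` are assumptions modeled on the `Ckpt` column of the tables above; check the actual CSV header before relying on them.

```python
import csv


def final_checkpoint_rows(lines, checkpoint_column="checkpoint", final_label="final"):
    """Yield CSV rows whose checkpoint value starts with final_label.

    `lines` is an open text file or any iterable of CSV lines.
    The default column name and label are assumptions, not confirmed
    against the real file.
    """
    for row in csv.DictReader(lines):
        if row[checkpoint_column].startswith(final_label):
            yield row
```

A typical call would open `qualitative_side_by_side_v2.csv` with `open(path, newline="")` and pass the file object as `lines`.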