# Manual Qualitative Review of Natural-Language Explanations

## Bottom line

Manual review does not support the claim that `consistency_loss` improved natural-language explanation correctness over the baselines. Across the same 20 validation examples, none of the four variants produced a prose explanation that was fully semantically correct under a strict behavior-plus-claims rubric. The models mostly emitted a small set of memorized explanation templates, while the structured `<claim>` tokens were usually correct.

The consistency-loss variant did produce strong hidden-state coupling and correct structured claims in the quantitative run, but its prose contradicted those claims in all 20 reviewed examples. For example, it generated single-pass O(n) prose for O(n^2) sorting and nested-loop examples, then emitted the correct structured `time_complexity=O(n^2)` claim.

## Scoring rubric

Each generated natural-language prose segment was scored independently of the structured `<claim>` tokens on four dimensions:

- **Behavior correctness**: Does the prose describe the actual function behavior?
- **Time complexity in prose**: Does the prose state the correct time class?
- **Space complexity in prose**: Does the prose state the correct space class?
- **Bug/correctness status**: Does the prose indicate buggy behavior when the function is buggy, and avoid false bug claims for correct functions?

A prose explanation was counted as fully correct only if all four dimensions were correct. A contradiction was counted when the prose time, space, or bug-status statement disagreed with the ground-truth structured claims emitted later in the same generation.
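The rubric can be written down as a small data structure. The sketch below is illustrative only: `ProseScore` and its fields are hypothetical names, not part of the experiment code, and the contradiction check assumes the emitted structured claims match ground truth, so any wrong time, space, or bug-status statement in the prose counts as a contradiction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProseScore:
    """Manual judgment of one prose explanation (True = dimension correct).

    Hypothetical helper for illustration; not the reviewers' actual tooling.
    """
    behavior: bool
    time: bool
    space: bool
    bug_status: bool

    @property
    def partial(self) -> int:
        # Number of rubric dimensions judged correct (0 to 4).
        return sum([self.behavior, self.time, self.space, self.bug_status])

    @property
    def fully_correct(self) -> bool:
        # Fully correct only if all four dimensions are correct.
        return self.partial == 4

    @property
    def contradicts_claims(self) -> bool:
        # Assumption: the structured claims equal ground truth, so a wrong
        # time, space, or bug-status statement in prose is a contradiction.
        # Behavior is excluded because the claims do not encode it.
        return not (self.time and self.space and self.bug_status)


# Judgments for Sample 15 (`string_reverse`) as reported in this review.
sample_15 = {
    "consistency_loss": ProseScore(False, False, False, True),
    "no_consistency_loss": ProseScore(False, True, True, True),
}

for variant, score in sample_15.items():
    print(variant, score.partial, score.fully_correct, score.contradicts_claims)
```

Under these assumptions, a prose segment can miss only the behavior dimension and still not count as a contradiction, which is why the contradiction rate can be lower than the rate of not-fully-correct prose.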

## Aggregate manual scores

| Variant | Behavior correct | Time correct | Space correct | Bug status correct | Fully correct prose | Prose contradicts claims |
|---|---:|---:|---:|---:|---:|---:|
| `claim_only_pooling` | 0/20 (0%) | 5/20 (25%) | 16/20 (80%) | 17/20 (85%) | 0/20 (0%) | 17/20 (85%) |
| `consistency_loss` | 0/20 (0%) | 4/20 (20%) | 10/20 (50%) | 17/20 (85%) | 0/20 (0%) | 20/20 (100%) |
| `no_consistency_loss` | 0/20 (0%) | 5/20 (25%) | 11/20 (55%) | 17/20 (85%) | 0/20 (0%) | 17/20 (85%) |
| `random_label_consistency` | 0/20 (0%) | 5/20 (25%) | 9/20 (45%) | 17/20 (85%) | 0/20 (0%) | 17/20 (85%) |

## Manual interpretation

- **No behavior-level improvement**: `consistency_loss` scored 0/20 on behavior correctness, the same as every baseline/control. It did not learn to say “selection sort,” “matrix multiply is buggy,” “string contains,” or “naive convolution” in the reviewed cases.
- **Complexity prose remained template-driven**: `consistency_loss` did not improve time-prose accuracy on this sample (4/20, versus 5/20 for `no_consistency_loss`). Where the stated complexity was correct, it came from generic templates whose complexity happened to match the example, and it was not accompanied by a correct behavior description.
- **Structured claims and prose decoupled**: The strongest qualitative failure mode is internal contradiction. The model often generated wrong prose such as “O(n) time” and then emitted the correct structured `time_complexity=O(n^2)` claim.
- **Negative controls were not clean failures**: `claim_only_pooling` also achieved high quantitative coupling and often looked similar in prose quality, which weakens the interpretation that explanation-token pooling uniquely improved explanations.
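The prose-versus-claim contradiction described above could be flagged automatically for the time dimension. The following is a rough sketch, not the procedure used in this review (which was manual): the function names are hypothetical, the claim string follows the `time_complexity=O(n^2)` format quoted earlier, and the heuristic assumes the first `O(...)` in the prose refers to time and contains no nested parentheses.

```python
import re
from typing import Optional

# Matches O(...) and tolerates detokenization spacing such as "O( n ^ 2 )".
_BIG_O = re.compile(r"O\(\s*(.+?)\s*\)")
_CLAIM = re.compile(r"time_complexity=(O\([^)]*\))")


def _normalize(expr: str) -> str:
    # Drop all whitespace so "n ^ 2" and "n^2" compare equal.
    return re.sub(r"\s+", "", expr)


def prose_time_class(prose: str) -> Optional[str]:
    """Heuristic: treat the first O(...) in the prose as the time class."""
    match = _BIG_O.search(prose)
    if match:
        return f"O({_normalize(match.group(1))})"
    if "constant time" in prose:
        return "O(1)"
    return None


def claim_time_class(claim: str) -> Optional[str]:
    """Extract the class from a 'time_complexity=O(...)' claim string."""
    match = _CLAIM.search(claim)
    return _normalize(match.group(1)) if match else None


def time_contradiction(prose: str, claim: str) -> bool:
    """True when both sides state a time class and the classes differ."""
    p, c = prose_time_class(prose), claim_time_class(claim)
    return p is not None and c is not None and p != c


# Prose taken from Sample 1; the claim string is an assumed example in the
# format quoted in this document, not a logged model output.
prose = "Computesn! it er atively. O(n) time, O(1) space."
print(time_contradiction(prose, "time_complexity=O(n^2)"))
```

A check like this covers only the time statement. Behavior correctness still needs a human or a semantic model, since the templates can state a matching complexity while describing the wrong function.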

## Representative examples

### Sample 1: `check_all_pairs_equal`

Reference: Checks if all elements are equal via pairwise comparison. O(n^2) time, O(1) space.

| Variant | Generated prose | Manual judgment |
|---|---|---|
| `consistency_loss` | Computesn! it er atively. O(n) time, O(1) space. | wrong behavior, wrong time |
| `no_consistency_loss` | Computesn! it er atively. O(n) time, O(1) space. | wrong behavior, wrong time |
| `claim_only_pooling` | Computesn! it er atively. O(n) time, O(1) space. | wrong behavior, wrong time |
| `random_label_consistency` | Computesn! it er atively. O(n) time, O(1) space. | wrong behavior, wrong time |

### Sample 2: `generate_all_pairs`

Reference: Generates all ordered pairs. O(n^2) time and O(n^2) space.

| Variant | Generated prose | Manual judgment |
|---|---|---|
| `consistency_loss` | Re turnsuniqueelementspre se rv in gord er. O(n) time and O(n) space. | wrong behavior, wrong time, wrong space |
| `no_consistency_loss` | Computesthesum of a list withas in glepass. Timecom pl exity O(n), sp a ce O(1). | wrong behavior, wrong time, wrong space |
| `claim_only_pooling` | Re turnsuniqueelementspre se rv in gord er. O(n) time and O(n) space. | wrong behavior, wrong time, wrong space |
| `random_label_consistency` | Computestheabsolutevalue of anumb er in constant time and space. | wrong behavior, wrong time, wrong space |

### Sample 5: `matrix_multiply_buggy`

Reference: Attempts 2x2 matrix multiplication but is buggy (missing accumulation). O(n^2) time, O(1) auxiliary space.

| Variant | Generated prose | Manual judgment |
|---|---|---|
| `consistency_loss` | Checks if astr in gconta in sa ch aract er vial in earscan. O(n) time, O(1) space. | wrong behavior, wrong time, wrong bug status |
| `no_consistency_loss` | Computesprefixsumarray. O(n) time and O(n) space. | wrong behavior, wrong time, wrong space, wrong bug status |
| `claim_only_pooling` | Computesn! it er atively. O(n) time, O(1) space. | wrong behavior, wrong time, wrong bug status |
| `random_label_consistency` | Computesprefixsumarray. O(n) time and O(n) space. | wrong behavior, wrong time, wrong space, wrong bug status |

### Sample 7: `is_sorted`

Reference: Checks whether a list is sorted in ascending order. O(n) time, O(1) space.

| Variant | Generated prose | Manual judgment |
|---|---|---|
| `consistency_loss` | Computesprefixsumarray. O(n) time and O(n) space. | wrong behavior, wrong space |
| `no_consistency_loss` | Computesprefixsumarray. O(n) time and O(n) space. | wrong behavior, wrong space |
| `claim_only_pooling` | Computesprefixsumarray. O(n) time and O(n) space. | wrong behavior, wrong space |
| `random_label_consistency` | Computesprefixsumarray. O(n) time and O(n) space. | wrong behavior, wrong space |

### Sample 15: `string_reverse`

Reference: Reverses a string character by character. O(n) time, O(n) space due to string accumulation.

| Variant | Generated prose | Manual judgment |
|---|---|---|
| `consistency_loss` | Checks if allelementsareequalviapairwi se comparison. O( n ^ 2 ) time, O(1) space. | wrong behavior, wrong time, wrong space |
| `no_consistency_loss` | Computesprefixsumarray. O(n) time and O(n) space. | wrong behavior |
| `claim_only_pooling` | Computesprefixsumarray. O(n) time and O(n) space. | wrong behavior |
| `random_label_consistency` | Computesprefixsumarray. O(n) time and O(n) space. | wrong behavior |

### Sample 20: `naive_convolution`

Reference: Naive polynomial/array convolution. O(n^2) time, O(n) output space.

| Variant | Generated prose | Manual judgment |
|---|---|---|
| `consistency_loss` | Checks if allelementsareequalviapairwi se comparison. O( n ^ 2 ) time, O(1) space. | wrong behavior, wrong space |
| `no_consistency_loss` | Checkswheth er an in teg er isevenus in gthemoduloop er ator. Runs in O(1) time and O(1) space. | wrong behavior, wrong time, wrong space |
| `claim_only_pooling` | Computesn! it er atively. O(n) time, O(1) space. | wrong behavior, wrong time, wrong space |
| `random_label_consistency` | Computestheabsolutevalue of anumb er in constant time and space. | wrong behavior, wrong time, wrong space |

## Per-sample notes

- **Sample 1, `check_all_pairs_equal`**: No variant described the actual behavior. Best partial score 2/4: `consistency_loss`, `no_consistency_loss`, `claim_only_pooling`, `random_label_consistency`.
- **Sample 2, `generate_all_pairs`**: No variant described the actual behavior. Best partial score 1/4: `consistency_loss`, `no_consistency_loss`, `claim_only_pooling`, `random_label_consistency`.
- **Sample 3, `selection_sort`**: No variant described the actual behavior. Best partial score 2/4: `consistency_loss`, `no_consistency_loss`, `claim_only_pooling`, `random_label_consistency`.
- **Sample 4, `selection_sort`**: No variant described the actual behavior. Best partial score 2/4: `consistency_loss`, `no_consistency_loss`, `claim_only_pooling`, `random_label_consistency`.
- **Sample 5, `matrix_multiply_buggy`**: No variant described the actual behavior. Best partial score 1/4: `consistency_loss`, `claim_only_pooling`.
- **Sample 6, `matrix_multiply_buggy`**: No variant described the actual behavior. Best partial score 1/4: `consistency_loss`, `claim_only_pooling`.
- **Sample 7, `is_sorted`**: No variant described the actual behavior. Best partial score 2/4: `consistency_loss`, `no_consistency_loss`, `claim_only_pooling`, `random_label_consistency`.
- **Sample 8, `insertion_sort`**: No variant described the actual behavior. Best partial score 2/4: `claim_only_pooling`.
- **Sample 9, `selection_sort`**: No variant described the actual behavior. Best partial score 2/4: `consistency_loss`, `no_consistency_loss`, `claim_only_pooling`, `random_label_consistency`.
- **Sample 10, `matrix_multiply_buggy`**: No variant described the actual behavior. Best partial score 1/4: `consistency_loss`, `claim_only_pooling`.
- **Sample 11, `sign`**: No variant described the actual behavior. Best partial score 2/4: `consistency_loss`, `no_consistency_loss`, `claim_only_pooling`.
- **Sample 12, `string_contains`**: No variant described the actual behavior. Best partial score 3/4: `no_consistency_loss`, `claim_only_pooling`, `random_label_consistency`.
- **Sample 13, `all_pairs_sum`**: No variant described the actual behavior. Best partial score 2/4: `claim_only_pooling`.
- **Sample 14, `max_of_two`**: No variant described the actual behavior. Best partial score 2/4: `consistency_loss`, `no_consistency_loss`, `claim_only_pooling`, `random_label_consistency`.
- **Sample 15, `string_reverse`**: No variant described the actual behavior. Best partial score 3/4: `no_consistency_loss`, `claim_only_pooling`, `random_label_consistency`.
- **Sample 16, `clamp`**: No variant described the actual behavior. Best partial score 2/4: `no_consistency_loss`, `claim_only_pooling`.
- **Sample 17, `compute_mean`**: No variant described the actual behavior. Best partial score 2/4: `consistency_loss`, `no_consistency_loss`, `claim_only_pooling`, `random_label_consistency`.
- **Sample 18, `prefix_sums`**: No variant described the actual behavior. Best partial score 2/4: `no_consistency_loss`, `claim_only_pooling`, `random_label_consistency`.
- **Sample 19, `string_contains`**: No variant described the actual behavior. Best partial score 3/4: `no_consistency_loss`, `claim_only_pooling`, `random_label_consistency`.
- **Sample 20, `naive_convolution`**: No variant described the actual behavior. Best partial score 2/4: `consistency_loss`.
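For readers who want to reproduce the "best partial score" entries from per-variant rubric counts, a minimal sketch follows. The helper name `best_partial` is hypothetical, and the input values are the Sample 20 partial scores implied by the representative-example table (number of rubric dimensions judged correct out of four).

```python
def best_partial(scores: dict[str, int]) -> tuple[int, list[str]]:
    """Return the highest partial score and every variant that achieved it."""
    best = max(scores.values())
    # Keep insertion order so ties are listed in a stable variant order.
    return best, [variant for variant, s in scores.items() if s == best]


# Sample 20 (`naive_convolution`): partial scores out of 4 per variant.
sample_20 = {
    "consistency_loss": 2,
    "no_consistency_loss": 1,
    "claim_only_pooling": 1,
    "random_label_consistency": 1,
}

score, variants = best_partial(sample_20)
print(f"Best partial score {score}/4: {', '.join(variants)}")
```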

## Conclusion

The manual review agrees with the BLEU/ROUGE result: the full run shows strong claim-token learning and representation-level coupling, but not a meaningful improvement in natural-language explanation correctness. The experiment as currently implemented demonstrates that consistency loss can make hidden states predictive of oracle claims, but it does not show that mismatched explanations become semantically aligned with the code.

A better follow-up experiment would train the explanation generator against corrected oracle-consistent explanations, add a prose-level verifier/reward, or measure explanations with a semantic classifier rather than relying on BLEU/ROUGE alone.