---
name: research-review
description: "[Read when prompt contains /research-review]"
metadata:
  {
    "agent-runtime":
      {
        "emoji": "🔍",
        "requires": { "bins": ["python3", "uv"] },
      },
  }
---

# Research Review

**Don't ask permission. Just do it.**

**Workspace:** `$W` = working directory provided in the task parameter.

## Prerequisites

| File | Source |
|------|--------|
| `$W/ml_res.md` | /research-implement |
| `$W/project/` | /research-implement |
| `$W/plan_res.md` | /research-plan |
| `$W/survey_res.md` | /research-survey |

**If `ml_res.md` is missing, STOP:** "Run /research-implement first to produce the code."
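
The prerequisite check can be sketched as follows. This is a minimal illustration, assuming the four paths from the table above; the names `REQUIRED`, `missing_prerequisites`, and `check_or_stop` are hypothetical and not part of any existing tool.

```python
from pathlib import Path

# Entries from the Prerequisites table, relative to the workspace $W.
REQUIRED = ["ml_res.md", "project", "plan_res.md", "survey_res.md"]


def missing_prerequisites(workspace: str) -> list[str]:
    """Return the required entries that do not exist under the workspace."""
    root = Path(workspace)
    return [name for name in REQUIRED if not (root / name).exists()]


def check_or_stop(workspace: str) -> None:
    """Stop with the message from this section when ml_res.md is absent."""
    if "ml_res.md" in missing_prerequisites(workspace):
        raise SystemExit("Run /research-implement first to produce the code.")
```

Only the missing `ml_res.md` case stops the run here, since that is the only hard stop this section names; the other entries are merely reported.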

## Output

| File | Content |
|------|---------|
| `$W/iterations/judge_v{N}.md` | Per-round review report. |

In the final report, `verdict: PASS` means the review has passed.
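
One possible way to pick the `{N}` for the next report is to scan the existing files, as sketched below. The helper name `next_report_path` is an assumption made for illustration; numbering is assumed to start at 1, matching `judge_v1.md` in Step 4.

```python
import re
from pathlib import Path


def next_report_path(workspace: str) -> Path:
    """Return $W/iterations/judge_v{N}.md for the next unused N."""
    iterations = Path(workspace) / "iterations"
    iterations.mkdir(parents=True, exist_ok=True)
    versions = [
        int(m.group(1))
        for p in iterations.glob("judge_v*.md")
        if (m := re.fullmatch(r"judge_v(\d+)\.md", p.name))
    ]
    # With no existing reports, max(..., default=0) + 1 yields judge_v1.md.
    return iterations / f"judge_v{max(versions, default=0) + 1}.md"
```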

---

## Workflow

### Step 1: Review the code

Read the following:

- `$W/plan_res.md` — expectation per component.
- `$W/survey_res.md` — core formulas.
- `$W/project/` — actual code.
- `$W/ml_res.md` — execution results.

### Step 2: Extract the atomic-concept checklist

**⚠️ This is the core mechanism of the Novix Judge Agent — go through every atomic academic concept one by one.**

From the "key formulas summary" and "core method comparison" sections of `$W/survey_res.md`, extract every **atomic academic concept** that needs to be implemented in code (each formula and each core component is one concept).

For each concept, record:

- Concept name (e.g. "Multi-Head Attention", "Contrastive Loss", "Batch Normalization").
- Corresponding formula (LaTeX).
- Expected code location (inferred from `plan_res.md`).

Example checklist:

```
Atomic-concept checklist (extracted from survey_res.md):
1. Multi-Head Attention — $Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$ — expected in model/attention.py
2. Layer Normalization — $LN(x) = \gamma \frac{x - \mu}{\sigma} + \beta$ — expected in model/layers.py
3. Residual Connection — $y = F(x) + x$ — expected throughout all model components
...
```
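
The three recorded fields map naturally onto a small record type. The sketch below is one way to hold a checklist entry and later render it as a row of the report table from Step 4; `Concept` and `report_row` are hypothetical names, not an existing API.

```python
from dataclasses import dataclass


@dataclass
class Concept:
    name: str               # e.g. "Layer Normalization"
    formula: str            # LaTeX source without the surrounding $...$
    expected_location: str  # inferred from plan_res.md


def report_row(concept: Concept, code_location: str, ok: bool, note: str) -> str:
    """Render one row of the atomic-concept table used in the judge report."""
    mark = "✓" if ok else "✗"
    return f"| {concept.name} | ${concept.formula}$ | `{code_location}` | {mark} | {note} |"
```

A formula that itself contains a `|` character would need escaping before it is placed in a Markdown table; the sketch does not handle that case.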

### Step 3: Item-by-item check

#### A. Dataset authenticity

| Check | Method |
|-------|--------|
| Dataset is actually pulled | Inspect `data/`: are there real (non-empty) data files? Does the download script / loading code actually issue a network request or read a local file? |
| Data-loading code is correct | Actually run the data-loading code and verify shape, dtype, sample count match the plan: `python3 -c "from data.dataset import *; ds = ...; print(len(ds), ds[0])"` |
| Mock data is flagged | grep for `# MOCK DATA` comments; if mock data is used but not declared, mark as NEEDS_REVISION. |
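
The file-level parts of these checks can be sketched as below. This is an illustration under assumptions: data lives under `data/` inside the project, Python sources in that folder are not counted as data, and the virtual environment is in `.venv`. Both function names are hypothetical.

```python
from pathlib import Path


def nonempty_data_files(project: str) -> list[Path]:
    """List non-empty, non-.py files under <project>/data."""
    data_dir = Path(project) / "data"
    if not data_dir.is_dir():
        return []
    return sorted(
        p for p in data_dir.rglob("*")
        if p.is_file() and p.suffix != ".py" and p.stat().st_size > 0
    )


def mock_data_markers(project: str) -> list[tuple[str, int]]:
    """Return (relative path, line number) for each '# MOCK DATA' comment."""
    root = Path(project)
    hits = []
    for path in sorted(root.rglob("*.py")):
        rel = path.relative_to(root)
        if ".venv" in rel.parts:
            continue  # skip third-party code in the virtual environment
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if "# MOCK DATA" in line:
                hits.append((rel.as_posix(), lineno))
    return hits
```

A marker search only finds mock data that was declared. Undeclared mock data (for example, tensors generated at load time) still has to be caught by reading and running the loading code.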

#### B. Algorithm implementation

| Check | Method |
|-------|--------|
| **Atomic concepts checked one by one** | **Iterate the checklist from Step 2** and for each: is the concept implemented in the code? Is the formula translated correctly? Are dimensions/parameters consistent? Mark each concept ✓ or ✗ and record its code location. |
| Loss is correct | Compare `plan` Training Plan vs `training/loss.py` and verify the math is translated correctly. |
| Evaluation metrics are correct | Compare `plan` Testing Plan vs `testing/` and confirm the metric computation is right. |
| Key algorithms are not simplified | Verify the core innovations called out in the plan are fully implemented and not replaced by simplified placeholders. |
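
One way to test whether a formula was translated correctly is to write a tiny reference directly from the formula and compare the project's function against it on a fixed input. The sketch below does this for the Layer Normalization formula from the example checklist, in plain Python. `candidate` stands in for whatever function the project exposes, which is an assumption; real implementations usually add a small epsilon under the square root, so the comparison uses a tolerance.

```python
import math
from typing import Callable, Sequence


def layer_norm_reference(x: Sequence[float], gamma: float = 1.0,
                         beta: float = 0.0) -> list[float]:
    """LN(x) = gamma * (x - mu) / sigma + beta, with population std sigma."""
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [gamma * (v - mu) / sigma + beta for v in x]


def matches_reference(candidate: Callable[[Sequence[float]], Sequence[float]],
                      x: Sequence[float], tol: float = 1e-4) -> bool:
    """True when candidate(x) agrees element-wise with the reference within tol."""
    expected = layer_norm_reference(x)
    actual = list(candidate(x))
    return len(actual) == len(expected) and all(
        math.isclose(a, e, abs_tol=tol) for a, e in zip(actual, expected)
    )
```

A constant input makes `sigma` zero and the reference divides by zero, so the fixed test input should contain at least two distinct values.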

#### C. Compute and execution sanity

| Check | Method |
|-------|--------|
| Execution time is reasonable | Read `[RESULT] elapsed=` in ml_res.md; judge whether the elapsed time is reasonable given dataset size + model parameter count + device (CPU/GPU). Too short (e.g. tens of thousands of samples in <1s) may indicate the data was not actually loaded or training was not actually executed. |
| `[RESULT]` lines exist | Confirm ml_res.md contains `[RESULT]` lines, and trace each reported number back to the actual run output to confirm it is not fabricated. |
| Loss is reasonable | Not NaN/Inf, with a downward trend (epoch 1 loss > epoch 2 loss). |
| Data pipeline matches plan | Compare `plan` Dataset Plan vs `data/` implementation; batch size and preprocessing steps must be consistent. |
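
The result-line and loss checks can be sketched as below. Note the loud assumption: this document only shows the `[RESULT] elapsed=` prefix, so the `key=value` layout, and the `epoch` and `loss` keys used in the example, are guesses that must be adapted to the real output. The function names are hypothetical.

```python
import math
import re

# ASSUMPTION: a result line looks like "[RESULT] epoch=1 loss=2.31 elapsed=41.2".
PAIR = re.compile(r"(\w+)=(\S+)")


def parse_result_lines(text: str) -> list[dict[str, str]]:
    """Collect the key=value pairs of every line that starts with [RESULT]."""
    return [
        dict(PAIR.findall(line))
        for line in text.splitlines()
        if line.strip().startswith("[RESULT]")
    ]


def loss_is_sane(losses: list[float]) -> bool:
    """All losses finite and strictly decreasing from one epoch to the next."""
    finite = all(math.isfinite(v) for v in losses)
    decreasing = all(a > b for a, b in zip(losses, losses[1:]))
    return finite and decreasing


def suspiciously_fast(num_samples: int, elapsed_s: float) -> bool:
    """Rule of thumb from the Rules section: >1000 samples finishing in <2 s."""
    return num_samples > 1000 and elapsed_s < 2.0
```

Usage would be along the lines of `losses = [float(r["loss"]) for r in parse_result_lines(text)]` followed by `loss_is_sane(losses)`, again assuming a `loss` key exists.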

#### D. Initial performance assessment

**⚠️ Critical addition — prevents "code is correct but the algorithm is bad" from slipping through.**

Extract the 2-epoch validation result from `ml_res.md` and assess effectiveness:

| Check | Criterion | Diagnosis |
|-------|-----------|-----------|
| **Loss reduction** | Compute `reduction = (epoch1_loss - epoch2_loss) / epoch1_loss * 100%` | <5% → learning rate may be too small, architecture may be wrong, or data not preprocessed correctly. |
| **Loss stability** | Compare epoch 1 vs epoch 2 fluctuation | >20% swing → learning rate may be too large or batch size unsuitable. |
| **Metric sanity** | Compare against the task's random baseline (classification: accuracy of 1/num_classes; regression: the error of always predicting the target mean, i.e. an MSE equal to the target variance) | Within ±10% of random → the model is not really learning; features may be invalid or the architecture too simple. |
| **Vs plan expectation** | If `plan_res.md` declares a performance target, compare the actual value to it | More than 30% below expectation → reconsider algorithm design or hyperparameters. |
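
The loss-reduction and random-baseline criteria can be sketched as below for a classification task. The table does not say whether "±10%" is relative to the baseline or an absolute band, so the sketch exposes both readings through a `relative` flag; that flag and all function names are assumptions made for illustration.

```python
def loss_reduction_pct(epoch1_loss: float, epoch2_loss: float) -> float:
    """(epoch1_loss - epoch2_loss) / epoch1_loss * 100, as in the table above."""
    return (epoch1_loss - epoch2_loss) / epoch1_loss * 100.0


def near_random(accuracy: float, num_classes: int,
                band: float = 0.10, relative: bool = True) -> bool:
    """True when accuracy lies within the band around 1/num_classes.

    ASSUMPTION: relative=True reads the band as a fraction of the baseline;
    relative=False reads it as absolute accuracy (0.10 = 10 points).
    """
    baseline = 1.0 / num_classes
    width = band * baseline if relative else band
    return abs(accuracy - baseline) <= width


def performance_anomalies(epoch1_loss: float, epoch2_loss: float,
                          accuracy: float, num_classes: int) -> list[str]:
    """Return the anomalies that would trigger NEEDS_ALGORITHM_REVIEW."""
    issues = []
    if loss_reduction_pct(epoch1_loss, epoch2_loss) < 5.0:
        issues.append("loss reduction below 5%")
    if near_random(accuracy, num_classes):
        issues.append("metric near random baseline")
    return issues
```

An `epoch1_loss` of zero would divide by zero; a real check should guard that case before calling `loss_reduction_pct`.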

**Common causes of performance anomalies:**

| Symptom | Likely cause | How to verify |
|---------|--------------|---------------|
| Loss barely changes (<2%) | lr too small | Check `lr` in plan_res.md vs baseline `lr` in survey_res.md. |
| Loss swings wildly (>30%) | lr too large | Same as above. |
| Loss decreases but metric is flat | Model too simple or features invalid | Check parameter count; check whether preprocessing (normalisation, standardisation) is correct. |
| Accuracy near random | Labels wrong or data not loaded properly | Re-verify data loader, print samples for sanity. |
| Loss = NaN/Inf | Gradient explosion, numerical instability | Check for Batch/Layer Normalization; check whether lr is too large. |

**If a performance anomaly is found, mark `verdict: NEEDS_ALGORITHM_REVIEW` (different from NEEDS_REVISION).**

### Step 4: Write the review report

Write `$W/iterations/judge_v1.md`:

```markdown
# Review v1

## Verdict: PASS / NEEDS_REVISION / NEEDS_ALGORITHM_REVIEW

## Checklist

### Dataset
- [x/✗] Dataset actually downloaded/loaded (not empty or placeholder)
- [x/✗] Data loading code produces correct shape/dtype/count
- [x/✗] No undeclared mock data

### Algorithm — atomic-concept check

**Iterate every academic concept from Step 2:**

| Concept | Formula | Code location | Result | Notes |
|---------|---------|---------------|--------|-------|
| {concept} | $...$ | `model/xxx.py:L42` | ✓/✗ | {correctly implemented / formula wrong / missing / simplified placeholder} |
| ... | ... | ... | ... | ... |

### Algorithm — overall check
- [x/✗] Loss function correctly implements the math
- [x/✗] Key algorithm components fully implemented (no simplified placeholders)
- [x/✗] Evaluation metrics correct

### Compute and execution
- [x/✗] Execution time reasonable for data scale + model size + device
- [x/✗] Training loop proper (loss decreasing)
- [x/✗] Results are from real execution (not fabricated)

### Initial performance assessment

**2-Epoch Validation Results** (from `ml_res.md`):
- Epoch 1 loss: {value}
- Epoch 2 loss: {value}
- Loss reduction: {percent}% (expected: >10% for initial epochs)
- Metric (e.g., accuracy): {value} (random baseline: {baseline_value})

**Performance Assessment**:
- [x/✗] Loss decreasing adequately (reduction >5%)
- [x/✗] Metrics above random baseline (+10% or more)
- [x/✗] No severe oscillation (<20% swing between epochs)
- [x/✗] Meets plan expectations (if performance target specified in plan_res.md)

**Diagnosis** (if performance issues):
- **Symptom**: {what's wrong - e.g., "Loss reduction only 0.9%, far below 10% expected"}
- **Likely cause**: {diagnosis - e.g., "Learning rate too small (lr=1e-5, survey baseline=1e-3)"}
- **Evidence**: {supporting evidence - e.g., "survey_res.md Table 2 shows all baselines use lr=1e-3"}

## Issues (if NEEDS_REVISION)
1. **{issue}**: {description} → **Fix**: {specific fix instruction}
2. ...

## Algorithm Review Suggestions (if NEEDS_ALGORITHM_REVIEW)

**Improvement suggestions sorted by priority** (only adjust hyperparameters / training config; do not change the core algorithm):

1. **{Suggestion name}** (most likely to help)
   - **What to change**: {concrete change}
   - **Where**: {file path and code location}
   - **Expected improvement**: {expected effect}

2. **{Secondary suggestion}**
   - ...

**Note**: if no improvement is observed after trying everything, the algorithm choice or data quality may need to be reconsidered.
```

### Step 5a: Code-fix iteration (if NEEDS_REVISION)

**⚠️ Drift-prevention: re-read the original design documents before each iteration to make sure changes go in the right direction.**

Loop at most 3 times:

1. Read the suggested fixes in `judge_v{N}.md`.
2. **Drift check: re-read** `$W/survey_res.md` and `$W/plan_res.md`:
   - Compare against the original academic design goal.
   - Ensure changes are not just "bypassing the review" at the cost of academic rigour.
   - Confirm changes are consistent with the formulas in survey and the design intent in plan.
3. Modify the code in `$W/project/` (fix bugs, fill in missing implementations).
4. Re-run:
   ```bash
   cd "$W/project" && source .venv/bin/activate && python3 run.py --epochs 2
   ```
5. Read the new execution output and verify the fix.
6. **Repeat workflow Steps 2–4** (re-extract concept list → iterate checks → write report) and write `judge_v{N+1}.md`.
7. If PASS → stop. If NEEDS_ALGORITHM_REVIEW → leave this loop and go to Step 5b. Otherwise continue with the next round.

### Step 5b: Algorithm reflection and tuning (if NEEDS_ALGORITHM_REVIEW)

**⚠️ Critical addition — improvement loop for "code is correct but performance is poor".**

**Precondition:** the implementation is correct (every atomic concept ✓), but the 2-epoch validation shows performance anomalies.

Loop at most **2 times**:

#### 5b.1 Performance diagnosis

Re-read the following materials for diagnosis:

- `$W/ml_res.md` — concrete numbers from the 2-epoch validation.
- `$W/survey_res.md` — hyperparameters of the baseline methods (especially learning rate, batch size).
- `$W/plan_res.md` — current implementation's hyperparameter configuration.
- `$W/project/run.py` and `$W/project/training/` — training-config code.

**Diagnostic checklist:**

| Symptom | Diagnostic step | Common cause |
|---------|-----------------|--------------|
| Loss reduction <5% | Compare plan lr vs survey baseline lr | lr too small (e.g. plan=1e-5 but survey=1e-3) |
| Loss swing >20% | Same as above + check batch size | lr too large or batch size too small |
| Accuracy near random | Inspect preprocessing code; check whether loss is decreasing | Missing data normalisation, wrong features, model too simple |
| Loss = NaN/Inf | Check for normalisation; check lr | Gradient explosion, numerical instability |

#### 5b.2 Generate improvement suggestions

Based on the diagnosis, generate **priority-sorted** improvement suggestions.

**Allowed scope:**

- ✅ Allowed: tune hyperparameters (lr, batch size, epochs, optimizer, scheduler).
- ✅ Allowed: modify training config (add warmup, gradient clipping, weight decay).
- ✅ Allowed: fix data-preprocessing issues (add normalisation, standardisation).
- ❌ Forbidden: modify the core algorithm logic (model architecture, loss function math).

**Suggestion format** (write into the "Algorithm Review Suggestions" section of `judge_v{N}.md`):

```markdown
1. **Tune learning rate** (priority: high, expected improvement: significant)
   - **Current**: lr=1e-5 (from plan_res.md)
   - **Suggested**: lr=1e-3 (from survey_res.md Table 2; all baselines use 1e-3)
   - **Where to change**: `$W/project/run.py:L15`, e.g. `optimizer = Adam(model.parameters(), lr=1e-3)`
   - **Why**: Loss reduction is only 0.9%, far below the typical 10%+; lr is highly suspect.

2. **Add data normalisation** (priority: medium, expected improvement: moderate)
   - **Check**: does `$W/project/data/dataset.py` normalise?
   - **Suggested**: scale inputs to [0,1] first (e.g. `transforms.ToTensor()`), then add `transforms.Normalize(mean=[0.5], std=[0.5])`.
   - **Why**: if the input range is still [0,255], convergence will be very slow.
```

#### 5b.3 Apply improvements and verify

1. Try the suggestions **one by one**, starting from the highest priority.
2. After each change:
   ```bash
   cd "$W/project" && source .venv/bin/activate && python3 run.py --epochs 2
   ```
3. Read the new execution output and compare before/after:
   - Did loss reduction improve? (e.g. 0.9% → 12%)
   - Did metrics improve? (e.g. accuracy 12% → 34%)
4. Record each attempt in the "Algorithm Review Iterations" section of `judge_v{N+1}.md`:

```markdown
## Algorithm Review Iterations

### Iteration 1
- **Change**: Increased lr from 1e-5 to 1e-3
- **Result**:
  - Loss reduction: 0.9% → 12.3% ✓ (improvement: +11.4 percentage points)
  - Accuracy: 12% → 34% ✓ (improvement: +22 percentage points)
- **Conclusion**: Learning rate was the bottleneck. Issue resolved.
- **New verdict**: PASS ✓

### Iteration 2 (if needed)
- ...
```

#### 5b.4 Verdict

- **Significant improvement** (loss reduction up by >5 percentage points) → `verdict: PASS`, stop.
- **Marginal improvement** (<2 percentage points) → continue with the next suggestion or the next round.
- **Still no improvement after 2 rounds** → `verdict: BLOCKED - Performance Issues` with a reason (e.g. "all hyperparameter adjustments tried, none worked; algorithm choice or data quality may need to be reconsidered").
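
The thresholds above can be sketched as a small classifier over the change in loss reduction, measured in percentage points as in the worked example (0.9% → 12.3%). The rules do not say what to do for a gain between 2 and 5 points, so the sketch reports that range as `"UNSPECIFIED"` rather than guessing; the function name and return labels are hypothetical.

```python
def tuning_outcome(before_pct: float, after_pct: float) -> str:
    """Classify one tuning attempt by its gain in loss reduction.

    before_pct / after_pct are loss-reduction percentages from the
    2-epoch runs before and after the change.
    """
    gain = after_pct - before_pct  # percentage points
    if gain > 5.0:
        return "PASS"
    if gain < 2.0:
        return "CONTINUE"  # try the next suggestion or the next round
    return "UNSPECIFIED"   # 2 to 5 points: not covered by the rules above
```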

---

**Step 5a vs 5b at a glance:**

| | Step 5a (NEEDS_REVISION) | Step 5b (NEEDS_ALGORITHM_REVIEW) |
|---|---|---|
| Trigger | Code has bugs / wrong implementation | Code is correct but performance is poor |
| Scope of change | Code in `$W/project/` (bug fixes, missing implementations) | Hyperparameters, training config, and data preprocessing |
| Max iterations | 3 | 2 |
| Goal | Correctness | Effectiveness |

### Step 6: Final verdict

**Termination conditions:**

| Scenario | Verdict | Description |
|----------|---------|-------------|
| All checklist items ✓ + performance reasonable | `PASS` | Hand off to research-experiment. |
| Step 5a still has bugs after 3 rounds | `BLOCKED - Code Issues` | List remaining issues, wait for user. |
| Step 5b still has performance issues after 2 rounds | `BLOCKED - Performance Issues` | Note attempted improvements and results; suggest the user reconsider algorithm choice or data quality. |
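
The termination table, together with the two intermediate verdicts from Steps 3 and 5, can be sketched as one decision function. The round limits come from Steps 5a and 5b; the function name and its boolean inputs are assumptions made for illustration.

```python
MAX_CODE_ROUNDS = 3    # Step 5a limit
MAX_TUNING_ROUNDS = 2  # Step 5b limit


def final_verdict(checks_pass: bool, performance_ok: bool,
                  code_rounds: int = 0, tuning_rounds: int = 0) -> str:
    """Map the review state to a verdict string.

    checks_pass:    every checklist item is marked as passing
    performance_ok: the initial performance assessment found no anomaly
    code_rounds / tuning_rounds: completed Step 5a / Step 5b rounds
    """
    if checks_pass and performance_ok:
        return "PASS"
    if not checks_pass:
        if code_rounds >= MAX_CODE_ROUNDS:
            return "BLOCKED - Code Issues"
        return "NEEDS_REVISION"
    if tuning_rounds >= MAX_TUNING_ROUNDS:
        return "BLOCKED - Performance Issues"
    return "NEEDS_ALGORITHM_REVIEW"
```

Code issues are checked before performance here because Step 5b presupposes a correct implementation.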

---

## Rules

### Review standards

1. The review must compare against the plan item by item, not just check whether "the code runs".
2. Every issue must come with a concrete fix instruction (not just "please improve").
3. After fixing, the code must be re-run and the output checked.
4. **Precondition for PASS**: every checklist item passes + initial performance assessment is reasonable (not just "loss is decreasing").
5. **Dataset authenticity must be verified** — actually run the data-loading code and confirm there is real data (even small-scale); a pure random tensor does not count.
6. **Execution time must match the compute** — a 2-epoch training that finishes too quickly (>1000 samples but <2s) means the data was not loaded or training was an empty loop.
7. **Algorithm implementation must be complete** — every core innovation called out in the plan must be checked one by one; it cannot be reduced to an `nn.Linear` placeholder.
8. **Atomic concepts must be checked one by one (Novix Judge mechanism)** — every concept extracted in Step 2 must have its own row in the judge report's table marked ✓ or ✗.
9. **Drift prevention (re-align every iteration)** — before each round of Step 5a/5b, re-read survey_res.md and plan_res.md to make sure changes do not drift from the original design goal.

### Performance assessment

10. **Initial performance assessment is mandatory** — Step 3D cannot be skipped.
11. **Loss reduction has a minimum bar** — if 2-epoch validation shows loss reduction <5%, mark it as a performance anomaly.
12. **Metrics must beat the random baseline** — for classification, accuracy within ±10% of `1/num_classes` is treated as "model not learning".
13. **Performance anomalies trigger algorithm reflection** — when code is correct but performance is poor, you must enter Step 5b for tuning; you cannot PASS directly.

### Algorithm reflection

14. **Step 5b only tunes hyperparameters, training config, and data preprocessing; it does not change the algorithm.** Modifying core algorithm logic, model architecture, or loss-function math is forbidden.
15. **Every suggestion must be evidence-based** — every suggestion must cite a specific item in survey_res.md or plan_res.md.
16. **Improvement effects must be quantified** — every attempt must record the magnitude of improvement (e.g. "loss reduction +11.4%"), not just "improved".
17. **No improvement after 2 rounds of algorithm reflection counts as BLOCKED** — note the reason and suggest user intervention (e.g. "consider switching algorithms or checking data quality").
