# Autocomp Baseline Comparisons

Purpose: establish whether Autocomp's structured search (beam + strategy menu + dropout + feedback) is actually necessary, or whether simpler approaches with equal LLM budget produce equivalent results.

## Benchmark subset

All methods start from the **JAX/XLA baseline** (`baseline.py`) and attempt to translate + optimize into Pallas. Hand-tuned Pallas kernels (where available) serve purely as performance references.

**5-kernel subset (Gemini 3.1 Pro):** A representative subset for detailed analysis and model capacity ablation. Selected for architectural diversity and difficulty range:

| Benchmark | Task | XLA (ms) | Autocomp (ms) | Autocomp speedup | Human Pallas ref | Why picked |
|---|---|---|---|---|---|---|
| `1p_Flash_Attention` | JAX→Pallas | 24.89 | 6.032 | 4.13x | — | Memory-bound attention; classic kernel optimization target |
| `12p_RMSNorm` | JAX→Pallas | 1.22 | 0.859 | 1.42x | — | Simple kernel; all methods succeed (control) |
| `5p_Flex_Attention` | JAX→Pallas | 36.08 | 9.295 | 3.88x | — | Attention variant; opt-iterations-dominated |
| `15p_RetNet_Retention` | JAX→Pallas | 12.77 | 1.926 | 6.63x | — | Emerging architecture; translation-dominated |
| `16p_Mamba2_SSD` | JAX→Pallas | 28.91 | 6.578 | 4.39x | — | Hard structural rewrite; baselines fail |

**Full 50 benchmarks (Gemini 3 Flash):** All 50 JAXBench workloads, run with the smaller model because time and cost constraints make the full suite impractical with Gemini 3.1 Pro.

## Budget

- **N = 144 samples** per baseline per benchmark. This matches Autocomp's full two-phase nominal budget: (4 translate iters + 4 opt iters) × 3 beam × 6 samples = 72 + 72 = 144.
- **Model:**
  - **5-kernel subset:** Gemini 3.1 Pro (all methods use the same model).
  - **Full 50:** Gemini 3 Flash (all methods use the same model).
- **Also report N=72** for Best-of-N as a mid-point, since Autocomp often uses fewer samples due to early stopping.
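
The budget arithmetic above can be checked with a short sketch. The constant and function names are illustrative and are not taken from the Autocomp codebase; only the four numbers come from this section.

```python
# Nominal Autocomp budget. The four numbers come from the Budget section;
# the names are hypothetical.
TRANSLATE_ITERS = 4
OPT_ITERS = 4
BEAM_WIDTH = 3
SAMPLES_PER_BEAM = 6


def phase_budget(iters: int, beam: int = BEAM_WIDTH, samples: int = SAMPLES_PER_BEAM) -> int:
    """Number of LLM samples one phase consumes if it never stops early."""
    return iters * beam * samples


translate_budget = phase_budget(TRANSLATE_ITERS)
opt_budget = phase_budget(OPT_ITERS)
total_budget = translate_budget + opt_budget

print(translate_budget, opt_budget, total_budget)  # 72 72 144
```

Each baseline gets the full 144 regardless of how many samples Autocomp actually spends on a given benchmark.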

## Baselines

### Baseline 1: Best-of-N (parallel sampling)

- Generate N independent samples in one shot from `xla_baseline.py`.
- Prompt: "Convert this JAX kernel to Pallas on TPU v6e. Make it faster than the baseline. Return only Python code in a ```python``` block." No menu, no feedback, no chain.
- Evaluate all N, report best correct latency.
- Tests: does raw sample diversity suffice?

**Files:**
- `autocomp/baselines/best_of_n.py` — CLI: `python -m autocomp.baselines.best_of_n --prob_id 5p_Flex_Attention --n 144 --model gemini-3.1-pro-preview --output_dir output/baselines/best_of_n/5p_Flex_Attention/`
- Reuses `autocomp.agents.llm_client.LLMClient` for async batch sampling.
- Reuses `autocomp.backend.jaxbench.JAXBenchEvaluator` for evaluation.
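
A minimal sketch of the Best-of-N selection rule, assuming an evaluator that returns a dict with `correct` and `latency` keys. `best_of_n`, `fake_sample`, and `fake_evaluate` are hypothetical names for illustration; the real script samples asynchronously in batches through `LLMClient`, whereas this sketch is serial and uses stubs so it runs without an LLM or a TPU.

```python
from typing import Callable, Optional


def best_of_n(
    prompt: str,
    n: int,
    sample: Callable[[str], str],
    evaluate: Callable[[str], dict],
) -> Optional[dict]:
    """Draw n independent samples and return the fastest correct one, or None."""
    best = None
    for i in range(n):
        code = sample(prompt)    # no feedback: every call sees the same prompt
        result = evaluate(code)  # assumed keys: "correct", "latency"
        if not result["correct"]:
            continue             # failed candidates still consume budget
        if best is None or result["latency"] < best["latency"]:
            best = {"index": i, "code": code, "latency": result["latency"]}
    return best


# Stubs standing in for the LLM and the evaluator (None means "incorrect").
_fake_latencies = iter([None, 2.0, 1.5, None, 1.8])


def fake_sample(prompt: str) -> str:
    return "candidate"


def fake_evaluate(code: str) -> dict:
    latency = next(_fake_latencies)
    return {"correct": latency is not None, "latency": latency}


winner = best_of_n("Convert this JAX kernel to Pallas...", 5, fake_sample, fake_evaluate)
print(winner["index"], winner["latency"])  # 2 1.5
```

If no sample is correct, the function returns `None`, which corresponds to the "0/144 correct" entries a results table would report.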

### Baseline 2: Iterative refinement (single chain)

- Start from `xla_baseline.py`.
- At each step, prompt the LLM with: current code, its measured latency (or error), and "make it faster."
- Beam=1, no menu, no dropout. Pass best-so-far to next step.
- N iterations total (serial).
- Tests: does structured search beat naive self-refinement?

**Files:**
- `autocomp/baselines/iterative.py` — CLI: `python -m autocomp.baselines.iterative --prob_id 5p_Flex_Attention --n 144 --output_dir ...`
- Per-step prompt includes: code, score, stderr (if failed), and "optimize further."
- Saves candidate + eval per iteration for resumability.
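
The per-step prompt could be assembled roughly as follows. This is a sketch under assumptions: `build_step_prompt` and the exact wording of each line are hypothetical, and only the ingredients (code, score, stderr on failure, and an "optimize further" request) come from the description above.

```python
from typing import Optional


def build_step_prompt(code: str, latency_ms: Optional[float], stderr: str = "") -> str:
    """Assemble one refinement turn: current code, its score or error, and the ask."""
    fence = "`" * 3  # built at runtime to avoid a literal fence inside this example
    parts = ["Current kernel:", f"{fence}python", code.rstrip(), fence]
    if latency_ms is not None:
        parts.append(f"Measured latency: {latency_ms:.3f} ms.")
        parts.append("Optimize further.")
    else:
        # Failed candidate: surface the error instead of a score.
        parts.append("The candidate failed. stderr:")
        parts.append(stderr.strip() or "(empty)")
        parts.append("Fix the error, then optimize further.")
    return "\n".join(parts)


ok_prompt = build_step_prompt("def kernel(x):\n    return x\n", 9.05)
fail_prompt = build_step_prompt("def kernel(x):\n    return y\n", None, "NameError: name 'y' is not defined")
print(ok_prompt)
```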

### Baseline 3: Autonomous agent (mini-swe-agent)

- Uses [mini-swe-agent](https://github.com/SWE-agent/mini-swe-agent), a lightweight coding agent where every LLM turn issues bash commands.
- Agent gets the same workspace as other baselines: `baseline.py`, `solution.py`, `eval.sh`, and `PROMPT.md`.
- Runs in an isolated temp directory (no access to autocomp codebase).
- `eval.sh` budget enforced by `eval_single.py` (same 144-eval cap as other baselines).
- `step_limit` on LLM turns provides a generous upper bound; the eval budget is the real constraint.
- Unlike the previous Gemini CLI baseline, every bash command the agent runs (syntax checks, local tests, etc.) counts as an LLM turn, ensuring fair action accounting.
- Tests: does an autonomous agent with unconstrained tool use outperform structured search?

**Files:**
- `autocomp/baselines/mini_swe_harness.py` — CLI: `python -m autocomp.baselines.mini_swe_harness --prob_id 12p_RMSNorm --budget 144 --output_dir output/baselines/mini_swe/12p_RMSNorm/`
- Uses `vertex_ai/gemini-3.1-pro-preview` via litellm by default.
- Reuses `eval_sh_template.sh`, `PROMPT_TEMPLATE.md`, and `eval_single.py` from the shared infrastructure.
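
One way to enforce an evaluation cap across separate `eval.sh` invocations is a file-backed counter. The sketch below is illustrative only: `EvalBudget`, its state file, and its JSON layout are assumptions, not the behavior of `eval_single.py`, and it is not safe under concurrent calls.

```python
import json
import tempfile
from pathlib import Path


class EvalBudget:
    """File-backed counter so the cap survives separate process invocations."""

    def __init__(self, state_path: Path, limit: int = 144):
        self.state_path = Path(state_path)
        self.limit = limit

    def used(self) -> int:
        if not self.state_path.exists():
            return 0
        return json.loads(self.state_path.read_text())["used"]

    def try_consume(self) -> bool:
        """Reserve one evaluation; return False once the cap is reached."""
        used = self.used()
        if used >= self.limit:
            return False
        self.state_path.write_text(json.dumps({"used": used + 1}))
        return True


# Demo with a tiny cap so the refusal is visible.
with tempfile.TemporaryDirectory() as tmp:
    budget = EvalBudget(Path(tmp) / "budget.json", limit=3)
    outcomes = [budget.try_consume() for _ in range(5)]

print(outcomes)  # [True, True, True, False, False]
```

Keeping the counter outside the agent's editable workspace would matter in practice, since an agent with shell access could otherwise reset it.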

## Shared infrastructure

`autocomp/baselines/common.py`:
- `load_xla_baseline(prob_id) -> str`
- `evaluate(prob_id, code) -> dict(correct, latency, stdout, stderr)` — thin wrapper over `JAXBenchEvaluator`
- `summarize_run(output_dir) -> dict` — aggregates per-candidate results into `summary.json`
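
A minimal sketch of the aggregation step, assuming per-candidate results are dicts with `correct` and `latency` keys. `summarize_results` is a hypothetical in-memory stand-in for `summarize_run`; it covers only the `n`, `n_correct`, and `best_latency` fields and makes no claim about how the real function reads result files from disk.

```python
import json
import tempfile
from pathlib import Path


def summarize_results(results: list[dict]) -> dict:
    """Aggregate per-candidate results into the fields written to summary.json."""
    correct = [r for r in results if r["correct"]]
    return {
        "n": len(results),
        "n_correct": len(correct),
        # None when nothing passed, so downstream code can treat it as unsolved.
        "best_latency": min((r["latency"] for r in correct), default=None),
    }


results = [
    {"correct": True, "latency": 0.91},
    {"correct": False, "latency": None},
    {"correct": True, "latency": 0.86},
]
summary = summarize_results(results)

# Round-trip through a temporary summary.json to show the on-disk shape.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "summary.json"
    path.write_text(json.dumps(summary, indent=2))
    reloaded = json.loads(path.read_text())

print(reloaded)  # {'n': 3, 'n_correct': 2, 'best_latency': 0.86}
```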

## Output layout

```
output/baselines/
├── best_of_n/
│   └── {prob_id}/
│       ├── candidates/candidate_{i}.txt      # generated code
│       ├── eval/code_{i}_result.txt          # {correct, latency, stderr}
│       ├── prompt.txt                        # the exact prompt used
│       └── summary.json                      # {n, n_correct, best_latency, runtime_s}
├── iterative/
│   └── {prob_id}/
│       ├── iter_{i}/candidate.txt
│       ├── iter_{i}/eval.txt
│       └── summary.json                      # trajectory + best
└── mini_swe/
    └── {prob_id}/
        ├── workspace/                         # isolated dir the agent edited
        ├── trajectory.jsonl                   # eval results per eval.sh call
        ├── mini_swe_trajectory.json           # full agent trajectory (messages)
        └── summary.json
```

## Results (5-kernel subset, Gemini 3.1 Pro)

All methods start from the JAX/XLA baseline and translate + optimize to Pallas. Budget: 144 samples per method per benchmark.

| Benchmark | XLA (ms) | Autocomp (ms) | Autocomp speedup | Best-of-N @144 | Iterative @144 | Agent (mini-swe) |
|---|---:|---:|---:|---:|---:|---:|
| Flash Attention | 24.89 | 6.032 | **4.13x** | 0/144 correct | 8.09 ms (3.07x), 25/144 correct | — |
| RMSNorm | 1.22 | 0.859 | **1.42x** | 0.86 ms (1.42x), 10/144 correct | 0.86 ms (1.42x), 67/144 correct | — |
| Flex Attention | 36.08 | 9.295 | **3.88x** | 0/144 correct | 9.05 ms (3.99x), 29/144 correct | — |
| RetNet Retention | 12.77 | 1.926 | **6.63x** | 0/144 correct | 2.63 ms (4.85x), 1/144 correct | — |
| Mamba-2 SSD | 28.91 | 6.578 | **4.39x** | 0/144 correct | 0/144 correct | — |

**Notes:**
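- All methods are compared against the same XLA baseline latency per benchmark (the value measured during the Iterative runs).
- Iterative was run as 18 parallel chains × 8 turns = 144 samples, with compile/correctness/profiler feedback, rather than as the single 144-step serial chain described under Baseline 2.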
- Flash Attention: Autocomp (6.03 ms, 4.13x) now beats Iterative (8.09 ms, 3.07x) after the rules/architecture prompt audit.
- Flex Attention: Autocomp (9.30 ms, 3.88x) and Iterative (9.05 ms, 3.99x) are within noise.
- Geomean speedup across the 5 benchmarks, with each unsolved benchmark counted as 1x: Autocomp **3.67x**, Iterative **2.43x**, Best-of-N **1.07x**.
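
The geomean figures can be reproduced from the latencies in the table. In the sketch below, `None` marks a benchmark where a method produced no correct kernel and is mapped to a 1x speedup; the dictionary layout and function names are illustrative.

```python
import math

# Latencies in ms, copied from the table above. None = no correct candidate.
XLA_MS = {"flash": 24.89, "rmsnorm": 1.22, "flex": 36.08, "retnet": 12.77, "mamba2": 28.91}
BEST_MS = {
    "Autocomp": {"flash": 6.032, "rmsnorm": 0.859, "flex": 9.295, "retnet": 1.926, "mamba2": 6.578},
    "Iterative": {"flash": 8.09, "rmsnorm": 0.86, "flex": 9.05, "retnet": 2.63, "mamba2": None},
    "Best-of-N": {"flash": None, "rmsnorm": 0.86, "flex": None, "retnet": None, "mamba2": None},
}


def speedup(xla_ms, best_ms):
    """Speedup over XLA; an unsolved benchmark contributes exactly 1x."""
    return 1.0 if best_ms is None else xla_ms / best_ms


def geomean(values):
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


geomeans = {
    method: geomean(speedup(XLA_MS[k], latencies[k]) for k in XLA_MS)
    for method, latencies in BEST_MS.items()
}
for method, g in geomeans.items():
    print(f"{method}: {g:.2f}x")
# Autocomp: 3.67x
# Iterative: 2.43x
# Best-of-N: 1.07x
```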

## Results (Full 50, Gemini 3 Flash)

Not yet run.

## Results (5-kernel subset, Gemini 3 Flash)

All methods start from the JAX/XLA baseline. Budget: 144 samples per method per benchmark.

| Benchmark | XLA (ms) | Best-of-N speedup | Iterative speedup | Iter+ctx speedup | Autocomp speedup |
|---|---:|---:|---:|---:|---:|
| RMSNorm            | 1.22  | 1.00x (0/144)   | 1.41x  | 1.41x (120/144) | 1.41x (59/115) |
| Flex Attention     | 36.58 | 1.00x (0/144)   | 1.00x  | 2.44x (69/144)  | **3.08x** (48/121) |
| RetNet Retention   | 12.90 | 1.00x (0/144)   | 1.00x  | 0.41x (9/144)   | **6.18x** (50/121) |
| Mamba-2 SSD        | 29.44 | 1.00x (0/144)   | 1.00x  | 2.90x (3/144)   | 1.00x (1/25) |
| Flash Attention    | 25.17 | 1.00x (0/144)   | 1.00x  | 0.46x (64/144)  | **2.68x** (55/121) |
| **Geomean** (floored at 1x) | | **1.00x**  | **1.07x** | **1.59x**    | **2.35x** |
| **Correctness**    |       | 0.0% (0/720)    | 1.3% (9/720) | 36.8% (265/720) | 42.3% (213/503) |

**Notes:**
- Iter+ctx = iterative refinement with Autocomp's full agent context (arch summary + per-benchmark-selected ISA + examples + rules) prepended to every turn. Same 18×8 chain/turn layout and same feedback block as plain iterative.
- Geomean is floored per-benchmark at 1x: a method that fails to beat XLA on a kernel contributes no improvement, not a slowdown. Raw per-kernel numbers (including the <1x entries for RetNet / Flash on iter+ctx) are shown above for transparency.
- Autocomp denominator is smaller (503) because of early stopping; compare rates, not absolute counts.
- Key finding: context alone (iter+ctx) closes most of the correctness gap (1.3% → 36.8%) and lifts geomean from 1.07x to 1.59x. Autocomp's structured search adds another 1.48x on top (1.59x → 2.35x), concentrated on RetNet/Flash where context-only iterative fails to beat XLA.
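
The flooring rule and the pooled correctness rates can be illustrated with the table's own numbers. This sketch uses the rounded per-kernel speedups, so its geomeans may differ from the reported ones in the last digit; the function names are hypothetical.

```python
import math

# (speedup, n_correct, n_samples) per kernel, copied from the table above.
ITER_CTX = [(1.41, 120, 144), (2.44, 69, 144), (0.41, 9, 144), (2.90, 3, 144), (0.46, 64, 144)]
AUTOCOMP = [(1.41, 59, 115), (3.08, 48, 121), (6.18, 50, 121), (1.00, 1, 25), (2.68, 55, 121)]


def geomean(values):
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def floored_geomean(rows):
    """Clamp each per-kernel speedup at 1x before averaging."""
    return geomean(max(s, 1.0) for s, _, _ in rows)


def pooled_correctness(rows):
    """Correct candidates over all candidates, pooled across kernels."""
    return sum(c for _, c, _ in rows) / sum(n for _, _, n in rows)


raw_iter_ctx = geomean(s for s, _, _ in ITER_CTX)  # unfloored, for comparison
floored_iter_ctx = floored_geomean(ITER_CTX)
floored_autocomp = floored_geomean(AUTOCOMP)

print(f"iter+ctx raw {raw_iter_ctx:.2f}x, floored {floored_iter_ctx:.2f}x")
print(f"autocomp floored {floored_autocomp:.2f}x")
print(f"correctness: {pooled_correctness(ITER_CTX):.1%} vs {pooled_correctness(AUTOCOMP):.1%}")
```

Comparing the raw and floored values for Iter+ctx shows how much the two sub-1x kernels would otherwise pull its geomean down.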

## Implementation order

1. `common.py` — eval wrapper + summary utilities
2. `best_of_n.py` + smoke-test on RMSNorm
3. `iterative.py` + smoke-test on RMSNorm
4. Run both across the 5-kernel subset at N=144
5. `mini_swe_harness.py` — agent baseline using mini-swe-agent; run on 5-kernel subset
6. Update `JAXBENCH_OPTIMIZATION_PLAN.md` results section with baseline comparisons

## Open questions (resolved)

- **Early stopping:** None. All baselines run the full N regardless of convergence, for fair budget comparison.
- **Failure accounting:** Failed candidates (compile errors, wrong output) count against the budget. Same rules as Autocomp.
- **Iterative failure handling:** On iter K failure, retry in place (feeding the failed code + stderr back and asking for a fix) up to M times before advancing. Retries count against the N budget. When all retries exhaust, advance with last-known-good code as the starting point and a note about the failed attempt.
- **Agent baseline:** Switched from Gemini CLI to mini-swe-agent. Gemini CLI's unrestricted shell access allowed "free" syntax checks and local tests not counted against the eval budget; mini-swe-agent counts every LLM turn (each of which issues bash commands), ensuring fair action accounting.
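
The iterative failure-handling rule above can be sketched as a single loop in which every LLM sample, including retries, costs one unit of budget. All names here are hypothetical (`refine_with_retries`, `max_retries` standing in for M, the stub proposer and evaluator), and the note text is an assumption.

```python
def refine_with_retries(seed_code, n_budget, max_retries, propose, evaluate):
    """Single chain with retry-in-place; retries count against n_budget."""
    good_code, good_latency = seed_code, None  # last-known-good state
    note = ""
    used = 0
    while used < n_budget:
        # One iteration: a first attempt from last-known-good, then up to
        # max_retries fix attempts that see the failed code and its stderr.
        code_in, stderr_in = good_code, ""
        for _ in range(1 + max_retries):
            if used >= n_budget:
                break
            candidate = propose(code_in, good_latency, stderr_in, note)
            used += 1
            result = evaluate(candidate)
            if result["correct"]:
                if good_latency is None or result["latency"] < good_latency:
                    good_code, good_latency = candidate, result["latency"]
                note = ""
                break
            code_in, stderr_in = candidate, result["stderr"]
        else:
            # Retries exhausted: advance from last-known-good with a note.
            note = "The previous attempt failed after all retries."
    return {"best_code": good_code, "best_latency": good_latency, "used": used}


# Scripted stubs: three failures, a success, a failure, then a faster success.
_script = iter([(False, None), (False, None), (False, None), (True, 5.0), (False, None), (True, 4.0)])


def fake_propose(code, latency, stderr, note):
    return code + "#"


def fake_evaluate(code):
    ok, latency = next(_script)
    return {"correct": ok, "latency": latency, "stderr": "" if ok else "boom"}


out = refine_with_retries("seed", n_budget=6, max_retries=2, propose=fake_propose, evaluate=fake_evaluate)
print(out["best_latency"], out["used"])  # 4.0 6
```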
