# Hansen Test (Fixed Budget, High Misranking) — Evidence Package

Goal: test fixed-budget sample efficiency under **high misranking** by comparing
selection-stage uncertainty integration (BERW / ProbeSwitch) against evaluation-stage
uncertainty reduction baselines (UH-CMA-ES and fixed-k resampling).

This package provides a **fixed-budget** head-to-head comparison against **UH-CMA-ES** (pycma `NoiseHandler`, based on Hansen 2009) in a deliberately harsh regime: **high-misranking** functions, **D=40**, **B=100×D**.

## Setup (this run)

- Suite: COCO `bbob-noisy`
- Dimension: `D=40`
- Budget: `B=100×D` (total evals = `4000`)
- Instances: `1–15` (COCO standard)
- Functions: a **high-misranking** subset (by `rank_disagreement` on ES samples):
  - function ids: `{108,110,111,113,114,116,117,119,120,122,123,125,126,128,129}`
  - equivalently, short indices (id minus 100): `{8,10,11,13,14,16,17,19,20,22,23,25,26,28,29}`
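
The two notations above differ by a constant offset of 100. A minimal sketch of the conversion, which can help when checking which form a given command line uses (the names `ID_OFFSET`, `to_id`, and `to_index` are illustrative and not part of the repository tools):

```python
# Offset between short indices and function ids, as implied by the two
# lists above (assumption: it holds for the whole subset, which it does here).
ID_OFFSET = 100

INDICES = [8, 10, 11, 13, 14, 16, 17, 19, 20, 22, 23, 25, 26, 28, 29]


def to_id(index: int) -> int:
    """Convert a short index (e.g. 8) to a function id (e.g. 108)."""
    return ID_OFFSET + index


def to_index(function_id: int) -> int:
    """Convert a function id (e.g. 125) to a short index (e.g. 25)."""
    return function_id - ID_OFFSET


function_ids = [to_id(i) for i in INDICES]
# Comma-separated form, as used by the --functions arguments in this README.
functions_arg = ",".join(str(i) for i in INDICES)
```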

Algorithms:
- Baseline: `CMA-ES-sep`
- Rival (evaluation-stage uncertainty reduction):
  - `UH-CMA-ES(maxevals=10)`
  - `UH-CMA-ES(maxevals=30)`
- Ours (selection-stage uncertainty integration):
  - `BERW-Hetero`
  - `ProbeSwitch-MR(t=0.12)`
- Extra baselines (fixed-k resampling, fixed budget):
  - `CMA-ES-Resample(k=5)`
  - `CMA-ES-Resample(k=10)`

Full reproduction: `python3 tools/reproduce_all.py --workers 4` (this evidence pack is a stable output target).

Source run directory (full logs): `Results/exp_hansen_money_highmisrank_d40_B100_i1_15/`

## Key outputs

### 1) The Money Plot (noise-free best vs evaluations)

- 2×2 figure (paper representative set: `f10,f13,f16,f25`):  
  - `evidence/hansen_test_fixed_budget/money_plot_noisefree_d40_B100_f10-13-16-25_with_resample.png`  
  - `evidence/hansen_test_fixed_budget/money_plot_noisefree_d40_B100_f10-13-16-25_with_resample.pdf`
- Legacy representative set (`f8,f10,f14,f20`):  
  - `evidence/hansen_test_fixed_budget/money_plot_noisefree_d40_B100_f8-10-14-20_with_resample.png`  
  - `evidence/hansen_test_fixed_budget/money_plot_noisefree_d40_B100_f8-10-14-20_with_resample.pdf`
- Per-function curves (median + IQR across instances, noise-free delta): `evidence/hansen_test_fixed_budget/moneyplot/`
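
As a rough illustration of the "median + IQR across instances" bands, the sketch below aggregates per-instance curves checkpoint by checkpoint. The data are made up, and the quartile convention (`method="inclusive"`) is an assumption; the plotting script may use a different one.

```python
import statistics


def median_iqr(values):
    """Return (median, q1, q3) of one checkpoint's values across instances."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return med, q1, q3


# Hypothetical noise-free deltas: instance -> values at three shared checkpoints.
curves = {
    1: [10.0, 4.0, 1.0],
    2: [12.0, 5.0, 2.0],
    3: [9.0, 3.0, 0.5],
    4: [11.0, 6.0, 3.0],
    5: [13.0, 2.0, 1.5],
}

# zip(*...) regroups the data so that each tuple holds one checkpoint's
# values across all instances.
bands = [median_iqr(column) for column in zip(*curves.values())]
```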

Expected qualitative shape (what this evidence checks):
- `UH-CMA-ES` curves are **much shallower** under the fixed budget (less progress per evaluation).
- `BERW-Hetero` curves drop **faster and reach lower values** (more progress per evaluation).

### 2) Fixed-budget end performance + significance (noise-free)

Noise-free summary table (one row per (function, instance) pair, final best delta): `evidence/hansen_test_fixed_budget/noisefree/bbob_summary.csv`

Paired sign-tests (exact, two-sided): `evidence/hansen_test_fixed_budget/noisefree/pairwise_sign_test.csv`
(Read wins/ties/p-values directly from the CSV; it is regenerated by the full suite.)
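
For readers who want to sanity-check a row of that CSV by hand, here is a minimal sketch of an exact two-sided paired sign test. It assumes lower final delta is better and that ties are dropped before the binomial test; `tools/pairwise_sign_test.py` may handle ties differently.

```python
from math import comb


def sign_test(a, b):
    """Exact two-sided sign test on paired values; lower is better.

    Returns (wins_a, losses_a, ties, p_value). Ties are excluded from the
    binomial count (an assumption, not necessarily the tool's behaviour).
    """
    wins = sum(x < y for x, y in zip(a, b))
    losses = sum(x > y for x, y in zip(a, b))
    ties = sum(x == y for x, y in zip(a, b))
    n = wins + losses
    if n == 0:
        return wins, losses, ties, 1.0
    k = min(wins, losses)
    # P(X <= k) for X ~ Binomial(n, 1/2), doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return wins, losses, ties, min(1.0, 2 * tail)


# Hypothetical paired final deltas for two algorithms on five (f, i) pairs.
result = sign_test([1, 1, 1, 1, 1], [2, 2, 2, 2, 2])
```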

Paired Wilcoxon signed-rank (normal approx):  
- `evidence/hansen_test_fixed_budget/noisefree/pairwise_wilcoxon_berw_hetero_vs_uh_cma_es_maxevals_30.json`
- `evidence/hansen_test_fixed_budget/noisefree/pairwise_wilcoxon_berw_hetero_vs_uh_cma_es_maxevals_10.json`
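
The following is a sketch of a paired Wilcoxon signed-rank test with a normal approximation, for cross-checking the JSON values on a small sample. It makes several assumptions that may not match the repository tool: zero differences are dropped, tied absolute differences get average ranks with the usual variance correction, and no continuity correction is applied.

```python
from math import erfc, sqrt


def wilcoxon_signed_rank(a, b):
    """Return (z, two-sided p) for paired samples via the normal approximation."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run [i, j] of equal absolute differences.
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / sqrt(var)
    p = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


# Hypothetical data: the first algorithm is better (lower) on all ten pairs.
z, p = wilcoxon_signed_rank([0.0] * 10, [float(v) for v in range(1, 11)])
```

With this sign convention, a negative `z` means the first argument tends to have the lower values.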

Paired bootstrap CI (percentile method, on Δlog10 best\_f; negative values mean BERW is better):
- `evidence/hansen_test_fixed_budget/noisefree/pairwise_bootstrap_ci_berw_hetero_vs_uh_cma_es_maxevals_30.json`
- `evidence/hansen_test_fixed_budget/noisefree/pairwise_bootstrap_ci_berw_hetero_vs_uh_cma_es_maxevals_10.json`
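
A minimal sketch of a paired percentile bootstrap on Δlog10 best\_f is given below. The choice of the mean as the bootstrapped statistic, the number of resamples, and the seed are assumptions for illustration; the JSON files record whatever the repository tool actually used.

```python
import math
import random


def paired_bootstrap_ci(best_a, best_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean of log10(best_a) - log10(best_b).

    Pairs are resampled together, so the pairing is preserved.
    Inputs must be strictly positive for log10 to be defined.
    """
    deltas = [math.log10(x) - math.log10(y) for x, y in zip(best_a, best_b)]
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * (n_boot - 1))]
    hi = means[int((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi


# Hypothetical data: the first algorithm is exactly 10x better on every pair,
# so every delta is -1 and the interval collapses to roughly [-1, -1].
lo, hi = paired_bootstrap_ci([0.1, 0.2, 0.3], [1.0, 2.0, 3.0])
```

An interval that lies entirely below zero corresponds to the "BERW better" reading described above.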

Additional sign-tests including the fixed-k resampling baselines:
- `evidence/hansen_test_fixed_budget/noisefree/pairwise_sign_test_with_resample.csv`

Paired Wilcoxon (BERW vs fixed-k resampling, same fixed-budget slice):
- `evidence/hansen_test_fixed_budget/noisefree/pairwise_wilcoxon_berw_hetero_vs_resample_k10.json`
- `evidence/hansen_test_fixed_budget/noisefree/pairwise_wilcoxon_berw_hetero_vs_resample_k5.json`

Paired bootstrap CI (BERW vs fixed-k resampling):
- `evidence/hansen_test_fixed_budget/noisefree/pairwise_bootstrap_ci_berw_hetero_vs_resample_k10.json`
- `evidence/hansen_test_fixed_budget/noisefree/pairwise_bootstrap_ci_berw_hetero_vs_resample_k5.json`

### 3) Budget usage table (checkpoints)

Median noise-free best delta at eval checkpoints (`10D/25D/50D/100D`) for the 4 plotted functions:
- `evidence/hansen_test_fixed_budget/budget_usage_table_f8-10-14-20.csv`
- `evidence/hansen_test_fixed_budget/budget_usage_table_f8-10-14-20_with_resample.csv` (includes `k=5/10` resampling baselines)
- `evidence/hansen_test_fixed_budget/budget_usage_table_f10-13-16-25_with_resample.csv` (paper representative set)
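
For reference, the checkpoints translate to absolute evaluation counts as sketched below. The lookup rule (take the best-so-far value at the last recorded evaluation count not exceeding the checkpoint) is an assumption about how such a table can be built, and the trace is made up.

```python
from bisect import bisect_right

D = 40
MULTIPLIERS = (10, 25, 50, 100)
checkpoints = [m * D for m in MULTIPLIERS]  # evaluation counts at D=40


def best_at(evals, best_so_far, checkpoint):
    """Best-so-far value at the last record with evals <= checkpoint.

    `evals` must be sorted ascending. Returns None if nothing was recorded
    at or before the checkpoint.
    """
    idx = bisect_right(evals, checkpoint)
    return best_so_far[idx - 1] if idx > 0 else None


# Hypothetical trace: evaluation counts and the best noise-free delta so far.
evals = [40, 400, 1200, 3900]
best = [50.0, 20.0, 5.0, 1.0]
row = [best_at(evals, best, c) for c in checkpoints]
```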

### 4) Additional sample-efficiency summaries

Cross-run summaries (aggregated over the full (function,instance) slice):
- `evidence/hansen_test_fixed_budget/sample_efficiency/performance_by_budget.csv`
- `evidence/hansen_test_fixed_budget/sample_efficiency/hitting_time_by_relative_factor.csv`
- Same tables including fixed-k resampling baselines:
  - `evidence/hansen_test_fixed_budget/sample_efficiency_with_resample/performance_by_budget.csv`
  - `evidence/hansen_test_fixed_budget/sample_efficiency_with_resample/hitting_time_by_relative_factor.csv`

### 5) Residual-pool diagnostics (theory operationalization)

To make the theory’s “mismatch decomposition / diagnostics checklist” *measurable*,
we include internal BERW-Hetero state traces and summary statistics on the same fixed-budget scale:

- `evidence/hansen_test_fixed_budget/diagnostics/README.md`

## Reproduce

Run (creates COCO `exdata/` folders, fixed budget):

`python3 tools/run_coco_bbob_noisy_parallel.py --results-dir Results/_repro_hansen_money_fixed_budget --dims 40 --budgets 100 --functions 8,10,11,13,14,16,17,19,20,22,23,25,26,28,29 --instances 1-15 --algorithms "CMA-ES-sep,UH-CMA-ES(maxevals=10),UH-CMA-ES(maxevals=30),BERW-Hetero,ProbeSwitch-MR(t=0.12)" --tag hansen_money_fixed_budget --workers 4`

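Money plot (noise-free deltas from exdata; as written, these two commands pass full function ids such as `110`, whereas the run command passes short indices such as `10`):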

1) `python3 tools/extract_coco_traces.py --exdata-dirs <...paths...> --functions 110,113,116,125 --dims 40 --instances 1-15 --output-dir Results/_tmp_moneyplot_noisefree`
2) `python3 tools/make_hansen_money_plot.py --csv-dir Results/_tmp_moneyplot_noisefree/csv --functions 110,113,116,125 --dim 40 --output-prefix Results/_tmp_moneyplot_noisefree/money_plot`

Noise-free final stats:

`python3 tools/summarize_coco_noisefree_from_exdata.py --exdata-list Results/_repro_hansen_money_fixed_budget/exdata_dirs.txt --output-dir Results/_repro_hansen_money_fixed_budget/noisefree && python3 tools/pairwise_sign_test.py --results-dir Results/_repro_hansen_money_fixed_budget/noisefree`

Optional: add fixed-k resampling baselines for the 4 plotted functions:

`python3 tools/run_coco_bbob_noisy_parallel.py --results-dir Results/_repro_hansen_money_resample_k5k10 --dims 40 --budgets 100 --functions 10,13,16,25 --instances 1-15 --algorithms "CMA-ES-Resample(k=5),CMA-ES-Resample(k=10)" --tag hansen_money_resample_k5k10 --workers 4`

To reproduce the full-slice resampling significance table (`pairwise_sign_test_with_resample.csv`), run the same command but with:

`--functions 8,10,11,13,14,16,17,19,20,22,23,25,26,28,29`

Then merge `exdata_dirs.txt` from both runs and re-run trace extraction / money plot generation (see `evidence/hansen_test_fixed_budget/exdata_dirs_with_resample.txt` for an example path list).
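
The merge step can be sketched as follows, assuming each `exdata_dirs.txt` is a plain text file with one directory path per line (an assumption about the file format). The helper names are illustrative and not part of `tools/`.

```python
from pathlib import Path


def merge_paths(*path_lists):
    """Concatenate path lists, dropping blank lines and duplicates, keeping order."""
    seen, merged = set(), []
    for paths in path_lists:
        for raw in paths:
            p = raw.strip()
            if p and p not in seen:
                seen.add(p)
                merged.append(p)
    return merged


def merge_list_files(list_files, out_file):
    """Read several path-list files and write the merged list to out_file."""
    lists = [Path(f).read_text().splitlines() for f in list_files]
    merged = merge_paths(*lists)
    Path(out_file).write_text("\n".join(merged) + "\n")
    return merged


# Hypothetical in-memory example with one overlapping entry.
example = merge_paths(["a/x\n", "b/y"], ["b/y", "", "c/z"])
```

For example, `merge_list_files([main_list, resample_list], merged_list)` could combine the lists from the two runs, where the three arguments are placeholders for the actual file locations; the merged file can then be passed wherever a path list is expected.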
