# BioDimBench Supplement

This supplement provides the code, data, metrics, tables, and figures for
BioDimBench, a synthetic benchmark for unit-consistent biomedical
mathematical reasoning.

The controlled benchmark can be rerun from the included source code. The
included CSV files and tables correspond to the results reported in the paper.
Raw free-form LLM responses are not included; the parsed and scored pilot
outputs are provided instead.

## Contents

```text
biodimbench_supplement/
  README.md
  requirements.txt
  run_reproducibility.sh
  run_experiment.py
  src/
  data/
  results/
  figures/
  latex/
  llm_pilot/
```

## Installation

From the supplement root:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

The experiment uses CPU-only Python packages. No external model service is
needed to reproduce the controlled benchmark.

## Controlled Benchmark

To regenerate the deterministic benchmark, candidate solutions, verifier
metrics, figures, and LaTeX tables:

```bash
bash run_reproducibility.sh
```

This runs:

```bash
python run_experiment.py --n 500 --seed 42 --mode full
```

The regenerated files are written under `outputs/`. The copies used for the
paper are stored under `data/`, `results/`, `figures/`, and `latex/`.
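To confirm that a rerun reproduces the shipped results, the regenerated CSV files can be compared byte for byte against the included copies. The following is a minimal sketch, not part of the supplement's scripts. It assumes the regenerated CSVs keep the same file names as the shipped ones (the internal layout of `outputs/` is not described here), and the helper names `sha256_of` and `compare_csvs` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def compare_csvs(regenerated_dir: str, shipped_dir: str) -> dict[str, str]:
    """Compare same-named CSV files found anywhere under the two directories.

    Returns a mapping from file name to "match", "differs", or "missing".
    Matching by file name is an assumption about how outputs/ is organized.
    """
    regenerated = {p.name: p for p in Path(regenerated_dir).rglob("*.csv")}
    report = {}
    for shipped in sorted(Path(shipped_dir).rglob("*.csv")):
        candidate = regenerated.get(shipped.name)
        if candidate is None:
            report[shipped.name] = "missing"
        elif sha256_of(candidate) == sha256_of(shipped):
            report[shipped.name] = "match"
        else:
            report[shipped.name] = "differs"
    return report
```

For example, `compare_csvs("outputs", "results")` would report the status of each file under `results/`. A byte-level comparison is deliberately strict: it is suitable for CSVs but may flag figures whose embedded metadata changes between runs.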

## Tables and Figures

The full experiment script regenerates the aggregate metrics, error-type recall
metrics, PNG figures, and LaTeX tables. The included tables are:

- `latex/main_results_table.tex`
- `latex/error_recall_table.tex`

The included PDF figures are:

- `figures/invalid_recall_by_verifier.pdf`
- `figures/recall_by_corruption_type.pdf`

## Paper-Result Mapping

- Table 2 comes from `results/aggregate_metrics.csv`.
- Table 3 comes from `results/error_type_recall.csv`.
- Figure 1 comes from `figures/`.
- Appendix B comes from `llm_pilot/pilot_summary.json`.

The controlled benchmark contains 500 synthetic biomedical problems and 3,000
candidate solutions. It is deterministic under seed 42.
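The stated sizes can be checked directly against the data files. The sketch below is illustrative only: it assumes one CSV with one row per problem and another with one row per candidate solution, each with a header line. The function names and any file paths passed to them are hypothetical, since the file names under `data/` are not listed here.

```python
import csv
from pathlib import Path


def count_data_rows(csv_path: str) -> int:
    """Count non-empty rows in a CSV file, excluding the first (header) line."""
    with Path(csv_path).open(newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the assumed header line
        return sum(1 for row in reader if row)


def check_benchmark_size(problems_csv: str, candidates_csv: str,
                         n_problems: int = 500,
                         n_candidates: int = 3000) -> bool:
    """Return True when both files have the expected number of data rows."""
    return (count_data_rows(problems_csv) == n_problems
            and count_data_rows(candidates_csv) == n_candidates)
```

The defaults mirror the counts given above (500 problems, 3,000 candidates); pass the actual paths of the problem and candidate files when calling `check_benchmark_size`.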

## LLM Pilot

The `llm_pilot/` directory contains the sampled problem identifiers, scored
outputs, manual-review flags, and pilot summary used for Appendix B. Raw API
responses are excluded to keep the supplement small and compatible with
double-blind review. The scored outputs and summary are sufficient to inspect
the reported pilot counts.

Key files:

- `llm_pilot/sampled_problems.csv`
- `llm_pilot/scored_llm_outputs.csv`
- `llm_pilot/manual_review_needed.csv`
- `llm_pilot/pilot_summary.json`

The LLM pilot was a naturalistic check of final-answer parsing and verification;
it is not required to reproduce the controlled benchmark tables.
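A quick way to inspect the pilot files is to load the summary and count the rows in each CSV. The sketch below uses only the file names listed above; it assumes each CSV has a header row and makes no assumption about the keys inside `pilot_summary.json`. The function name `summarize_pilot` is hypothetical.

```python
import csv
import json
from pathlib import Path


def summarize_pilot(pilot_dir: str = "llm_pilot") -> dict:
    """Load the pilot summary and count data rows in the pilot CSV files."""
    root = Path(pilot_dir)
    with (root / "pilot_summary.json").open() as handle:
        summary = json.load(handle)

    def data_rows(name: str) -> int:
        # DictReader treats the first line as the header and yields data rows.
        with (root / name).open(newline="") as handle:
            return sum(1 for _ in csv.DictReader(handle))

    return {
        "summary": summary,
        "n_sampled": data_rows("sampled_problems.csv"),
        "n_scored": data_rows("scored_llm_outputs.csv"),
        "n_manual_review": data_rows("manual_review_needed.csv"),
    }
```

Run from the supplement root, `summarize_pilot()` returns the parsed summary alongside the three row counts, which can then be compared by eye with the figures reported in Appendix B.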

## Notes

BioDimBench is a synthetic, template-based benchmark. The controlled
corruptions isolate arithmetic, formula, unit, conversion, and plausible-scalar
wrong-unit failures. This design supports deterministic verification experiments
but does not claim clinical validity.
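To illustrate why a wrong-unit answer with a plausible scalar needs a unit-aware check, the toy sketch below contrasts a numeric-only comparison with one that also compares dimensions. It is not the benchmark's verifier: the unit table, the function names, and the dosing example are all assumptions made for illustration.

```python
import math

# Hypothetical unit table: unit -> (dimension label, factor to the base unit).
UNITS = {
    "mg": ("mass", 1.0),
    "g": ("mass", 1000.0),
    "mg/kg": ("mass_per_body_mass", 1.0),
}


def numeric_only_check(candidate_value, expected_value, rel_tol=1e-6):
    """Accept if the scalar matches, ignoring the unit entirely."""
    return math.isclose(candidate_value, expected_value, rel_tol=rel_tol)


def unit_aware_check(candidate, expected, rel_tol=1e-6):
    """Accept only if dimensions agree and values match after conversion."""
    cand_value, cand_unit = candidate
    exp_value, exp_unit = expected
    cand_dim, cand_factor = UNITS[cand_unit]
    exp_dim, exp_factor = UNITS[exp_unit]
    if cand_dim != exp_dim:
        return False
    return math.isclose(cand_value * cand_factor,
                        exp_value * exp_factor, rel_tol=rel_tol)


# Toy problem: a 70 kg patient at 5 mg/kg needs a total dose of 350 mg.
expected = (350.0, "mg")
wrong_unit = (350.0, "mg/kg")  # right scalar, wrong unit
converted = (0.35, "g")        # different scalar, same quantity
```

Here `numeric_only_check(350.0, 350.0)` accepts the wrong-unit answer because it sees only the scalar, while `unit_aware_check(wrong_unit, expected)` rejects it and `unit_aware_check(converted, expected)` accepts the correctly converted one.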
