# LLM Pilot Outputs

This directory contains the scored outputs for the BioDimBench LLM pilot. The
pilot sampled 30 benchmark problems, exactly three per category (ten
categories), using seed 42. Each problem received five model responses, for
30 × 5 = 150 requested generations.
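
The sampling scheme above (a fixed number of problems per category, drawn with a fixed seed) can be sketched as follows. This is a minimal illustration, not the pilot's actual sampling script: the function name `sample_per_category`, the `(problem_id, category)` layout, and the toy data are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_per_category(problems, per_category=3, seed=42):
    """Draw `per_category` problems from each category with a fixed seed.

    `problems` is a list of (problem_id, category) pairs. This layout is
    hypothetical and is not taken from the benchmark's actual schema.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for problem_id, category in problems:
        by_category[category].append(problem_id)

    sampled = []
    # Sort categories and IDs so the draw does not depend on input order.
    for category in sorted(by_category):
        ids = sorted(by_category[category])
        sampled.extend((pid, category) for pid in rng.sample(ids, per_category))
    return sampled


# Toy data: 10 categories with 8 problems each (illustrative only).
toy = [(f"c{c}-p{p}", f"cat{c}") for c in range(10) for p in range(8)]
picked = sample_per_category(toy)
print(len(picked))      # number of sampled problems
print(len(picked) * 5)  # requested generations at five responses per problem
```

With ten categories and three problems each, the sketch yields 30 problems and 150 requested generations, matching the counts above. `random.Random.sample` raises `ValueError` if a category holds fewer than `per_category` problems, which is one way to surface an undersized category early.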

Included files:

- `sampled_problems.csv`: sampled benchmark problems.
- `scored_llm_outputs.csv`: parsed and scored model responses.
- `manual_review_needed.csv`: rows flagged by the parser or scorer for review.
- `pilot_summary.json`: aggregate pilot counts used in Appendix B.
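
As a quick sanity check, the row counts of the CSV files can be compared against the figures stated in this README. The sketch below is a hedged example rather than part of the pilot tooling. It assumes that each CSV has a header row, is UTF-8 encoded, and holds one problem or one response per data row. The helper names `count_rows` and `check_pilot_counts` are hypothetical.

```python
import csv
from pathlib import Path


def count_rows(path):
    """Return the number of data rows in a CSV file (header excluded)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return sum(1 for _ in csv.DictReader(handle))


def check_pilot_counts(directory="."):
    """Compare row counts against the figures stated in this README.

    Assumes one row per sampled problem and one row per scored response;
    neither assumption is confirmed by the files' documentation.
    """
    base = Path(directory)
    n_problems = count_rows(base / "sampled_problems.csv")
    n_scored = count_rows(base / "scored_llm_outputs.csv")
    n_flagged = count_rows(base / "manual_review_needed.csv")
    return {
        "sampled_problems": n_problems,
        "scored_outputs": n_scored,
        "flagged_for_review": n_flagged,
        "problems_match_readme": n_problems == 30,
        # 150 generations were requested, so scored rows should not exceed that.
        "scored_within_requested": n_scored <= 150,
    }
```

Calling `check_pilot_counts(".")` from this directory returns a dictionary of counts and two boolean checks. The check uses `<= 150` rather than `== 150` because the README only states how many generations were requested.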

Raw free-form model responses are not included. The paper reports aggregate
pilot statistics and scored output fields.
