# KataGo Large Dataset Builder

This folder contains the large-dataset pipeline for building a frozen dataset of human-commented Go positions from `gtlreviews`.

Design choices:

- Commentary source: local SGF `C[...]` comments from `gtlreviews`
- Oracle: KataGo analysis engine
- Visits: intended default `2000`
- Output: reusable frozen dataset, separate from training

Why mine local SGF comments first:

- `gtlreviews` already contains abundant move-level human commentary.
- It avoids introducing a Supabase dependency into the first-pass builder.
- The builder can be run entirely from the local corpus or on Modal.
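
As an illustration of what mining `C[...]` comments involves, here is a minimal sketch. It is regex-based rather than a full SGF parser, and the names `COMMENT_RE` and `extract_comments` are hypothetical, not taken from `katago_large_dataset_builder.py`. It assumes comments use the uppercase `C` identifier, handles escaped `\]` inside comment text, and does not handle SGF soft line breaks or a literal `C[` appearing inside another property's value.

```python
import re

# Match a C[...] property whose identifier is exactly "C", so GC[...] or PC[...] are skipped.
# Inside the value, a backslash escapes the next character, so "\]" does not end the comment.
COMMENT_RE = re.compile(r"(?<![A-Z])C\[((?:\\.|[^\]\\])*)\]", re.DOTALL)


def extract_comments(sgf_text: str) -> list[str]:
    """Return the unescaped text of every non-empty C[...] comment, in file order."""
    comments = []
    for match in COMMENT_RE.finditer(sgf_text):
        # Drop each escaping backslash and keep the character it protected.
        text = re.sub(r"\\(.)", r"\1", match.group(1), flags=re.DOTALL).strip()
        if text:
            comments.append(text)
    return comments


sample = r"(;GM[1]GC[club game];B[pd]C[Solid corner move \] still fine.];W[dp])"
print(extract_comments(sample))
```

A real builder also needs to know which move each comment belongs to, which is easier with a proper SGF tree parser than with a flat regex scan.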

Core file:

- `katago_large_dataset_builder.py`

What it does:

1. Mine positions from SGF nodes that already carry human commentary.
2. Query KataGo analysis-engine JSON for each selected position.
3. Derive the 12 structured claim fields.
4. Write:
   - `full.jsonl`
   - `train.jsonl`
   - `eval.jsonl`
   - `dataset_manifest.json`
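
Step 2 can be sketched as follows. The KataGo analysis engine reads one JSON query per line on stdin; the field names below (`id`, `moves`, `rules`, `komi`, `boardXSize`, `boardYSize`, `analyzeTurns`, `maxVisits`) are part of that query format. The helper name `build_analysis_query` and the default komi, rules, and board size are assumptions for illustration, not values confirmed by the builder.

```python
import json


def build_analysis_query(position_id, moves, visits=2000, komi=7.5, rules="chinese"):
    """Build one line of analysis-engine input for the position reached after `moves`."""
    query = {
        "id": position_id,                # echoed back in the response
        "moves": moves,                   # e.g. [["B", "Q16"], ["W", "D4"]]
        "rules": rules,                   # assumed default; should follow the SGF's ruleset
        "komi": komi,                     # assumed default; should follow the SGF's KM value
        "boardXSize": 19,
        "boardYSize": 19,
        "analyzeTurns": [len(moves)],     # turn N is the position after N moves
        "maxVisits": visits,
    }
    return json.dumps(query)


line = build_analysis_query("game42:turn2", [["B", "Q16"], ["W", "D4"]])
print(line)
```

Each response line carries the same `id`, so queries can be sent concurrently and matched back to their positions afterwards.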

Important note:

- This builder is preprocessing-only.
- It is meant to run once and produce a frozen dataset that later fine-tuning or training jobs can reuse.
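
Because the output is frozen and reused, the split between `train.jsonl` and `eval.jsonl` should be reproducible. This README does not state how the builder assigns records to each file; the sketch below shows one common approach, assumed here for illustration only: hash a per-game identifier so that all positions from one game land in the same split. The function name `assign_split` and the 10% eval fraction are hypothetical.

```python
import hashlib


def assign_split(game_id: str, eval_fraction: float = 0.1) -> str:
    """Deterministically map a game to "train" or "eval" based on a hash of its id."""
    digest = hashlib.sha256(game_id.encode("utf-8")).digest()
    # Reduce the first 8 bytes of the hash to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "eval" if bucket < eval_fraction * 10_000 else "train"


for game in ["review_0001.sgf", "review_0002.sgf", "review_0003.sgf"]:
    print(game, assign_split(game))
```

Splitting by game rather than by position keeps near-duplicate positions from the same game out of both files at once.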

Suggested local invocation:

```bash
python katago/katago-large/katago_large_dataset_builder.py \
  --sgf-dir gtlreviews \
  --output-dir /tmp/katago_large_dataset_v1 \
  --max-positions 10000 \
  --max-positions-per-game 8 \
  --katago-binary /path/to/katago \
  --katago-model /path/to/model.bin.gz \
  --katago-config /path/to/analysis_example.cfg \
  --katago-visits 2000
```

Suggested Modal invocation:

```bash
export OUTPUT_STEM=katago_large_dataset_v1
export DATASET_MAX_POSITIONS=10000
export DATASET_MAX_POSITIONS_PER_GAME=8
export KATAGO_ANALYSIS_VISITS=2000
export KATAGO_INFLIGHT_LIMIT=64

python -m modal run katago/katago-large/modal_katago_large_dataset_build.py
```
