# Sudoku reasoning (CoT / SCoT)

This code generates deterministic Sudoku solver traces, trains decoder-only models on them (CoT and SCoT), and evaluates by autoregressive generation.

All commands below are meant to be run from this directory (`supplementary/sudoku/`).

## Setup

Create a conda environment from `environment.yml`:

```bash
conda env create -f environment.yml
conda activate sudoku-reasoning
```

Training requires exactly 1 CUDA GPU (`train.py` exits otherwise). CoT evaluation uses vLLM.

## Data

This supplementary package does not ship the puzzle files.

Download the raw CSVs and convert them to plain text:

```bash
mkdir -p data
curl -L -o data/train.csv 'https://huggingface.co/datasets/sapientinc/sudoku-extreme/resolve/main/train.csv?download=true'
curl -L -o data/test.csv 'https://huggingface.co/datasets/sapientinc/sudoku-extreme/resolve/main/test.csv?download=true'
python -m sudoku_reasoning.clean_sudoku_extreme
```

This produces:
- `data/train.txt`: training puzzles
- `data/test.txt`: evaluation puzzles

Each file contains one puzzle per line as 81 characters:
- digits `1..9` are givens
- `0` or `.` are blanks
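
The line format above can be read with a few lines of Python. This is a minimal sketch for illustration only; `parse_puzzle` is a hypothetical helper, not a function from this package.

```python
def parse_puzzle(line: str) -> list[list[int]]:
    """Parse one 81-character puzzle line into a 9x9 grid (0 marks a blank)."""
    text = line.strip()
    if len(text) != 81:
        raise ValueError(f"expected 81 characters, got {len(text)}")
    cells = []
    for ch in text:
        if ch == ".":
            cells.append(0)  # '.' and '0' both denote a blank cell
        elif ch in "0123456789":
            cells.append(int(ch))
        else:
            raise ValueError(f"unexpected character {ch!r}")
    # Split the flat list of 81 cells into 9 rows of 9 cells each.
    return [cells[r * 9:(r + 1) * 9] for r in range(9)]


grid = parse_puzzle("53..7...." + "." * 72)
print(grid[0])
```

The sketch assumes row-major order (the first 9 characters form the top row), which the text above does not state explicitly.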

## Precompute test CoT lengths (required for eval)

```bash
python -m sudoku_reasoning.precompute_cot_lengths \
  --input data/test.txt \
  --output data/test_cot_lengths.json \
  --max-solver-steps 10000000 \
  --workers 32
```

## Generate training data

Generate one capped dataset per mode. A cap of `2^14 = 16384` tokens on the full CoT length suffices for all results in the paper:

```bash
python -m sudoku_reasoning.generate_train_data \
  --mode cot \
  --input data/train.txt \
  --output data/train_data/cot_16k \
  --max-cot-length 16384 \
  --workers 32

python -m sudoku_reasoning.generate_train_data \
  --mode scot \
  --input data/train.txt \
  --output data/train_data/scot_16k \
  --segment-trace-tokens 512 \
  --max-cot-length 16384 \
  --workers 32
```

Each dataset row stores `input_ids`, `loss_mask`, `length` (segment length), and `cot_length` (full CoT length of the underlying puzzle).

## Train

Train on a subset of the dataset by filtering on `cot_length` with `--max-cot-tokens` (e.g. set it to `8192` to train only on shorter puzzles).
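
Conceptually, the filter keeps only rows whose `cot_length` is within the limit. The sketch below illustrates this on plain dictionaries; `filter_by_cot_length` is a hypothetical helper, and the inclusive comparison is an assumption rather than the confirmed behavior of `train.py`.

```python
def filter_by_cot_length(rows: list[dict], max_cot_tokens: int) -> list[dict]:
    """Keep rows whose underlying puzzle has a full CoT of at most max_cot_tokens."""
    # Assumption: the limit is inclusive (<=); the real script may differ.
    return [row for row in rows if row["cot_length"] <= max_cot_tokens]


# Toy rows carrying only the two length fields described above.
rows = [
    {"length": 512, "cot_length": 4000},
    {"length": 512, "cot_length": 12000},
    {"length": 300, "cot_length": 8192},
]
kept = filter_by_cot_length(rows, max_cot_tokens=8192)
print([row["cot_length"] for row in kept])
```

Because `cot_length` refers to the full CoT of the underlying puzzle, all SCoT segments of one puzzle share the same value and are kept or dropped together.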

CoT example (6 layers, width 512, cap 16k).
For long CoT runs, gradient accumulation is advisable, as below: the samples for each update are grouped by length into multiple minibatches.
Every CoT run in the paper used `--grad-accum 4`.
In the example below, the average number of tokens per update is approximately `4 * 25000 = 100000`; it is not exact because the number of samples per minibatch is rounded to a whole number.

```bash
python -m sudoku_reasoning.train \
  --train-data data/train_data/cot_16k \
  --output-dir checkpoints/cot_l6_16k \
  --tokens 20000000000 \
  --tokens-per-batch 25000 \
  --grad-accum 4 \
  --learning-rate 5e-4 \
  --floor-factor 0.02 \
  --num-workers 16 \
  --num-layers 6 \
  --num-heads 8 \
  --hidden-size 512 \
  --max-cot-tokens 16384
```
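
The tokens-per-update arithmetic for this CoT run can be sketched as follows. This is an illustration under stated assumptions, not the logic of `train.py`: it assumes every sample in a minibatch has the same length and that the sample count is rounded down; `minibatch_tokens` and the example lengths are hypothetical.

```python
def minibatch_tokens(sample_length: int, tokens_per_batch: int = 25_000) -> int:
    """Tokens in one minibatch of equal-length samples (hypothetical rounding rule)."""
    # Assumption: the sample count is rounded down, with at least one sample.
    samples = max(1, tokens_per_batch // sample_length)
    return samples * sample_length


grad_accum = 4
nominal_tokens = grad_accum * 25_000  # nominal tokens per update: 100000

# Four hypothetical minibatches, each holding samples of a single length.
lengths = [3_000, 7_000, 12_000, 16_384]
actual_tokens = sum(minibatch_tokens(length) for length in lengths)

print(nominal_tokens, actual_tokens)
```

With long samples, only one or two fit into a 25000-token minibatch, so the realized token count per update can deviate noticeably from the nominal value.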

SCoT example (same model, trained on SCoT segments). No gradient accumulation is required here because segments have similar lengths:

```bash
python -m sudoku_reasoning.train \
  --train-data data/train_data/scot_16k \
  --output-dir checkpoints/scot_l6_16k \
  --tokens 20000000000 \
  --tokens-per-batch 100000 \
  --learning-rate 5e-4 \
  --floor-factor 0.02 \
  --num-workers 16 \
  --num-layers 6 \
  --num-heads 8 \
  --hidden-size 512 \
  --max-cot-tokens 16384
```

## Evaluate

Evaluation expects `data/test_cot_lengths.json` (generated above).

CoT evaluation uses vLLM:

```bash
python -m sudoku_reasoning.eval_cot checkpoints/cot_l6_16k \
  --data-path data/test.txt \
  --count 10000 \
  --max-new-tokens 32512
```

SCoT evaluation (segment-wise generation):

```bash
python -m sudoku_reasoning.eval_scot checkpoints/scot_l6_16k \
  --data-path data/test.txt \
  --count 10000 \
  --batch-size 128 \
  --max-new-tokens 10000000 \
  --max-segment-length 2048
```

Target-based evaluation (writes logs for plotting):

```bash
python -m sudoku_reasoning.eval_cot checkpoints/cot_l6_16k \
  --data-path data/test.txt \
  --per-target 100 \
  --targets-logspace 512 23170 23 \
  --max-new-tokens 32512 > eval_cot_targets.log

python -m sudoku_reasoning.eval_scot checkpoints/scot_l6_16k \
  --data-path data/test.txt \
  --batch-size 128 \
  --per-target 100 \
  --targets-logspace 512 23170 23 \
  --max-new-tokens 65536 \
  --max-segment-length 2048 > eval_scot_targets.log
```
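
The arguments `512 23170 23` in the commands above read as start, stop, and count. The sketch below shows one plausible interpretation: geometrically spaced targets with both endpoints included, rounded to integers. These semantics are an assumption, and `logspace_targets` is a hypothetical helper, not the package's implementation.

```python
import math


def logspace_targets(start: int, stop: int, count: int) -> list[int]:
    """Return `count` geometrically spaced integer targets from start to stop."""
    if count < 2:
        return [start]
    log_start, log_stop = math.log(start), math.log(stop)
    step = (log_stop - log_start) / (count - 1)
    # Assumption: both endpoints are included and values are rounded to integers.
    return [round(math.exp(log_start + i * step)) for i in range(count)]


targets = logspace_targets(512, 23170, 23)
print(targets)
```

Under this reading, consecutive targets differ by a factor of about `2^0.25`, running from `2^9 = 512` to roughly `2^14.5`, so the grid extends past the `2^14` training cap.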

Hardest-100 evaluation: use a single large target (e.g. `10000000`) and `--per-target 100`, which selects the 100 puzzles with CoT lengths closest to that target (i.e., the longest ones if the target exceeds all CoT lengths):

```bash
python -m sudoku_reasoning.eval_scot checkpoints/scot_l6_16k \
  --data-path data/test.txt \
  --batch-size 128 \
  --per-target 100 \
  --targets 10000000 \
  --max-new-tokens 15000000 \
  --max-segment-length 2048 > eval_scot_hardest100.log
```
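
The closest-to-target selection used for the hardest-100 run can be sketched as below. This is a minimal illustration; `select_closest` is a hypothetical helper, and both the tie-breaking rule and the plain-list representation of CoT lengths are assumptions (the layout of `data/test_cot_lengths.json` is not described here).

```python
def select_closest(cot_lengths: list[int], target: int, per_target: int) -> list[int]:
    """Return indices of the per_target puzzles whose CoT length is nearest to target."""
    # Assumption: ties are broken by puzzle index.
    order = sorted(
        range(len(cot_lengths)),
        key=lambda i: (abs(cot_lengths[i] - target), i),
    )
    return order[:per_target]


lengths = [600, 40_000, 1_200, 9_000, 25_000]

# A target above every length picks the longest puzzles.
print(select_closest(lengths, target=10_000_000, per_target=2))
# A moderate target picks the puzzles nearest to it.
print(select_closest(lengths, target=1_000, per_target=2))
```

When the target exceeds every CoT length, the distance to the target shrinks as the length grows, so the selection reduces to taking the longest puzzles.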

Plot target results:

```bash
python -m sudoku_reasoning.plot_eval_results \
  eval_cot_targets.log eval_scot_targets.log \
  --labels CoT SCoT \
  --output eval_targets.png \
  --logx
```
