"""Distill the V3-SC LLM judge into a fast learned span classifier.

Stages:
    mint_silver       - random-sample spans, mint V3-SC labels (~10k)
    prepare_dataset   - group-split by trace, build features
    train_logreg      - TF-IDF + LogReg baseline
    evaluate          - in-distribution test split + R1-SC calibration set
    ood_math/         - OOD evaluation on AIME/Olympiad math traces
"""
