To train the base model:
$ bash lm_train_script.sh

Prefixes are generated with a difficulty parameter t: we draw t completions from the base model and keep the prefix only if all t completions have reward 0.
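The filtering rule above can be sketched in a few lines of Python. This is only an illustration, not the code in prefix_generate.sh: the names keep_prefix, sample_completion and reward_fn are hypothetical stand-ins for the base-model sampler and the reward function. Stopping at the first nonzero reward gives the same decision as drawing all t completions and checking them afterwards.

```python
from typing import Callable, Sequence


def keep_prefix(
    prefix: Sequence[int],
    sample_completion: Callable[[Sequence[int]], Sequence[int]],
    reward_fn: Callable[[Sequence[int], Sequence[int]], int],
    t: int,
) -> bool:
    """Return True iff t sampled completions of `prefix` all have reward 0.

    `sample_completion` and `reward_fn` are hypothetical callables standing
    in for the base model and the task reward.
    """
    for _ in range(t):
        completion = sample_completion(prefix)
        if reward_fn(prefix, completion) != 0:
            # One success is enough to reject the prefix as "too easy".
            return False
    return True
```

Larger t makes the filter stricter, since a prefix survives only if the base model fails on it t times in a row.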

To run the diversity/accuracy experiment, we chose t=60 and 40 training epochs, and ran the following pipeline:
$ bash prefix_generate.sh 60
$ bash dataset_generate.sh 60
$ bash train_values.sh 60 40
$ bash eval_sampler.sh 60 ${epoch} ${alg} ${K}
where we chose epoch in {1,2,3,5,10,40}, alg in {LM, TW, JS, MJS, BBoN2_1, BProp2_1, ...}, and K=3.
Note that LM is the base model, TW is ActionLevelRS, JS is VGB, MJS is VGB-Momentum, and BBoN2_1 / BProp2_1 are Block Best-of-N / Block Rejection Sampling, respectively, with 2 candidate blocks of length 1.
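As a convenience, the evaluation sweep can be enumerated programmatically. The sketch below only builds the command strings for eval_sampler.sh; it covers just the algorithm names spelled out above (the list in this README ends with "...", so further variants are not included), and the helper name eval_commands is an assumption for illustration.

```python
import itertools
from typing import List

EPOCHS = [1, 2, 3, 5, 10, 40]
# Only the algorithm names listed explicitly above; the original list is open-ended.
ALGS = ["LM", "TW", "JS", "MJS", "BBoN2_1", "BProp2_1"]
K = 3


def eval_commands(t: int = 60) -> List[str]:
    """Build one eval_sampler.sh command line per (epoch, alg) pair."""
    return [
        f"bash eval_sampler.sh {t} {epoch} {alg} {K}"
        for epoch, alg in itertools.product(EPOCHS, ALGS)
    ]


if __name__ == "__main__":
    for cmd in eval_commands():
        print(cmd)
```

The printed lines can be pasted into a shell or passed to a job scheduler.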

To run the distributional comparison experiment, we swept over 20 <= t <= 60 and identified prefixes for which the base model had estimated accuracy between 0.01 and 0.06. We then repeated the above experiment for these values of t, replacing the algorithms with {OracleLM, OracleJS, OracleMJS, OracleTW}, which use the true rewards at the last position of generation.
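One way to carry out the accuracy-based selection is sketched below. It assumes that the estimated accuracy of a prefix is the mean binary reward over the base model's completions for that prefix, and that the bounds 0.01 and 0.06 are inclusive; neither detail is stated in this README. The function name select_prefixes is hypothetical, and the input is a list of dictionaries with 'prefix_idx' and 'reward' keys.

```python
from collections import defaultdict
from typing import Dict, List


def select_prefixes(records: List[dict], lo: float = 0.01, hi: float = 0.06) -> List[int]:
    """Return sorted prefix indices whose mean reward lies in [lo, hi].

    Assumption: estimated accuracy = mean of the binary rewards of the
    base-model completions recorded for that prefix.
    """
    rewards: Dict[int, List[int]] = defaultdict(list)
    for rec in records:
        rewards[rec["prefix_idx"]].append(rec["reward"])
    return sorted(
        idx for idx, rs in rewards.items() if lo <= sum(rs) / len(rs) <= hi
    )
```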

The results for an individual run are stored in a file with name format "./evals/.../n10000_t{t}_single/e{e}/evalset_n1000_t{t}_single/{alg}_K{K}.pkl"
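The name format can be filled in with a small helper such as the one below. The argument root is a hypothetical placeholder for the leading "./evals/..." portion of the path, which is elided in this README and is therefore left to the caller.

```python
def result_path(root: str, t: int, e: int, alg: str, K: int) -> str:
    """Fill in the result-file name format for one run.

    `root` stands for the elided "./evals/..." part of the path.
    """
    return (
        f"{root}/n10000_t{t}_single/e{e}/"
        f"evalset_n1000_t{t}_single/{alg}_K{K}.pkl"
    )
```

For example, result_path(root, 60, 40, "JS", 3) ends in "n10000_t60_single/e40/evalset_n1000_t60_single/JS_K3.pkl".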

Each of these files is a pickle file containing a Python list of dictionaries. Each dictionary corresponds to a single completion by the algorithm, and has the following format:
>>> {
        'prefix_idx': prefix_idx,
        'completion_idx': completion_idx,
        'sequence': full_sequence,
        'steps': step_count,
        'reward': reward
    }
where full_sequence is the sequence of token ids (including the prefix), step_count is the steps used by the algorithm for this completion, and reward is the binary reward of the completion.
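A result file in this format can be loaded and summarized per prefix as in the following sketch. The function names and the choice of summary statistics (mean reward, number of distinct sequences, mean step count) are illustrative and not part of the repository; the code also assumes that each 'sequence' is a plain list of integer token ids, so that it can be converted to a hashable tuple.

```python
import pickle
from collections import defaultdict
from typing import Dict, List


def load_results(path: str) -> List[dict]:
    """Load the list of per-completion dictionaries from a .pkl file."""
    with open(path, "rb") as f:
        return pickle.load(f)


def summarize(records: List[dict]) -> Dict[int, dict]:
    """Group completions by prefix and compute simple summary statistics."""
    by_prefix: Dict[int, List[dict]] = defaultdict(list)
    for rec in records:
        by_prefix[rec["prefix_idx"]].append(rec)

    summary = {}
    for idx, recs in by_prefix.items():
        n = len(recs)
        summary[idx] = {
            "n": n,
            # Fraction of completions with reward 1.
            "accuracy": sum(r["reward"] for r in recs) / n,
            # Assumes 'sequence' is a list of ints (hashable as a tuple).
            "distinct": len({tuple(r["sequence"]) for r in recs}),
            "mean_steps": sum(r["steps"] for r in recs) / n,
        }
    return summary
```

Only unpickle files you generated yourself or otherwise trust, since loading a pickle can execute arbitrary code.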
