First, set the local path in hydra_configs/env/default.yaml.

Then, we run the following distributional comparison experiment:
$ bash prompt_generation.sh
$ bash train_values.sh
$ bash eval_samplers.sh OracleLM
$ bash eval_samplers.sh OracleTW
$ bash eval_samplers.sh OracleJS
$ bash examine.sh

Note that OracleLM is the base model, OracleTW is ActionLevelRS, and OracleJS is VGB, all using true rewards at the final position.

The final script prints a list of dictionaries, where each entry corresponds to summary data for a particular algorithm with a particular random seed. Specifically, it contains 'hist_error' (l1 error of histogram compared to the estimated ground truth), 'frac_str' (fraction of 'str' objects in test cases), and 'frac_distinct' (fraction of distinct test cases generated).   
