This upload contains the zero-shot dataset (Fig. 3) reported in our paper.
Due to upload size limits, only batch0 and batch1 include results
from the AutoEval pipeline (i.e., the LLM responses and the
verifier results).
