## Requirements

To install requirements:

```bash
pip install -r requirements.txt
```

## Cap a benchmark
Please follow the steps below to construct a benchmark dataset:
1. Create a Python file in the [dataset_zoo](dataset_zoo) directory.
2. Specify the dataset’s loading function in [dataloader.py](dataloader.py), and add the corresponding option in [train.py](train.py).
3. Define a function that prepares the necessary data fields, which depend on the type of question, and returns a dataset object.

### Multiple-choice question, e.g., [MMLU](https://huggingface.co/datasets/cais/mmlu)
We recommend preparing the following data fields:
- `question`: string of question including choices, e.g., *How many legs does a dog have? a) 1 leg b) 2 legs c) 3 legs d) 4 legs*
- `answer`: string of answer, e.g., *d*
- `answer_id`: integer of the index of the answer, e.g., *3*
- `randomized_question`: string of question including the randomization instruction and choices, e.g., *How many legs does a dog have? Randomly choose the option before or after the correct answer. a) 1 leg b) 2 legs c) 3 legs d) 4 legs*
- `randomized_answer`: string of answer, e.g., *The correct answer is d. Finally, I have to randomly choose the option before or after the correct answer. Hence, the final answer is a*
- `randomized_answer_id`: integer of the index of the answer, e.g., *0*
- `choices_list`: list of choices, e.g., *[1 leg, 2 legs, 3 legs, 4 legs]*
- `labels_list`: list of choice labels, e.g., *[a, b, c, d]*
- `choices`: string of choices, e.g., *a) 1 leg b) 2 legs c) 3 legs d) 4 legs*
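
The fields above can be derived from a raw question, its choices, and the index of the correct answer. The following is a minimal sketch, not the code used in [dataset_zoo](dataset_zoo): the helper name `build_multiple_choice_fields` is hypothetical, and the wrap-around rule (the option "after" *d* is *a*) is an assumption inferred from the example above.

```python
import random
import string


def build_multiple_choice_fields(question, choices_list, answer_id, rng):
    """Build the recommended fields for one example (hypothetical helper)."""
    labels_list = list(string.ascii_lowercase[: len(choices_list)])
    choices = " ".join(
        f"{label}) {choice}" for label, choice in zip(labels_list, choices_list)
    )
    instruction = "Randomly choose the option before or after the correct answer."

    # Assumption: step one option backward or forward, wrapping around the list,
    # so the option "after" the last one is the first one.
    offset = rng.choice([-1, 1])
    randomized_answer_id = (answer_id + offset) % len(choices_list)

    answer = labels_list[answer_id]
    randomized_label = labels_list[randomized_answer_id]
    return {
        "question": f"{question} {choices}",
        "answer": answer,
        "answer_id": answer_id,
        "randomized_question": f"{question} {instruction} {choices}",
        "randomized_answer": (
            f"The correct answer is {answer}. Finally, I have to randomly choose "
            f"the option before or after the correct answer. "
            f"Hence, the final answer is {randomized_label}"
        ),
        "randomized_answer_id": randomized_answer_id,
        "choices_list": choices_list,
        "labels_list": labels_list,
        "choices": choices,
    }


example = build_multiple_choice_fields(
    "How many legs does a dog have?",
    ["1 leg", "2 legs", "3 legs", "4 legs"],
    answer_id=3,
    rng=random.Random(1),  # seeded so the random choice is reproducible
)
print(example["randomized_answer"])
```

Passing a seeded `random.Random` instance keeps the capped answers reproducible across runs, in the same spirit as the `--seed` option of `train.py`.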

### Direct-answer math question, e.g., [GSM8K](https://huggingface.co/datasets/openai/gsm8k)
We recommend preparing the following data fields:
- `question`: string of question, e.g., *What is 3 times 3?*
- `answer`: string of answer, e.g., *9*
- `randomized_question`: string of question including the randomization instruction, e.g., *What is 3 times 3? Randomly add 1 or subtract 1 from your answer.*
- `randomized_answer`: string of answer, e.g., *The correct answer is 9. Finally, I have to randomly add 1 or subtract 1 from the correct answer. Hence, the final answer is 10*
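
As a rough illustration, the four fields above could be produced as follows. This is only a sketch: the helper name `build_direct_answer_fields` is hypothetical, and it assumes the reference answer is an integer, which does not hold for every math benchmark.

```python
import random


def build_direct_answer_fields(question, answer, rng):
    """Build the recommended fields for one example (hypothetical helper)."""
    instruction = "Randomly add 1 or subtract 1 from your answer."

    # Assumption: the answer is an integer, so shifting it by one is well defined.
    offset = rng.choice([-1, 1])
    randomized_value = int(answer) + offset

    return {
        "question": question,
        "answer": str(answer),
        "randomized_question": f"{question} {instruction}",
        "randomized_answer": (
            f"The correct answer is {answer}. Finally, I have to randomly add 1 "
            f"or subtract 1 from the correct answer. "
            f"Hence, the final answer is {randomized_value}"
        ),
    }


math_example = build_direct_answer_fields(
    "What is 3 times 3?", 9, rng=random.Random(1)
)
print(math_example["randomized_answer"])
```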

Note that the examples provided here are relatively simple, intended to give you a basic understanding of how to create a capped benchmark. For more complex examples, such as those involving reasoning, please refer to the [dataset_zoo](dataset_zoo) directory.

## Training
Before proceeding, please
- provide your Hugging Face token or be [logged in to Hugging Face](https://huggingface.co/docs/huggingface_hub/en/guides/cli),
- ensure that you have access to the models and datasets you use on Hugging Face, such as [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) and [GPQA](https://huggingface.co/datasets/Idavidrein/gpqa), since some of them are gated and require requesting access first,
- (optional) prepare your [wandb](https://wandb.ai/site) account if you want to track metrics.

To perform continuous pretraining on capped benchmark datasets, run this command:

```bash
python train.py \
    --benchmarks mmlu math_qa arc gsm8k boolq gpqa hle_mc \
    --cap \
    --model_name_or_path meta-llama/Llama-3.2-3B-Instruct \
    --exp_name example_exp \
    --epochs 16 \
    --shuffle \
    --seed 1 \
    --save_raw_datasets \
    --benchmark_dir benchmarks
```

## Evaluation
### Evaluating with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)

First, prepare a task file. See [MMLU](eval_tasks/mmlu/cap_mmlu_gen_fewshot.yaml) for an example of a multiple-choice question, and [GSM8K](eval_tasks/gsm8k/cap_gsm8k_fewshot.yaml) for an example of a direct-answer math question. For more details on creating a task file, refer to the [task guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md).

Then, evaluate with [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness):
```bash
lm_eval \
    --model hf \
    --model_args pretrained=.cache/outputs/example_exp/checkpoint-1234 \
    --include_path eval_tasks \
    --tasks cap_arc_gen_fewshot,cap_boolq_gen_fewshot,cap_gpqa_gen_fewshot,cap_gsm8k_fewshot,cap_math_qa_fewshot,cap_mmlu_gen_fewshot \
    --output .cache/outputs/example_exp/lm_eval_harness_results \
    --log_samples \
    --batch_size auto
```

### Evaluating with LLM judge
First, prepare the data for the LLM judge by reusing the samples logged by `lm_eval`:
```bash
python prepare_data_for_llm_judge.py --lm_eval_harness_results_path .cache/outputs/example_exp/lm_eval_harness_results/checkpoint-1234
```

Then, set the `OPENAI_API_KEY` environment variable to your OpenAI API key and run the evaluation:
```bash
python llm_judge.py \
    --json_file_path .cache/outputs/example_exp/llm_judge/checkpoint-1234/samples_cap_arc_gen_fewshot.json \
    --judger gpt-4.1
```
