# Using `lm_eval` with the ETR Case Generator

## Generate Problems

Run this command to generate problems:

```bash
python scripts/generate_etr_2.py --save_file_name dev --question_type=all --generate_function=random_etr_problem -n 10
```

This writes the generated problems to files in the `datasets/` directory whose names start with `dev`, the value passed to `--save_file_name`.
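To confirm what was generated, a short script along these lines can count the records in each output file. This is a minimal sketch, not part of the repository: `summarize_generated` is a hypothetical helper, and it assumes the outputs are JSON Lines files with a `.jsonl` extension (as the `dev_yes_no.jsonl` path used in the evaluation step suggests).

```python
import json
from pathlib import Path


def summarize_generated(datasets_dir, prefix):
    """Map each JSONL file whose name starts with `prefix` to its record count.

    Hypothetical helper: assumes one JSON object per non-empty line.
    """
    counts = {}
    for path in sorted(Path(datasets_dir).glob(f"{prefix}*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            # json.loads raises on a malformed line, which doubles as a sanity check.
            records = [json.loads(line) for line in handle if line.strip()]
        counts[path.name] = len(records)
    return counts
```

For example, `summarize_generated("datasets", "dev")` run from the repository root would return a dictionary keyed by file name.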

You can use flags like these:

... # TODO Document flags

## Evaluate Problems

First, make sure the [`lm-evaluation-harness`](https://github.com/EleutherAI/lm-evaluation-harness) repository is cloned and installed. To clone it, run `git clone https://github.com/EleutherAI/lm-evaluation-harness.git`.

Then run this command to evaluate the generated problems, replacing the `--dataset` value with the path to your own generated file:

```bash
lm_eval/tasks/etr_problems/run_evaluation.sh --dataset /home/keenan/Dev/etr_case_generator/datasets/dev_yes_no.jsonl
```

Alternatively, use this fuller command, which also specifies the full path to the `lm-evaluation-harness` repository (`-p`), the full path to the `etr_case_generator` repository (`-i`), and the model to evaluate (`-m`):

```bash
lm_eval/tasks/etr_problems/run_evaluation.sh --dataset /home/keenan/Dev/etr_case_generator/datasets/dev_yes_no.jsonl -p /path/to/lm-evaluation-harness -i /path/to/etr_case_generator -m gpt-4-turbo
```

Running this command prints the results of the evaluation. The full results of the run are saved in `lm_eval/tasks/etr_problems/results/`. In particular, the `samples` JSONL file there contains `resps` objects, which hold the model's responses to the problems.
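If you want to read the responses programmatically, something like the following may help. It is a sketch under stated assumptions: the helper names `latest_samples_file` and `load_responses` are hypothetical, the samples file is assumed to have `samples` in its name and a `.jsonl` extension, and it may sit in a subdirectory of the results directory. Only the `resps` key comes from the description above.

```python
import json
from pathlib import Path
from typing import Optional


def latest_samples_file(results_dir) -> Optional[Path]:
    """Return the most recently modified samples file, or None if there is none.

    Assumption: samples files match the pattern "*samples*.jsonl".
    """
    candidates = list(Path(results_dir).rglob("*samples*.jsonl"))
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def load_responses(samples_path) -> list:
    """Collect the "resps" value of every record (None where the key is absent)."""
    responses = []
    with Path(samples_path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                responses.append(json.loads(line).get("resps"))
    return responses
```

Calling `latest_samples_file("lm_eval/tasks/etr_problems/results/")` and passing the result to `load_responses` yields one entry per evaluated problem.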

## View Results

The evaluation results are stored in `lm_eval/tasks/etr_problems/results/`. To view them in a more human-readable format, install the `pprint_problems` utility:

```bash
pip install pprint_problems
```

Then, run this command to view the results:

```bash
pprint_problems --dir_most_recent lm_eval/tasks/etr_problems/results/ -p doc/question resps correct doc/scoring_guide/etr_conclusion doc/scoring_guide/etr_conclusion_is_categorical -n 3 -r
```

This prints the questions, the model's responses, the correct answers, and the scoring guide's ETR conclusion fields. Adjust the `-n` flag to print more or fewer results. The `-r` flag randomizes the order of the results, and `--dir_most_recent` selects the most recently modified file in the given directory.

You can look at the structure of these problems with this command:

```bash
pprint_problems --dir_most_recent lm_eval/tasks/etr_problems/results/ --structure
```

Use the `-p` flag to choose which parts of each problem to print. For example, `-p doc/question resps` prints only the questions and the model's responses. The available parts are the keys shown in the `--structure` output.
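The slash-separated parts read as paths into each nested JSON record. The sketch below illustrates that reading only. It is not the `pprint_problems` implementation, `get_path` is a hypothetical helper, and the example record uses placeholder values with just the key names taken from the commands above.

```python
def get_path(record, path, default=None):
    """Follow a slash-separated path such as "doc/question" through nested dicts."""
    current = record
    for key in path.split("/"):
        if not isinstance(current, dict) or key not in current:
            return default  # path does not exist in this record
        current = current[key]
    return current


# Placeholder record: the values are invented for illustration.
example = {
    "doc": {
        "question": "<question text>",
        "scoring_guide": {"etr_conclusion": "<conclusion>"},
    },
    "resps": [["<model response>"]],
}

for part in ["doc/question", "resps", "doc/scoring_guide/etr_conclusion"]:
    print(part, "->", get_path(example, part))
```

Under this reading, a top-level key such as `resps` is simply a path with one segment.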

You can generate some graphs with this command:

```bash
pprint_problems --dir_most_recent lm_eval/tasks/etr_problems/results/ --graph --parts vocab_size max_disjuncts num_variables num_disjuncts num_premises --min_n 10 --use_multiple_colors
```
