# ACPBench Hard

This repo contains the 8 tasks of ACPBench Hard. To evaluate a model, first generate responses with the LM evaluation harness (`lm_eval`), then score them with the custom evaluation script, as shown below.


### Generate responses from LLM

```bash
lm_eval --model <your-model> \
    --model_args <model-args> \
    --tasks acp_bench_hard \
    --output_path <output-folder> \
    --log_samples \
    --include_path ./configs/tasks
```

### Evaluate the output

> :exclamation: Install the Python [requirements](./requirements.txt) in your environment before running the script.

```bash
python ./src/evaluate_gen.py <output-folder>
```
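
Before scoring, it can help to confirm that the generation step actually wrote per-sample logs. The snippet below is a hypothetical pre-flight check, not part of this repo. It assumes that `--log_samples` writes files named `samples_*.jsonl` somewhere under the output folder, which matches recent LM evaluation harness releases but may differ in your version.

```shell
# Hypothetical helper (not part of this repo): succeeds if the given folder
# contains at least one per-sample log written by `lm_eval --log_samples`.
# Assumption: logs are named samples_*.jsonl; adjust the pattern if needed.
has_sample_logs() {
    dir="$1"
    # Fail early if the folder does not exist.
    [ -d "$dir" ] || return 1
    # Search recursively and keep only the first match.
    first="$(find "$dir" -type f -name 'samples_*.jsonl' | head -n 1)"
    [ -n "$first" ]
}

# Usage sketch: only run the evaluation when sample logs are present.
# has_sample_logs <output-folder> && python ./src/evaluate_gen.py <output-folder>
```

If the check fails, verify that `--log_samples` was passed and that the output folder matches the one given to `lm_eval`.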

### Logs from Experiments
