### Artifact submission to ICLR 2026

### Description
VerifyThisBench is a new benchmark designed to evaluate LLMs on end-to-end program verification tasks that require interpreting natural language problem descriptions, formulating formal specifications, generating code, and constructing correctness proofs. 

The dataset is under `/VerifyThisBench`, organized by year. Each challenge contains a `descriptions.txt` file and its task files.
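As a rough illustration of browsing this layout, the sketch below groups challenge folders by year. It assumes that the first directory level under the dataset root is the year and that every challenge folder holds a `descriptions.txt`; the helper name `list_challenges` is hypothetical and not part of the artifact.

```python
from collections import defaultdict
from pathlib import Path


def list_challenges(root):
    """Group challenge directories by top-level folder (assumed to be the year)."""
    root = Path(root)
    by_year = defaultdict(list)
    # rglob avoids assuming how deeply challenges are nested under each year.
    for desc in sorted(root.rglob("descriptions.txt")):
        rel = desc.parent.relative_to(root)
        if not rel.parts:
            continue  # a descriptions.txt directly under root has no year folder
        by_year[rel.parts[0]].append(desc.parent)
    return dict(by_year)
```

Calling `list_challenges("VerifyThisBench")` from the repository root should return a dictionary mapping each year folder to its challenge directories, provided the layout matches the assumption above.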

The relaxed version is under `/VerifyThisBenchXS`, organized by year and tool. `solution.*` is the human-written solution. The variants are `fill-implementation`, `fill-specification`, and `fill-loop-invariant`. `split` indicates that a partial solution of that form is given.

Example system prompts and an example coherent prompt are under `/prompts`. Docker files to set up the evaluation environments are available under `/evns`. Example scripts to evaluate LLMs are under `/scripts`.

### Example Usage
Before running the scripts, set up the API clients and adjust the paths, Docker image names, and output path inside them.

To evaluate on VerifyThisBench:
```bash
python query_with_feedback.py --tool dafny --attempt 5 --timelimit 60
```

To evaluate on VerifyThisBenchXS:
```bash
python query_relaxed_with_feedback.py --attempt 5 --timelimit 60
```
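To launch several runs from Python instead of the shell, a small wrapper along these lines may help. It is only a sketch: `build_command` and `run_evaluation` are hypothetical helpers, and the only flags used are the ones that appear in the commands above.

```python
import subprocess
import sys


def build_command(script, attempt=5, timelimit=60, tool=None):
    """Return the argument list for one evaluation run."""
    cmd = [sys.executable, script]
    if tool is not None:
        # --tool appears only in the VerifyThisBench command above.
        cmd += ["--tool", tool]
    cmd += ["--attempt", str(attempt), "--timelimit", str(timelimit)]
    return cmd


def run_evaluation(script, **kwargs):
    """Run one evaluation script and raise if it exits with an error."""
    subprocess.run(build_command(script, **kwargs), check=True)
```

For example, `run_evaluation("query_with_feedback.py", tool="dafny")` would issue the same command as the first example, using the current Python interpreter.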


