## ReaLMistake Error Detection Benchmark

Benchmark and code for the paper "Evaluating LLMs at Detecting Errors in LLM Responses".

### Directory Structure

```
.
├── README.md
├── data  # ReaLMistake benchmark
├── sh  # scripts for evaluation
├── src  # evaluation code
├── dataset_stats  # statistics of ReaLMistake benchmark and performance of easy baselines
├── error_detection_performance
│   ├── table  # tables for error detection performance
│   └── performance  # JSON files for error detection performance
└── error_detection_outputs  # all outputs from LLM-based error detectors
```
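
As a quick sanity check after cloning, the layout above can be verified with a short script. This is only a sketch and is not part of the repository: the helper name `missing_dirs` and the `EXPECTED_DIRS` list are hypothetical, and it assumes the script is run from the repository root.

```python
from pathlib import Path

# Directories listed in the tree above, relative to the repository root.
# This list is copied from the README, not read from the repository itself.
EXPECTED_DIRS = [
    "data",
    "sh",
    "src",
    "dataset_stats",
    "error_detection_performance/table",
    "error_detection_performance/performance",
    "error_detection_outputs",
]


def missing_dirs(repo_root="."):
    """Return the expected directories that are absent under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All expected directories are present.")
```

An empty result from `missing_dirs()` means every directory named in the tree was found.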
