# SCARE: A Benchmark for SQL Correction and Question Answerability Classification for Reliable EHR Question Answering

## Overview

SCARE is a comprehensive benchmark designed to evaluate post-hoc safety mechanisms for Electronic Health Record (EHR) question answering systems. The benchmark addresses the critical need for reliable verification and correction of SQL queries generated by text-to-SQL models in safety-critical clinical environments.

## Abstract

Recent advances in Large Language Models (LLMs) have enabled the development of text-to-SQL models that allow clinicians to query structured data stored in Electronic Health Records (EHRs) using natural language. However, deploying these models for EHR question answering (QA) systems in safety-critical clinical environments remains challenging. Incorrect SQL queries—whether caused by model errors or problematic user inputs—can undermine clinical decision making and jeopardize patient care.

While prior work has mainly focused on improving SQL generation accuracy or filtering questions before execution, there is no standardized benchmark for evaluating independent post-hoc verification mechanisms, which are crucial for safe deployment.

To fill this gap, we introduce **SCARE**, a benchmark for evaluating methods that function as post-hoc safety layers in EHR QA systems. SCARE evaluates the joint task of:
1. **Question Answerability Classification**: Determining whether a question is answerable, ambiguous, or unanswerable
2. **SQL Query Verification and Correction**: Verifying or correcting candidate SQL queries
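
As a rough illustration of what a method under evaluation might return for one (question, candidate SQL) pair, the joint task can be modeled as a label plus an optional query. This is a minimal sketch: the names `Answerability` and `SafetyLayerOutput`, and the rule that only answerable questions carry SQL, are assumptions made for illustration and are not the benchmark's actual output schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Answerability(Enum):
    # The three classes named in the task description above.
    ANSWERABLE = "answerable"
    AMBIGUOUS = "ambiguous"
    UNANSWERABLE = "unanswerable"


@dataclass
class SafetyLayerOutput:
    """Hypothetical result of a post-hoc safety layer for one question."""

    label: Answerability
    sql: Optional[str] = None  # verified or corrected SQL, if any

    def __post_init__(self):
        # Assumed convention: only answerable questions carry a SQL query.
        if self.label is Answerability.ANSWERABLE and not self.sql:
            raise ValueError("answerable outputs need a SQL query")
        if self.label is not Answerability.ANSWERABLE and self.sql is not None:
            raise ValueError("non-answerable outputs should not carry SQL")
```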

## Dataset

The benchmark comprises **4,200 triples** of questions, candidate SQL queries, and expected model outputs, grounded in three major clinical databases:

- **MIMIC-III**: Medical Information Mart for Intensive Care III
- **MIMIC-IV**: Medical Information Mart for Intensive Care IV
- **eICU**: eICU Collaborative Research Database

The dataset covers a diverse set of questions and corresponding candidate SQL queries generated by **seven different text-to-SQL models**.

## Repository Structure

```
├── correction-data/           # SQL correction evaluation data
│   ├── eicu_test_set_correction_data.json
│   ├── mimic_iv_test_set_correction_data.json
│   ├── mimicsql_test_set_correction_data.json
│   └── table_5/              # Detailed results by score ranges
├── data_final/               # Final test datasets
│   ├── *_test.json          # Complete test sets
│   ├── *_ans_test.json      # Answerable questions
│   ├── *_unans_test.json    # Unanswerable questions
│   └── *_ambig_test.json    # Ambiguous questions
├── databases/                # Database files and schemas
│   ├── eicu/
│   ├── mimic_iv/
│   ├── mimicsql/
│   └── tables.json
├── prompt/                   # Prompt templates for different approaches
│   ├── multi_turn_correction/
│   ├── single_turn_correction/
│   ├── two_stage_pipeline/
│   └── verifier_correction/
├── scripts/                  # Evaluation scripts
└── src/                     # Source code
    ├── evaluate_sql_correction.py
    ├── multi_turn_correction.py
    ├── single_turn_correction.py
    ├── table_5.py
    ├── table_6.py
    └── utils/
```
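
Note that the `*_test.json` pattern for the complete test sets also matches the three subset patterns, so a plain glob over `data_final/` returns all four kinds of file. The sketch below groups files by suffix to keep them apart. The helper name `group_test_files` is hypothetical, and the JSON schema inside each file is not assumed here.

```python
from pathlib import Path

# Suffixes taken from the file patterns listed in the tree above.
SUBSET_SUFFIXES = {
    "_ans_test.json": "answerable",
    "_unans_test.json": "unanswerable",
    "_ambig_test.json": "ambiguous",
}


def group_test_files(data_dir):
    """Group the *_test.json files in data_dir by the subset they hold."""
    groups = {"complete": [], "answerable": [], "unanswerable": [], "ambiguous": []}
    for path in sorted(Path(data_dir).glob("*_test.json")):
        for suffix, name in SUBSET_SUFFIXES.items():
            if path.name.endswith(suffix):
                groups[name].append(path)
                break
        else:
            # No subset suffix matched, so treat it as a complete test set.
            groups["complete"].append(path)
    return groups
```

For example, `group_test_files("data_final")["complete"]` would list only the complete test sets.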

## Evaluation Methods

The benchmark supports evaluation of various approaches:

### 1. Two-Stage Pipeline
- Separate classification and correction stages
- Modular approach for better interpretability
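
The control flow of such a pipeline can be sketched as follows. This is an illustration only, not the code in `src/`: `classify` and `correct` are hypothetical callables standing in for the two model stages, and the string labels are assumed.

```python
from typing import Callable, Optional, Tuple


def two_stage_pipeline(
    question: str,
    candidate_sql: str,
    classify: Callable[[str, str], str],
    correct: Callable[[str, str], str],
) -> Tuple[str, Optional[str]]:
    """Stage 1 labels the question; stage 2 runs only if it is answerable."""
    label = classify(question, candidate_sql)
    if label != "answerable":
        # Ambiguous or unanswerable: return the label and no SQL.
        return label, None
    return label, correct(question, candidate_sql)
```

Keeping the stages separate means a wrong final output can be traced to either the classifier or the corrector.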

### 2. Single-Turn Correction
- Direct SQL correction in a single model interaction
- Includes an experimental version with classification

### 3. Verifier-Based Correction
- Uses verification feedback for iterative improvement
- Includes an experimental version with classification

### 4. Multi-Turn Correction
- Iterative correction through multiple model interactions
- Includes an experimental version with classification
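
One possible shape for a multi-turn loop is sketched below. It assumes the databases are SQLite files and that each turn feeds the execution error back to the model; neither assumption is confirmed by this README. `revise` is a hypothetical callable wrapping the model, and the loop does not reproduce `src/multi_turn_correction.py`.

```python
import sqlite3


def execution_feedback(conn, sql):
    """Return None if the query runs, else the SQLite error message."""
    try:
        conn.execute(sql).fetchall()
        return None
    except sqlite3.Error as exc:
        return str(exc)


def multi_turn_correct(question, sql, revise, conn, max_turns=3):
    """Ask `revise` for a new query until it executes or turns run out."""
    for _ in range(max_turns):
        error = execution_feedback(conn, sql)
        if error is None:
            break  # the query executes; this does not prove it is correct
        sql = revise(question, sql, error)
    return sql
```

A query that executes without error can still return the wrong answer, so a loop like this only catches one class of failures.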

