# Refusal Index

This repository contains the source code for the paper **"Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks."** 

## Overview

The **Refusal Index** is a metric for evaluating how well Large Language Models (LLMs) can refuse to answer questions when they lack the necessary knowledge. This capability is crucial for building trustworthy AI systems that avoid hallucination and confidently express uncertainty.

## Prerequisites

### System Requirements

- **Python**: `>=3.11, <3.13` (see `pyproject.toml`)
- **Package Manager**: [uv](https://github.com/astral-sh/uv) for fast dependency management

### Installation

```bash
# Install dependencies
uv sync
```

### API Configuration

Configure the appropriate API keys based on your chosen inference backend:

```bash
export OPENROUTER_API_KEY="your_api_key_here"
export OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"  # Optional, defaults to this
```

```bash
export GOOGLE_API_KEY="your_google_api_key_here"
```

### vLLM Setup

Start an OpenAI-compatible vLLM server:
```bash
# Example server startup
vllm serve your-model-name --port 8001 --host 0.0.0.0
```

## Configuration

### Model Configuration

Model configurations are stored as JSON files in the `model_configs/` directory. See `model_configs/example.json` for a template.

**Common configuration keys:**

| Key | Description | Type | Required |
|-----|-------------|------|----------|
| `model_name` | Model identifier (e.g., "gpt-4", "gemini-2.0-flash") | string | Yes |
| `inference_backend` | Backend type: `openai`, `google`, `vllm`, `vllm_offline` | string | Yes |
| `temperature` | Sampling temperature (0.0 = deterministic) | float | Yes |
| `top_p` | Nucleus sampling parameter | float | Yes |
| `max_tokens` | Maximum output tokens | integer | Yes |
| `suffix` | Text to append to prompts | string | No |
| `max_thinking_tokens` | Tokens for reasoning (Gemini models) | integer | No |

Most configuration values can be overridden via command-line flags during evaluation.

### Supported Datasets

| Dataset | Description | Source |
|---------|-------------|---------|
| `simpleqa` | Basic factual questions | HuggingFace Hub |
| `precisewiki` | Wikipedia-based QA (local JSONL) | Local file required |
| `unanswerable` | Questions with no correct answer | Salesforce FaithEval |
| `counterfactual` | Hypothetical scenarios | Salesforce FaithEval |
| `inconsistent` | Contradictory information tasks | Salesforce FaithEval |


### Output Structure

Each evaluation run creates timestamped output directories:

```
results/YYYYMMDD_HHMMSS-ri-<dataset>-<model>/
├── <dataset>_<model>_<temperature>_<samples>_metrics.json  # Summary metrics
└── panswer_*.json                                         # P(Answer) results (if applicable)

logs/YYYYMMDD_HHMMSS-ri-<dataset>-<model>/
├── <dataset>_<model>.csv                                  # Per-item predictions & grades
├── prompt_*_samples.json                                  # Detailed prompt logs
└── panswer_*.csv                                          # P(Answer) detailed results (if applicable)
```

## Evaluation Methods

### Two-Pass Evaluation

The core evaluation method uses a **two-pass prediction and grading LLM** to measure how well models can identify and refuse questions they cannot answer correctly.

1. **Prediction Phase**: Model generates answers using different prompts (default: `prompts/PROMPT_A.txt` and `prompts/PROMPT_C.txt`)
2. **Grading Phase**: A grader model (same or different) evaluates responses using `prompts/GRADER.txt`
3. **Classification**: Each response receives an A/B/C grade:
   - **A**: Correct answer
   - **B**: Incorrect answer  
   - **C**: Explicit refusal (model says "I don't know" or similar)


```bash
# Basic evaluation with OpenRouter/Gemini/vLLM
uv run python -m src.evaluation.evaluate \
  --model_config example \
  --dataset_name simpleqa \
  --max_samples 2000 \
  --num_proc 50
```


**Required Parameters:**
- `--model_config`: Name of JSON config file in `model_configs/` (without `.json` extension)

**Dataset & Runtime Options:**
| Argument | Options | Default | Description |
|----------|---------|---------|-------------|
| `--dataset_name` | `simpleqa`, `trivialqa`, `mix`, `math`, `precisewiki`, `precisewikiref`, `unanswerable`, `counterfactual`, `inconsistent` | - | Dataset to evaluate |
| `--max_samples` | Integer | 6000 | Maximum samples to process |
| `--num_proc` | Integer | 50 | Number of parallel processes |
| `--results_base` | Path | `results` | Base directory for results |
| `--logs_base` | Path | `logs` | Base directory for logs |
| `--verbose` | Flag | False | Enable verbose logging |

**Grading Configuration:**
- `--use_same_model_for_grading`: Use the same model for grading (default: False)
- `--grader_backend`: Separate grader backend if not using same model
- `--grader_model`: Separate grader model name

**Authentication:**
- `--google_api_key`: Google API key (defaults to `GOOGLE_API_KEY` environment variable)

The evaluation produces output files:

**CSV Results** (`logs/.../dataset_model.csv`):
- `question`, `answer`: Original question-answer pairs
- `pred_raw{i}`, `pred{i}`: Raw and processed predictions for each prompt
- `grade{i}`: A/B/C grades for each prompt
- `char_count{i}`: Character counts for response length analysis

**JSON Metrics** (`results/.../dataset_model_temp_samples_metrics.json`):
- Per-prompt summary statistics
- Accuracy, refusal rates, and derived metrics
- Used as input for Refusal Index computation

**Detailed Logs** (`logs/.../prompt_*_samples.json`):
- Complete prediction details in Apricot-compatible format
- Useful for debugging and manual inspection

### P(Answer) Evaluation

**P(Answer)** evaluation estimates the probability that a model attempts to answer a question (as opposed to refusing). 

1. **Multiple Sampling**: Generate multiple responses per question using higher temperature
2. **Classification**: Grade each response as attempting an answer (A/B) or refusing (C)
3. **Probability Estimation**: Calculate P(Answer) as the fraction of attempted responses
4. **Calibration Analysis**: Compute calibration metrics (ECE, RMSCE, Brier Score, AUROC)
5. **Visualization**: Generate calibration plots and reliability diagrams

```bash
# P(Answer) evaluation with calibration analysis
uv run python -m src.evaluation.evaluate_panswer \
  --model_config example \
  --dataset_name simpleqa \
  --prompt_template PROMPT_D \
  --max_samples 2000 \
  --num_proc 50 \
  --n_samples_panswer 40 \
  --panswer_temperature 1.0
```


**P(Answer) Specific Options:**
- `--prompt_template`: Prompt template file (e.g., `PROMPT_B`, `PROMPT_C`, `PROMPT_D`)
- `--n_samples_panswer`: Number of samples per question for probability estimation (default: 40)
- `--panswer_temperature`: Temperature for P(Answer) sampling (default: 1.0)  
- `--batch_size`: Batch size for efficient processing

**Inherited Options:**
All other parameters work the same as two-pass evaluation (`--model_config`, backend overrides, etc.)


**Output Files:**
- `panswer_*.csv`: Per-question P(Answer) probabilities and sample details
- `panswer_metrics_*.json`: Calibration metrics and summary statistics  
- Calibration plots 

**Calibration Metrics:**
- **ECE** (Expected Calibration Error): Measures calibration quality (lower = better)
- **RMSCE** (Root Mean Square Calibration Error): Alternative calibration measure
- **Brier Score**: Overall prediction accuracy (lower = better)
- **AUROC**: Discrimination ability (higher = better, 0.5 = random)

## Refusal Index Computation

The **Refusal Index (RI)** is the core metric of this framework, measuring the correlation between a model's knowledge (correctness) and its willingness to refuse answering questions.

### Computing RI with Confidence Intervals

For robust statistical analysis, compute RI with bootstrap confidence intervals:

```bash
# Full analysis with bootstrap CIs (100 bootstrap samples)
uv run python -m src.experiments.refusal_index_analysis_ci

# Quick test (10 bootstrap samples)
uv run python -m src.experiments.refusal_index_analysis_ci --test-mode
```

**Outputs:**
- `refusal_index_with_ci.csv`: RI values with confidence intervals
- `refusal_index_with_ci.png`: Visualization of results across models/prompts

### Additional Metrics

Compute traditional accuracy, refusal rates, and composite metrics:

```bash
uv run python -m src.experiments.simple_metric_analysis_ci \
  --dataset simpleqa \
  --output results/simple_metrics_with_ci.csv
```

**Computed metrics:**
- **Accuracy (c)**: Fraction of correct answers
- **Refusal Rate (r)**: Fraction of questions refused
- **Correct-Attempted**: Accuracy on non-refused questions
- **F-Score**: Harmonic mean of precision and recall
- **Weighted Metric**: Composite score balancing accuracy and refusal

## Complete Workflow

Here's the recommended end-to-end workflow for reproducing the paper's results:

### Step 1: Setup
```bash
# Install dependencies
uv sync

# Configure API keys
export OPENROUTER_API_KEY="your_key"  # or GOOGLE_API_KEY
```

### Step 2: Create Model Configurations
```bash
# Create JSON config files in model_configs/
# Example: model_configs/gpt4.json, model_configs/gemini.json
```

### Step 3: Run Evaluations
```bash
# Two-pass evaluation for each model/dataset combination
uv run python -m src.evaluation.evaluate \
  --model_config your_model \
  --dataset_name simpleqa \
  --max_samples 2000

# Optional: P(Answer) evaluation for calibration analysis  
uv run python -m src.evaluation.evaluate_panswer \
  --model_config your_model \
  --dataset_name simpleqa \
  --n_samples_panswer 40
```

### Step 4: Create Evaluation Manifest
```bash
# Create evaluation_runs_core_datasets.json pointing to your results
# See "Computing Basic Refusal Index" section for format
```

### Step 5: Compute Metrics
```bash
# Refusal Index with confidence intervals
uv run python -m src.experiments.refusal_index_analysis_ci

# Simple metrics with confidence intervals
uv run python -m src.experiments.simple_metric_analysis_ci \
  --dataset simpleqa \
  --output results/simple_metrics.csv
```

### Step 6: Generate Visualizations
```bash
# Open and run the visualization notebooks:
# - src/visualization/plot_calibration.ipynb
# - src/visualization/plot_comparison_auroc_panswer.ipynb  
# - src/visualization/table_stability_prompt.ipynb
```