# DDOR: Automated Overrefusal Prompt Generation & Repair with Delta Debugging

> **Paper-aligned code** · ICLR 2026 (under review): *Automated Overrefusal Prompt Generation and Repair with Delta Debugging*
> 
> This repository implements **DDOR**, an automated and interpretable framework for stress-testing and repairing **over-refusal** in large language models (LLMs).
> 
> DDOR combines **delta debugging** to extract **minimal refusal-trigger fragments (mRTFs)**, large-scale **contextual expansion**, **dual-model chain-of-thought filtering** to ensure “dangerous-looking but safe” prompts, and **targeted repair** to preserve semantics while reducing unnecessary refusals.
>
> Our code and results are available at https://anonymous.4open.science/r/DDOR.
---

## Highlights
* **Interpretability** – isolates refusal-triggering words/phrases (mRTFs) rather than only sentence-level cues.
* **Scalable generation** – automatically constructs large, model-specific test sets that achieve higher over-refusal rates than existing benchmarks.
* **Safe and repairable** – dual-model CoT filtering keeps only prompts that both judges rate as contextually safe, and mRTF-guided rewrites reduce refusals while preserving meaning.

---

## Method Overview
1.  **mRTF Extraction (Minimization):** Starting from a refusal-inducing seed prompt, a two-stage *delta debugging* (`ddmin`) search alternates sentence-level and word-level splits until it finds the smallest fragment that still consistently triggers a refusal.
2.  **Expansion:** Extracted mRTFs are paired or composed and inserted into diverse safe contexts to generate prompts that appear risky but remain harmless. Topic classification keeps the pairings coherent.
3.  **Filtering:** Two independent judge models (e.g., `gpt-4o-mini` and `gemini-2.5-flash`) perform chain-of-thought reasoning and assign a 1–5 safety score; prompts whose combined score exceeds 6 are discarded, so only contextually safe samples remain.
4.  **Repair:** For safe prompts that still elicit refusals, the pipeline rewrites *only* the mRTF and leaves the rest of the sentence intact, which enables fine-grained comparison and reduces over-refusal.
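
The extraction step can be sketched as follows. This is a minimal illustration of classic `ddmin` applied first to sentences and then to words, not the code in `ddmin.py`. The names `ddmin`, `extract_mrtf`, and `toy_oracle` are hypothetical, the sketch runs a single sentence pass followed by a single word pass rather than alternating, and the oracle is a keyword stub standing in for repeated queries to the target model.

```python
import re
from typing import Callable, List, Sequence


def ddmin(units: Sequence[str], triggers: Callable[[List[str]], bool]) -> List[str]:
    """Shrink `units` while `triggers` keeps returning True (classic ddmin)."""
    units = list(units)
    n = 2  # current granularity: target number of chunks
    while len(units) >= 2:
        size = max(1, len(units) // n)
        chunks = [units[i:i + size] for i in range(0, len(units), size)]
        reduced = False
        for i, subset in enumerate(chunks):
            complement = [u for j, c in enumerate(chunks) if j != i for u in c]
            if triggers(subset):          # one chunk alone still triggers
                units, n, reduced = subset, 2, True
                break
            if triggers(complement):      # this chunk can be dropped
                units, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(units):           # already at single-unit granularity
                break
            n = min(len(units), n * 2)    # refine the split and retry
    return units


def extract_mrtf(prompt: str, triggers_refusal: Callable[[str], bool]) -> str:
    """Stage 1 minimizes over sentences, stage 2 over words of what remains."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", prompt.strip()) if s]
    kept = ddmin(sentences, lambda ss: triggers_refusal(" ".join(ss)))
    words = " ".join(kept).split()
    core = ddmin(words, lambda ws: triggers_refusal(" ".join(ws)))
    return " ".join(core)


def toy_oracle(text: str) -> bool:
    """Hypothetical stand-in for 'the target model refuses this text'."""
    lowered = text.lower()
    return "kill" in lowered and "process" in lowered


seed = ("I am debugging a server. How do I kill a stuck process on Linux? "
        "Thanks for the help.")
mrtf = extract_mrtf(seed, toy_oracle)
```

In practice the oracle would call the target model several times per candidate and report a refusal only when the outcome is consistent, since a single sampled response can be noisy.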

---

## Repository Structure
```
Dataset/                                    # Benchmark CSV/JSONL splits
Overrefusal Dataset Construction/
  Expansion/
    step1_labeler.py                        # Topic labeling for mRTFs
    step2_expansion_baseline.py             # Baseline full-sentence expansion
    step2_expansion_DDOR.py                 # DDOR mRTF-pair expansion
  Filtering/
    filter_score1.py                        # Judge A: CoT + scoring
    filter_score2.py                        # Judge B: CoT + scoring
    filter_final.py                         # Final merge and export
  data/<model>/reevaluate.py                # Re-evaluate on target model
Overrefusal Prompt Repair/
  fix_baseline.py                           # Baseline full-sentence repair
  fix_DDOR.py                               # DDOR mRTF-focused repair
  analyze_refusal.py                        # Over-refusal statistics
  emb.py                                    # Pair construction + embeddings
  compute_cosine.py                         # Cosine similarity analysis
Refusal-Trigger Extraction/
  gpt-5/ddmin.py                            # mRTF extraction (sentence→word)
  analyze_mRTF/run_embedding.py             # Embedding for clustering
  analyze_mRTF/run_clustering.py            # Dimensionality reduction & clustering
```

---

## Installation
* **Python ≥ 3.9**
* Install dependencies:
    ```bash
    pip install -U requests pandas numpy tqdm tenacity scikit-learn umap-learn hdbscan plotly nltk google-generativeai openai
    ```
### API keys
Set your OpenAI (`OPENAI_API_KEY`) and Google (`GEMINI_API_KEY`) keys as environment variables, or assign them at the top of each script.
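
If you go the environment-variable route, a small check like the one below can fail fast when a key is missing. This is only a sketch: `load_api_keys` is a hypothetical helper, not a function from this repository, and the scripts may read the keys differently.

```python
import os
from typing import Dict, Mapping


def load_api_keys(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return both API keys, raising if either one is unset or empty."""
    names = ("OPENAI_API_KEY", "GEMINI_API_KEY")
    keys = {name: env.get(name, "") for name in names}
    missing = [name for name, value in keys.items() if not value]
    if missing:
        raise RuntimeError("Missing API key(s): " + ", ".join(missing))
    return keys
```

Calling `load_api_keys()` at the start of a run surfaces a missing key before any paid API request is made.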

---

## Reproducing the Paper

#### 1️⃣ Prepare Seeds
Use safe or borderline datasets such as OR-Bench Hard or XS Safety as seeds.

#### 2️⃣ Extract mRTFs
```bash
python "Refusal-Trigger Extraction/gpt-5/ddmin.py" \
  --input seeds.jsonl --model <target_model> --out mrtf.jsonl
```

#### 3️⃣ Expansion

```bash
python "Overrefusal Dataset Construction/Expansion/step1_labeler.py" --in mrtf.jsonl --out labeled.jsonl
python "Overrefusal Dataset Construction/Expansion/step2_expansion_DDOR.py" --in labeled.jsonl --out expanded.jsonl
```
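
The topic-coherent pairing behind the expansion step can be illustrated as below. This is a sketch under assumptions: the record fields `mrtf` and `topic` and the helper `pair_by_topic` are hypothetical, and the actual scripts additionally prompt an LLM to embed each pair in a safe context, which is omitted here.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List


def pair_by_topic(labeled: Iterable[Dict[str, str]]) -> List[dict]:
    """Pair distinct mRTFs only when they share the same topic label."""
    by_topic = defaultdict(list)
    for item in labeled:
        by_topic[item["topic"]].append(item["mrtf"])
    pairs = []
    for topic, fragments in by_topic.items():
        # sorted(set(...)) removes duplicates and makes the output deterministic
        for first, second in combinations(sorted(set(fragments)), 2):
            pairs.append({"topic": topic, "pair": (first, second)})
    return pairs


labeled = [
    {"mrtf": "kill process", "topic": "computing"},
    {"mrtf": "exploit bug", "topic": "computing"},
    {"mrtf": "lethal dose", "topic": "medicine"},
]
pairs = pair_by_topic(labeled)
```

A topic with a single mRTF yields no pair, so in this toy input only the two `computing` fragments are combined.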

#### 4️⃣ Dual-Model Filtering

```bash
python "Overrefusal Dataset Construction/Filtering/filter_score1.py" --in expanded.jsonl --out s1.jsonl
python "Overrefusal Dataset Construction/Filtering/filter_score2.py" --in s1.jsonl --out s2.jsonl
python "Overrefusal Dataset Construction/Filtering/filter_final.py" --in s2.jsonl --out ddor_final.jsonl
```
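
The merge rule applied in the final filtering step can be sketched as follows. It assumes the combined score is the sum of the two judges' 1–5 scores and that higher means less safe; the field names `score_a` and `score_b` and both helper functions are hypothetical, not taken from `filter_final.py`.

```python
from typing import Dict, Iterable, List


def combined_score(score_a: int, score_b: int) -> int:
    """Sum two judge scores after checking each lies in the 1-5 range."""
    for score in (score_a, score_b):
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range 1-5: {score}")
    return score_a + score_b


def filter_safe(records: Iterable[Dict], threshold: int = 6) -> List[Dict]:
    """Keep records whose combined score does not exceed the threshold."""
    return [
        r for r in records
        if combined_score(r["score_a"], r["score_b"]) <= threshold
    ]


records = [
    {"prompt": "How do I kill a stuck process on Linux?", "score_a": 1, "score_b": 2},
    {"prompt": "placeholder prompt B", "score_a": 4, "score_b": 3},
    {"prompt": "placeholder prompt C", "score_a": 3, "score_b": 3},
]
safe = filter_safe(records)
```

With a threshold of 6, a prompt scored 3 and 3 is kept while one scored 4 and 3 is discarded.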

#### 5️⃣ Model Evaluation

```bash
python "Overrefusal Dataset Construction/data/<model>/reevaluate.py" --in ddor_final.jsonl --model <target_model> --log eval.jsonl
python "Overrefusal Prompt Repair/analyze_refusal.py" --in eval.jsonl
```

#### 6️⃣ Targeted Repair

```bash
python "Overrefusal Prompt Repair/fix_DDOR.py" --in ddor_final.jsonl --out repaired.jsonl
python "Overrefusal Prompt Repair/emb.py" --pairs repaired.pairs.csv --emb repaired.embeddings.npz
python "Overrefusal Prompt Repair/compute_cosine.py" --pairs repaired.pairs.csv
```
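
The similarity measure used to compare an original prompt with its repaired version is cosine similarity over embeddings. A dependency-free sketch is shown below; `cosine_similarity` is a hypothetical helper and the vectors are toy values, not real embeddings or results from `compute_cosine.py`.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


original_vec = [1.0, 2.0, 3.0]   # toy stand-in for the original prompt embedding
repaired_vec = [2.0, 4.0, 6.0]   # toy stand-in for the repaired prompt embedding
similarity = cosine_similarity(original_vec, repaired_vec)
```

A value near 1 indicates that the repair changed the embedding direction very little, which is the sense in which a rewrite preserves semantics.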

---

## Data & Assets

* Final JSONL/CSV benchmarks and intermediate outputs are stored in `Dataset/` and model-specific subdirectories.
* Clustering scripts (`run_clustering.py`) support UMAP + KMeans/HDBSCAN/DBSCAN visualization of mRTF semantics.

---

## License & Disclaimer

* The code and datasets are released solely for research on safe LLM deployment.
* Do not use unfiltered or unsafe prompts in real-world applications.

