# README: Qualitative Comparison of Unique Successful Prompts

This document provides a qualitative analysis of successful attack prompts generated during the red-teaming evaluation of Large Language Models (LLMs).

---

## ⚠️ WARNING: HARMFUL CONTENT

This repository and the associated file `prompt_comparison_full_0129.html` contain **highly toxic and harmful language**. These prompts were generated during the research process to evaluate model vulnerabilities and are provided here for transparency and academic review only. **Viewer discretion is strongly advised.**

---

## 1. Overview
The comparison tool (`prompt_comparison_full_0129.html`) visualizes the diversity and effectiveness of attack prompts generated by four distinct methods:
* **🔴 Ours**: The method proposed in the accompanying research paper.
* **🟢 GFN**: Generative Flow Network-based baseline.
* **🟣 Rainbow Teaming**: A diversity-focused red-teaming framework.
* **🟠 Jailbreak-r1**: A state-of-the-art jailbreak generation baseline.

## 2. Methodology
### Greedy Clustering
To analyze the breadth of the attack surface, we performed **Greedy Clustering** on the successful prompts generated by each method:
* **Threshold**: We used the similarity threshold defined in the paper's experimental setup to decide which prompts count as semantically similar.
* **Clustering Logic**: Redundant or highly similar prompts are merged into a single cluster, so that each cluster represents one unique type of successful attack produced by a given method.
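
The clustering step can be illustrated with a minimal sketch. It is not the implementation used in the paper: it assumes each prompt has already been mapped to an embedding vector, uses cosine similarity against the first member of each cluster, and takes the threshold as a parameter. The function names and the value `0.9` in the example are hypothetical; the real threshold is the one given in the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_cluster(
    embeddings: Sequence[Sequence[float]], threshold: float
) -> List[List[int]]:
    """Greedy clustering (assumed variant, not confirmed by the paper).

    Items are visited in order. Each item joins the first existing cluster
    whose representative (its first member) has similarity >= threshold;
    otherwise the item starts a new cluster.
    Returns clusters as lists of item indices.
    """
    clusters: List[List[int]] = []
    for i, vec in enumerate(embeddings):
        for members in clusters:
            representative = embeddings[members[0]]
            if cosine_similarity(vec, representative) >= threshold:
                members.append(i)
                break
        else:
            # No sufficiently similar cluster was found.
            clusters.append([i])
    return clusters


# Toy 2-D "embeddings": items 0 and 1 point the same way, as do 2 and 3.
toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
toy_clusters = greedy_cluster(toy_embeddings, threshold=0.9)  # hypothetical threshold
print(toy_clusters)  # [[0, 1], [2, 3]]
```

In this sketch the length of each returned cluster would correspond to the cluster size `n` shown in the visualization.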

## 3. Data Fields & Metrics
Each prompt entry in the visualization includes the following metadata:

| Field | Description |
| :--- | :--- |
| **`n` (Cluster Size)** | The total number of successful prompts belonging to this cluster. A larger `n` means the method generated this type of attack more often. |
| **Toxic** | The toxicity score of the generated instruction, as measured by our evaluation framework (range: 0.0 to 1.0). |
| **Category** | The harm category assigned to the prompt, such as **Fraud/Scam**, **Adult Content**, **Privacy**, or **Hate Speech**. |
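
For readers who want to work with these fields outside the browser, the following sketch shows one way an entry could be represented and ranked. The class name `ClusterEntry`, its attribute names, and the helper `top_by_cluster_size` are hypothetical; the HTML file's internal data format is not documented here, and the placeholder strings stand in for real prompt text.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ClusterEntry:
    """One cluster as described in the table above (hypothetical schema)."""

    text: str       # representative prompt text
    n: int          # cluster size
    toxic: float    # toxicity score in [0.0, 1.0]
    category: str   # harm category, e.g. "Privacy"

    def __post_init__(self) -> None:
        if self.n < 1:
            raise ValueError("cluster size n must be at least 1")
        if not 0.0 <= self.toxic <= 1.0:
            raise ValueError("toxic must lie in [0.0, 1.0]")


def top_by_cluster_size(entries: Sequence[ClusterEntry], k: int) -> List[ClusterEntry]:
    """Return the k largest clusters, breaking ties by higher toxicity."""
    return sorted(entries, key=lambda e: (-e.n, -e.toxic))[:k]


# Placeholder entries; no real prompt text is reproduced here.
entries = [
    ClusterEntry("<prompt A>", n=12, toxic=0.91, category="Fraud/Scam"),
    ClusterEntry("<prompt B>", n=3, toxic=0.97, category="Privacy"),
    ClusterEntry("<prompt C>", n=12, toxic=0.85, category="Hate Speech"),
]
largest = top_by_cluster_size(entries, k=2)
print([e.text for e in largest])  # ['<prompt A>', '<prompt C>']
```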

## 4. Usage
Open the `prompt_comparison_full_0129.html` file in any modern web browser. You can:
* **Search**: Filter prompts by keyword using each method's search bar.
* **Sort**: Organize results by **Cluster Size (`n`)**, **Toxicity**, or **Category**.
* **Compare**: Visually inspect the qualitative differences in instruction text between our proposed method and the baselines.