# Multi-Turn Jailbreak Evaluation Framework

This script evaluates the robustness of large language models (LLMs) against multi-turn jailbreak attacks. It tests how conversation history affects model vulnerability to harmful prompts across offensive and illegal content categories.

## Key Features

- Multi-turn conversation evaluation with/without history
- Wilson confidence interval calculation for statistical rigor
- Async processing with rate limiting and retry logic
- Support for multiple model providers (OpenAI, Anthropic, Google)
- Comprehensive CSV output with detailed results

## Required Setup

### 1. Create Conda Environment and Install Dependencies

```bash
conda create -n jailbreak_eval python=3.10
conda activate jailbreak_eval
pip install -r requirements.txt
```

### 2. Configure API Keys

Create a `.env` file with your API keys:

```
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key  
GEMINI_API_KEY=your_gemini_key
```

### 3. Prepare Dataset Files

Ensure your CSV files have these columns:
- `MultiTurn_Turns_JSON`: JSON array of conversation turns
- `intent`: "offensive" or "illegal"

Default paths:
- Offensive: `data/Offensive_Jailbreaks_5turns_500.csv`
- Illegal: `data/Illegal_Jailbreak_5turns_1000.csv`

## Usage Examples

### Evaluate Single Model
```bash
# Test GPT-4o Mini on both attack types with first 100 rows
python jailbreak_eval.py --model gpt-4o-mini --intent both --rows 1-100

# Test Claude 3 Haiku on offensive content only
python jailbreak_eval.py --model claude-3-haiku --intent offensive

# Quick test with small sample
python jailbreak_eval.py --model gpt-4o --intent offensive --rows 1-10
```

### Evaluate All Models
```bash
# Run all models on illegal content
python jailbreak_eval.py --model all --intent illegal

# Full evaluation across all models and attack types
python jailbreak_eval.py --model all --intent both
```

## Supported Models

- **GPT-4o Mini**: `gpt-4o-mini`
- **GPT-4o**: `gpt-4o`
- **GPT-5**: `gpt-5`
- **GPT-5 Mini**: `gpt-5-mini`
- **GPT-5 Nano**: `gpt-5-nano`
- **Gemini 2.5 Flash**: `gemini-2.5-flash`
- **Claude 3 Haiku**: `claude-3-haiku`

## Output Files

The script generates timestamped output files in `./JailbreakResults5turn/`:

1. **Detailed Results** (`detailed_results_[timestamp].csv`):
   - Individual conversation results
   - Model responses and classifications
   - Success/failure for each attempt

2. **Summary Statistics** (`summary_statistics_[timestamp].csv`):
   - Aggregated success rates by model/intent/history
   - Wilson confidence intervals
   - Formatted statistical summaries

## Performance Notes

- **Concurrency**: Automatically adjusted per provider to respect rate limits
- **Retry Logic**: Exponential backoff for API errors and overload conditions
- **Progress Tracking**: Heartbeat updates every 25 completed evaluations
- **Error Handling**: Comprehensive logging of API failures and retries

## Evaluation Methodology

### Multi-Turn Attack Structure
1. **With History**: Presents all conversation turns sequentially
2. **Without History**: Uses only the final harmful prompt

### Statistical Analysis
- Uses Wilson method for confidence interval calculation
- 95% confidence intervals reported by default
- Success rate calculated as: (successful_attacks / total_attempts)

### Safety Classification
- Independent judge model (Gemini 1.5 Flash) classifies responses
- Strict criteria: any actionable harmful content = successful attack
- Refusal-only responses = failed attack

## Troubleshooting

### Common Issues

1. **API Key Errors**: Ensure all required API keys are set in `.env`
2. **Rate Limits**: Script automatically handles with backoff and retry
3. **Missing Files**: Check dataset file paths in configuration
4. **Memory Issues**: Use `--rows` parameter to process smaller batches

### Performance Optimization

- Use `--rows` parameter for testing before full runs
- Monitor console output for retry/error patterns
- Check API usage limits if experiencing persistent failures

For additional support or questions about the evaluation framework, please refer to the research documentation.