# SafeHarbor

A Memory Tree-based safety defense system for LLM Agents, designed to detect and block harmful requests while minimizing false refusals.

## Overview

This project implements a multi-layered safety defense system:

1. **Safety Projector**: A prototype-based metric learning model that computes a harmful probability (`harmful_prob`) and a maximum benign similarity (`max_benign_sim`) for each request
2. **Memory Tree (RiskTree)**: A tree-based retrieval system for safety rules with dynamic routing
3. **Proxy Server**: An intercepting proxy that evaluates requests before forwarding to the target LLM

## Project Structure

```
├── proxy_server.py          # Main proxy server with safety evaluation
├── src/
│   ├── risk_tree.py         # Memory Tree implementation
│   ├── SafetyProjector.py   # Safety Projector model
│   ├── llama_guard.py       # Llama Guard integration
│   └── models/              # Pre-trained model weights
├── benchmark/               # AgentHarm benchmark grading functions
├── agents/                  # Agent implementations
├── Agent-SafetyBench/       # Agent-SafetyBench integration
├── A_mem/                   # A-mem retrieval system
└── code/                    # GuardAgent and evaluation code
```

## Key Components

### Safety Projector

- Calculates `harmful_prob` (0.0-1.0) and `max_benign_sim` (0.0-1.0)
- Uses prototype-based metric learning with margin and lambda hyperparameters

### Memory Tree Routing

- **SAFE Branch**: `harmful_prob < 0.30` AND `max_benign_sim > 0.80` → Skip LLM judgment
- **BLOCK Branch**: `harmful_prob > 0.92` → Requires LLM confirmation
- **AMBIGUOUS Branch**: All other cases (including `harmful_prob < 0.30` with `max_benign_sim <= 0.80`) → Requires LLM judgment
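
The routing thresholds above can be sketched as a small function. This is a minimal illustration of the documented rules, not the code in `src/risk_tree.py`; the names `Branch` and `route` and the constant names are hypothetical.

```python
from enum import Enum

# Thresholds copied from the routing rules above; constant names are hypothetical.
SAFE_HARMFUL_MAX = 0.30
SAFE_BENIGN_MIN = 0.80
BLOCK_HARMFUL_MIN = 0.92


class Branch(Enum):
    SAFE = "SAFE"            # skip LLM judgment
    BLOCK = "BLOCK"          # requires LLM confirmation
    AMBIGUOUS = "AMBIGUOUS"  # requires LLM judgment


def route(harmful_prob: float, max_benign_sim: float) -> Branch:
    """Map Safety Projector scores to a routing branch (strict inequalities)."""
    if harmful_prob < SAFE_HARMFUL_MAX and max_benign_sim > SAFE_BENIGN_MIN:
        return Branch.SAFE
    if harmful_prob > BLOCK_HARMFUL_MIN:
        return Branch.BLOCK
    return Branch.AMBIGUOUS
```

Because the comparisons are strict, boundary values such as `harmful_prob == 0.30` or `harmful_prob == 0.92` fall into the AMBIGUOUS branch in this sketch.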

### First LLM Judgment

- Uses the retrieved benign boundary rules and exemptions as context
- Follows the "Presumption of Utility" principle
- Outputs a `SAFE` or `HARMFUL` verdict
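
As a rough sketch of how a caller might normalize the verdict, the helper below accepts either a bare label or a JSON object. The function name `parse_verdict` and the JSON key `"verdict"` are assumptions made for illustration; the actual response schema used by the proxy may differ.

```python
import json

VALID_VERDICTS = {"SAFE", "HARMFUL"}


def parse_verdict(raw: str, output_format: str = "label") -> str:
    """Return "SAFE" or "HARMFUL" from a raw LLM response, or raise ValueError."""
    if output_format == "json":
        payload = json.loads(raw)
        if not isinstance(payload, dict):
            raise ValueError("expected a JSON object")
        # The "verdict" key is a hypothetical field name.
        label = str(payload.get("verdict", ""))
    else:
        label = raw
    label = label.strip().upper()
    if label not in VALID_VERDICTS:
        raise ValueError(f"unrecognized verdict: {label!r}")
    return label
```

Raising on anything other than the two expected labels keeps a malformed response from being silently treated as safe.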

## Configuration

Set the following environment variables before running:

```bash
# API Configuration
export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
export DEEPSEEK_API_KEY="YOUR_DEEPSEEK_API_KEY"
export DEEPSEEK_BASE_URL="YOUR_DEEPSEEK_BASE_URL"

# Memory Tree Configuration
export MEMORY_SYSTEM_TYPE="memory_tree"
export USE_DYNAMIC_RULE="true"
export USE_STATIC_RULE="false"
export USE_BENIGN_HARMFUL_RULE="true"
export ENABLE_SAFETY_PROJECTION="true"
export USE_SINGLE_BRANCH_MODE="false"
export USE_EXTERNAL_LLM_JUDGMENT="true"
export SKIP_FIRST_LLM_JUDGMENT="false"
export FIRST_LLM_OUTPUT_FORMAT="label"  # or "json"
```
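
For reference, boolean-style variables like these could be read in Python roughly as follows. This is a sketch under assumptions: the helper names `env_flag` and `load_config`, the set of accepted truthy strings, and the fallback defaults (which mirror the values shown above) are illustrative and may not match how `proxy_server.py` parses its environment.

```python
import os
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}  # assumed set of accepted truthy strings


def env_flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Interpret an environment variable as a boolean flag."""
    raw = env.get(name)
    return default if raw is None else raw.strip().lower() in _TRUTHY


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the Memory Tree settings into a plain dict."""
    env = os.environ if env is None else env
    output_format = env.get("FIRST_LLM_OUTPUT_FORMAT", "label").strip().lower()
    if output_format not in {"label", "json"}:
        raise ValueError(f"unsupported FIRST_LLM_OUTPUT_FORMAT: {output_format!r}")
    return {
        "memory_system_type": env.get("MEMORY_SYSTEM_TYPE", "memory_tree"),
        "use_dynamic_rule": env_flag(env, "USE_DYNAMIC_RULE", True),
        "use_static_rule": env_flag(env, "USE_STATIC_RULE", False),
        "use_benign_harmful_rule": env_flag(env, "USE_BENIGN_HARMFUL_RULE", True),
        "enable_safety_projection": env_flag(env, "ENABLE_SAFETY_PROJECTION", True),
        "use_single_branch_mode": env_flag(env, "USE_SINGLE_BRANCH_MODE", False),
        "use_external_llm_judgment": env_flag(env, "USE_EXTERNAL_LLM_JUDGMENT", True),
        "skip_first_llm_judgment": env_flag(env, "SKIP_FIRST_LLM_JUDGMENT", False),
        "first_llm_output_format": output_format,
    }
```

Calling `load_config()` with no argument reads from `os.environ`; passing a dict makes the parsing easy to check in isolation.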

## Usage

### Start Proxy Server

```bash
python proxy_server.py
```

### Run AgentHarm Experiments

```bash
# Parallel execution for multiple models
bash run_agentharm_parallel.sh

# Single model execution
bash run_agentharm_single.sh
```

### Run Classification Experiments

```bash
bash run_classification_experiments.sh
```

### Run Safety Projector Ablation

```bash
bash run_safety_projector_ablation.sh
```

## Experiments

### Classification Experiment

Evaluates binary classification (safe/unsafe) across multiple models with and without retrieved rules.

### Attack Experiment

Evaluates model robustness against adversarial prompts (original vs. mutated).

### Safety Projector Ablation

Studies the effect of `margin` and `lambda` hyperparameters on classification performance.

## Metrics

- **ASR (Attack Success Rate)**: Percentage of harmful queries that bypass the safety system (lower is better)
- **FRR (False Refusal Rate)**: Percentage of benign queries that are incorrectly refused (lower is better)
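
Taken literally, the two definitions above reduce to simple ratios over labeled outcomes. The sketch below assumes each result is a hypothetical `(is_harmful, was_refused)` pair; it does not reproduce the grading functions in `benchmark/`, which may score outcomes differently.

```python
from typing import Iterable, Tuple


def asr_and_frr(results: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Compute (ASR, FRR) as percentages from (is_harmful, was_refused) pairs."""
    harmful_total = harmful_bypassed = 0
    benign_total = benign_refused = 0
    for is_harmful, was_refused in results:
        if is_harmful:
            harmful_total += 1
            if not was_refused:
                harmful_bypassed += 1  # harmful query got through
        else:
            benign_total += 1
            if was_refused:
                benign_refused += 1  # benign query wrongly refused
    # Report 0.0 when a class is empty to avoid dividing by zero.
    asr = 100.0 * harmful_bypassed / harmful_total if harmful_total else 0.0
    frr = 100.0 * benign_refused / benign_total if benign_total else 0.0
    return asr, frr
```

For example, with two harmful queries of which one gets through, and four benign queries of which one is refused, this returns an ASR of 50.0 and an FRR of 25.0.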

## License

See the LICENSE file.

## Citation

If you use this code in your research, please cite our work.
