# EAPrivacy: Physical-World Privacy Awareness Benchmark for LLMs

## Overview

**EAPrivacy** is a benchmark that evaluates how well LLM agents handle privacy-sensitive situations in physical environments, using structured spatial representations and multimodal cues.


## Quick Start

### 1. Run Complete Evaluation
Run every tier for one model, setting the number of scenario variations with `--num_variations`:

```bash
bash scripts/run_all_evaluations.sh --model_name gpt-4o --num_variations 5
```

**Supported Models:** `gpt-4o`, `gpt-4o-mini`, `gemini-2.5-pro`, `claude-3.5-haiku`, `qwen-32b`, `llama-3.3-70b`, and others (see `llms.py` for the full list).
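
To evaluate several models in one pass, the command above can be wrapped in a small loop. The following is a minimal sketch, not part of the repository: `build_command` and `run_sweep` are hypothetical helper names, and it assumes the script is called from the repository root with only the two flags shown above.

```python
import subprocess

SCRIPT = "scripts/run_all_evaluations.sh"  # path taken from the command above


def build_command(model_name: str, num_variations: int = 5) -> list[str]:
    """Return the argv list for one full evaluation run (hypothetical helper)."""
    return [
        "bash", SCRIPT,
        "--model_name", model_name,
        "--num_variations", str(num_variations),
    ]


def run_sweep(models: list[str], num_variations: int = 5,
              dry_run: bool = True) -> list[list[str]]:
    """Run (or, with dry_run=True, only print) one evaluation per model."""
    commands = [build_command(m, num_variations) for m in models]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sweep at the first failing model.
            subprocess.run(cmd, check=True)
    return commands


# Dry run: prints the commands without calling any model API.
run_sweep(["gpt-4o", "gpt-4o-mini"])
```

Pass `dry_run=False` once the printed commands look right. Building each command as a list avoids shell-quoting problems with model names.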

### 2. Generate Results Summary
Analyze and summarize evaluation results across all tiers:

```bash
uv run scripts/summarize_results.py --format markdown
```

**Output Formats:** `console`, `markdown`, `latex` (selected with `--format`)

## Installation

```bash
# Install dependencies using uv (recommended)
uv sync

# Or using pip
pip install -r requirements.txt
```

## Environment Setup

Create a `.env` file with your API keys:
```bash
OPENAI_API_KEY=your_openai_key
GEMINI_API_KEY=your_gemini_key
# Add other model API keys as needed
```
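
Before launching a long run, it can help to confirm that the `.env` file actually contains the keys you expect. The following is a minimal, standard-library-only sketch; `parse_env` and `missing_keys` are hypothetical helpers, and the required set is an assumption limited to the two keys shown above (other models may need other keys). It does not reflect how the repository itself loads the file.

```python
from pathlib import Path

REQUIRED_KEYS = ("OPENAI_API_KEY", "GEMINI_API_KEY")  # the two keys shown above


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(text: str, required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty."""
    env = parse_env(text)
    return [k for k in required if not env.get(k)]


env_path = Path(".env")
if env_path.exists():
    print("Missing keys:", missing_keys(env_path.read_text()) or "none")
```

Adjust `REQUIRED_KEYS` to match the models you plan to evaluate.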

## Advanced Usage

### Run Individual Tiers
```bash
# Tier 1: Sensitive object identification
python tier1.py --model_name gpt-4o --num_variations 5

# Tier 2: Privacy in shifting environments
python tier2.py --model_name gpt-4o --evaluation_mode rating

# Tier 4: High-stakes ethical dilemmas
python tier4.py --model_name gpt-4o --evaluation_mode selection
```

### Custom Evaluation Options
- `--num_variations`: Number of scenario variations per test case
- `--evaluation_mode`: `rating` or `selection`, for tiers that support it (Tier 2 and Tier 4 in the examples above)
- `--get_reasoning`: Collect the model's reasoning traces alongside its answers
- `--skip_rerun`: Analyze existing results without re-running models
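
To illustrate how these options combine, here is a hedged sketch of an equivalent `argparse` parser. It is not the repository's own parser: the defaults, and the treatment of `--get_reasoning` and `--skip_rerun` as boolean switches, are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the options listed above (assumed shapes)."""
    parser = argparse.ArgumentParser(description="Illustrative option parser")
    parser.add_argument("--model_name", required=True)
    # Default of 5 is an assumption, chosen to match the examples above.
    parser.add_argument("--num_variations", type=int, default=5)
    # Default mode is an assumption; only the two choices come from the README.
    parser.add_argument("--evaluation_mode", choices=["rating", "selection"],
                        default="rating")
    # Assumed to be on/off switches that take no value.
    parser.add_argument("--get_reasoning", action="store_true")
    parser.add_argument("--skip_rerun", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--model_name", "gpt-4o", "--evaluation_mode", "selection", "--get_reasoning"]
)
print(args)
```

Run a tier script with `--help` to see the options it actually accepts.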
