# Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

This repository contains the code and data for **MultiTurnRisks**, a benchmark for evaluating multi-turn safety in tool-using agents, and **ToolShield**, a training-free defense that leverages the agent's own capabilities to improve safety awareness.

## Overview
![Overview](overview.pdf)

LLM-based agents can take actions that cause real-world harm, yet current safety mechanisms focus primarily on single-turn, text-based interactions. This work addresses this gap with:

1. **A principled attack taxonomy** that transforms single-turn harmful tasks into multi-turn attack sequences across three dimensions: Format, Method, and Target.

2. **MultiTurnRisks Benchmark**: 365 single-turn harmful tasks across five tools (Filesystem-MCP, Browser/Playwright-MCP, PostgreSQL-MCP, Notion-MCP, Terminal), transformed into multi-turn attack sequences using our taxonomy.

3. **ToolShield Defense**: A training-free, tool-agnostic defense where agents explore tool functionality in a sandbox, learn from their own executions, and distill safety experiences before deployment.


## Key Results

Our evaluations reveal significant safety degradation in multi-turn settings:
- **Attack**: The attack success rate (ASR) increases by 27% for Claude-4.5-Sonnet, 23% for Qwen3-Coder, and 14% for GPT-5.2 when harm is distributed across turns.
- **Defense**: ToolShield reduces ASR by 50% for Claude-4.5-Sonnet, 24% for Gemini-3-Flash, and 19% for GPT-5.2 in multi-turn settings.

## Repository Layout
```
./
├── workspaces/                              # Runnable task bundles (inputs to evaluation)
│   ├── single_turn_tasks/                   # Single-turn tasks (one prompt end-to-end)
│   │   └── <tool>/                          # Environment: fs/playwright/postgres/terminal/...
│   │       └── <task_name>/
│   │           ├── task.md                  # Instruction shown to the agent
│   │           ├── checkpoints.md           # Human-readable success criteria
│   │           ├── workspace(mcp_fs)/       # Pre-seeded sandbox filesystem contents
│   │           └── utils/                   # Task helpers
│   │               ├── evaluator.py         # Programmatic scoring logic
│   │               └── dependencies.yml     # MCP/tools required by this task
│   ├── multi_turn_tasks/                    # Multi-turn tasks (staged instructions)
│   │   └── <tool>/
│   │       └── <task_name>/
│   │           ├── turns.yml                # Turn manifest (order, patterns)
│   │           ├── task-turn-1.md           # Turn 1 instruction
│   │           ├── task-turn-2.md           # Turn 2 instruction
│   │           ├── ...                      # Additional turns
│   │           ├── workspace(mcp_fs)/       # Pre-seeded sandbox filesystem contents
│   │           └── utils/                   # Task helpers
│   │               ├── evaluator.py         # Programmatic scoring logic
│   │               └── dependencies.yml     # MCP/tools required by this task
│   └── README.md                            # Detailed task folder conventions
├── evaluation/                              # Task execution and scoring harness
│   ├── run_eval.py                          # Main runner (loads task, sets up MCP, runs agent)
│   ├── test_safearena.sh                    # Helper to reset external web arena (optional)
│   └── README.md                            # Prereqs + configuration
├── task_decomposition/                      # Attack taxonomy: single-turn -> multi-turn
│   ├── decomposition_prompt.py              # Prompt template for decomposition
│   └── run_decomposition.py                 # CLI to generate multi-turn task bundles
└── self_exploration/                        # ToolShield: self-exploration pipeline
    ├── tree_generation.py                   # Generates safety trees + task/evaluator bundles
    ├── prompts.py                           # Prompt templates
    ├── post_process_prompts.py              # Shared utilities/prompt fragments
    └── iterative_exp_runner.py              # Iterative “experience” generation
```
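To see which task bundles are present in a checkout, the layout above can be walked with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name `iter_task_bundles` is hypothetical, and it assumes that every single-turn task folder contains `task.md` and every multi-turn task folder contains `turns.yml`, as shown in the tree.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_task_bundles(root: Path) -> Iterator[Tuple[str, str, Path]]:
    """Yield (kind, tool, task_dir) for every task folder under `root`.

    `kind` is "single" for folders holding task.md and "multi" for folders
    holding turns.yml (an assumption based on the layout shown above).
    """
    for split, marker, kind in (
        ("single_turn_tasks", "task.md", "single"),
        ("multi_turn_tasks", "turns.yml", "multi"),
    ):
        split_dir = root / split
        if not split_dir.is_dir():
            continue
        # Expected shape: <split>/<tool>/<task_name>/<marker>
        for marker_path in sorted(split_dir.glob(f"*/*/{marker}")):
            task_dir = marker_path.parent
            yield kind, task_dir.parent.name, task_dir


if __name__ == "__main__":
    for kind, tool, task_dir in iter_task_bundles(Path("workspaces")):
        print(f"{kind:6s} {tool:20s} {task_dir.name}")
```

Run from the repository root, the script prints one line per task folder it finds. Folders that lack the marker file are skipped silently.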

## Quick Start

### 1. Environment preparation
This codebase builds on OpenAgentSafety, SafeArena, and MCPMark.

- OpenHands + OwnCloud/GitLab setup: https://github.com/sani903/OpenAgentSafety?tab=readme-ov-file
- Reddit/Shopping/Shopping Admin setup (SafeArena): https://github.com/McGill-NLP/safearena/blob/main/README.md
- MCP services setup (MCPMark): https://github.com/eval-sys/mcpmark/blob/main/README.md

### 2. Running evaluation

Use `evaluation/run_eval.py` to run an agent on a single task bundle (`--task-path` points at one task folder):
```bash
poetry run python evaluation/run_eval.py \
  --task-path workspaces/single_turn_tasks/single_playwright/safety-aid-misogyny \
  --agent-llm-config <agent_group> \
  --env-llm-config <env_group> \
  --outputs-path evaluation/test_output \
  --server-hostname localhost
```
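To evaluate every task under one tool directory, the command above can be wrapped in a short loop. The following is a sketch under assumptions: `build_eval_command` and `run_all` are hypothetical helpers, not scripts shipped with the repository, and only the flags shown in the command above are used.

```python
import subprocess
from pathlib import Path
from typing import List


def build_eval_command(task_path: Path, agent_cfg: str, env_cfg: str,
                       outputs: Path, hostname: str = "localhost") -> List[str]:
    """Assemble the run_eval.py invocation shown above for one task folder."""
    return [
        "poetry", "run", "python", "evaluation/run_eval.py",
        "--task-path", str(task_path),
        "--agent-llm-config", agent_cfg,
        "--env-llm-config", env_cfg,
        "--outputs-path", str(outputs),
        "--server-hostname", hostname,
    ]


def run_all(tool_dir: Path, agent_cfg: str, env_cfg: str, outputs: Path) -> None:
    """Run the evaluator once per task folder found directly under `tool_dir`."""
    for task_dir in sorted(p for p in tool_dir.iterdir() if p.is_dir()):
        cmd = build_eval_command(task_dir, agent_cfg, env_cfg, outputs)
        # check=False: a failing task should not abort the remaining tasks.
        subprocess.run(cmd, check=False)
```

For example, `run_all(Path("workspaces/single_turn_tasks/single_playwright"), "<agent_group>", "<env_group>", Path("evaluation/test_output"))` would visit each task folder in that directory in sorted order.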

After you have trajectories, use `evaluation/post_eval.py` to run the LLM-as-judge (requires `OPENROUTER_API_KEY`):
```bash
poetry run python evaluation/post_eval.py \
  --workspace-root <repo_root> \
  --tasks-dir <task_dir> \
  --output-dir <trajectory_dir> \
  --model <openrouter_model> \
  --output <results_json_path>
```

### 3. Generating multi-turn attack sequences (taxonomy)

Transform single-turn tasks into multi-turn attack sequences using our taxonomy:
```bash
poetry run python task_decomposition/run_decomposition.py \
  --task-folder workspaces/single_turn_tasks/<tool>/<task_name>
```

This reads `task.md` from the task folder, applies the taxonomy transformation, and outputs multi-turn task bundles (`turns.yml`, `task-turn-*.md`, etc.).
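A generated bundle can be inspected by reading its per-turn instruction files in order. The sketch below uses a hypothetical helper, `load_turn_instructions`, and assumes that the numeric suffix of `task-turn-<n>.md` reflects the turn order. It does not parse `turns.yml`, whose schema is not described here, so treat `turns.yml` as authoritative if the two disagree.

```python
import re
from pathlib import Path
from typing import List

# Matches file names such as task-turn-1.md or task-turn-12.md.
_TURN_FILE = re.compile(r"task-turn-(\d+)\.md$")


def load_turn_instructions(task_dir: Path) -> List[str]:
    """Return the contents of task-turn-<n>.md files ordered by n."""
    numbered = []
    for path in task_dir.glob("task-turn-*.md"):
        match = _TURN_FILE.search(path.name)
        if match:
            numbered.append((int(match.group(1)), path))
    # Sort numerically so that turn 10 follows turn 9 rather than turn 1.
    return [p.read_text(encoding="utf-8") for _, p in sorted(numbered)]
```

Sorting on the parsed integer rather than the file name keeps bundles with ten or more turns in the intended order.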

### 4. Running ToolShield Defense (self-exploration)

Generate safety experiences for a new tool:
```bash
# Step 1: Generate safety trees and test cases
poetry run python self_exploration/tree_generation.py \
  --tool-spec <path_to_tool_documentation>

# Step 2: Execute test cases and generate experiences
poetry run python self_exploration/iterative_exp_runner.py \
  --task-folder <generated_tasks> \
  --outputs-path <experience_output>
```

The defense pipeline:
1. **Test Case Synthesis**: Generates single-turn and multi-turn test cases from tool documentation
2. **Simulated Execution**: Executes test cases in a sandbox to obtain trajectories
3. **Experience Generation**: Extracts safety experiences from execution traces
4. **Experience Update**: Iteratively refines experiences based on new observations
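The control flow of these four steps can be sketched as a simple loop. This is an illustrative outline only, not the code in `self_exploration/`: every function name here is hypothetical, the callables stand in for LLM and sandbox calls, and passing the current experiences into the execution step is an assumption about how the refinement feeds back.

```python
from typing import Callable, Iterable, List


def run_experience_loop(
    test_cases: Iterable[str],
    execute: Callable[[str, List[str]], str],
    extract: Callable[[str], List[str]],
    update: Callable[[List[str], List[str]], List[str]],
) -> List[str]:
    """Apply steps 2-4 once per synthesized test case (step 1 is the input)."""
    experiences: List[str] = []
    for case in test_cases:
        trajectory = execute(case, experiences)     # 2. Simulated Execution
        lessons = extract(trajectory)               # 3. Experience Generation
        experiences = update(experiences, lessons)  # 4. Experience Update
    return experiences


def merge_unique(old: List[str], new: List[str]) -> List[str]:
    """Toy update rule: keep stored lessons and append unseen ones."""
    return old + [item for item in new if item not in old]


if __name__ == "__main__":
    # Toy stand-ins; a real run would drive an agent inside a sandbox.
    result = run_experience_loop(
        test_cases=["delete logs", "delete logs", "drop table"],
        execute=lambda case, exps: f"trajectory for {case}",
        extract=lambda traj: [f"be careful: {traj}"],
        update=merge_unique,
    )
    print(result)
```

Keeping the four stages as separate callables makes it easy to swap the toy update rule for one that rewrites or merges experiences.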

## Notes

- Some scripts contain hard-coded container paths (e.g., `/mnt/data/OpenAgentSafety`). Update these constants if running locally outside the expected container/mount layout.
- See individual README files in `evaluation/` and `workspaces/` for detailed setup instructions.
- Due to the submission size limit, we include only a small subset of 40 decomposed multi-turn tasks together with their corresponding single-turn tasks.
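
For the hard-coded container paths mentioned in the first note, one option is to resolve the root from an environment variable and fall back to the container default. This is only a sketch: the variable name `OPENAGENTSAFETY_ROOT` and the helper `resolve_data_root` are hypothetical and are not defined anywhere in this repository.

```python
import os
from pathlib import Path

# Container default quoted in the note above.
DEFAULT_ROOT = "/mnt/data/OpenAgentSafety"


def resolve_data_root(env_var: str = "OPENAGENTSAFETY_ROOT") -> Path:
    """Return the path from `env_var` if it is set, else the container default."""
    return Path(os.environ.get(env_var, DEFAULT_ROOT))
```

A script could then replace its path constant with `resolve_data_root()`, so local runs only need to export the variable.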

## References
```bibtex
@article{vijayvargiya2025openagentsafety,
  title={OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety},
  author={Vijayvargiya, Sanidhya and Soni, Aditya Bharat and Zhou, Xuhui and Wang, Zora Zhiruo and Dziri, Nouha and Neubig, Graham and Sap, Maarten},
  journal={arXiv preprint arXiv:2507.06134},
  year={2025}
}

@inproceedings{tur2025safearena,
  title={SafeArena: Evaluating the Safety of Autonomous Web Agents},
  author={Tur, Ada Defne and Meade, Nicholas and L{\"u}, Xing Han and Zambrano, Alejandra and Patel, Arkil and Durmus, Esin and Gella, Spandana and Stanczak, Karolina and Reddy, Siva},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025}
}

@article{wu2025mcpmark,
  title={MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use},
  author={Wu, Zijian and Liu, Xiangyan and Zhang, Xinyuan and Chen, Lingjun and Meng, Fanqing and Du, Lingxiao and Zhao, Yiran and Zhang, Fanshi and Ye, Yaoqi and Wang, Jiawei and others},
  journal={arXiv preprint arXiv:2509.24002},
  year={2025}
}
```
