# ToolEmu: Identifying the Risks of LM Agents with an LM-Emulated Sandbox

<div align="center">
  <img src="assets/figures/ToolEmu.png" width="50%">
</div>


<br>


Recent advances in Language Model (LM) agents and tool use, exemplified by applications like ChatGPT Plugins, enable a rich set of capabilities but also amplify potential risks—such as leaking private data or causing financial losses. Identifying these risks is labor-intensive, necessitating implementing the tools, manually setting up the environment for each test scenario, and finding risky cases. As tools and agents become more complex, the high cost of testing these agents will make it increasingly difficult to find high-stakes, long-tailed risks.
ToolEmu is an LM-based emulation framework that enables identifying and assessing such risks at scale, facilitating the development of safter LM agents.


This repo contains the code for:
* [testing LM agents with specific test cases in emulation](#running-specific-test-cases-in-emulation)
* [evaluating LM agents with our automatic evaluators and curated benchmark](#evaluating-lm-agents-with-our-benchmark)


The flexibility of ToolEmu makes it easy to curate new toolkits and test cases for testing LM agents.


## ToolEmu

ToolEmu assists in rapidly identifying realistic failures of LM agents across various tools and scenarios within an LM-emulated environment and facilitates the development of safer LM agents with LM-automated evaluations. It consists of 3 main components:

- **Tool Emulators**: ToolEmu uses a strong LM (e.g. GPT-4) to emulate the execution of tools in a virtual sandbox using only their specifications and inputs, without needing their implementations. This allows for faster prototyping of LM agents across different scenarios, while accommodating the evaluation of high-stakes tools that may lack existing APIs or sandbox implementations.
- **Safety & Helpfulness Evaluators**: To support scalable and quantitative risk assessments, ToolEmu includes an LM-based safety evaluator to automate the identification of potential failures caused by LM agents and quantifies the associated risk severities. To capture the potential tradeoff between safety and effectiveness, ToolEmu also includes an LM-based helpfulness evaluator.
- **Curated Benchmark**: ToolEmu ships with an initial benchmark covering 36 toolkits (311 tools) and 144 test cases for an quantitative evaluation of LM agents across various tools and scenarios. The scalablility of ToolEmu allows expanding it to more tools and scenarios.



<div align="center">
  <img src="assets/figures/Framework.jpg" width="95%">
</div>


## Setup

### Installation
To run our code, we require the installation of another package called PromptCoder. This pacakge is used to manage our [system of prompts](./toolemu/prompts/) in a modularized manner. Please note that this package is still in development.

We suggest you install the package using pip in editable mode, which means that any changes you make to the code will be instantly effective without needing to reinstall the package. To install the packages, run the following commands:
```bash
# Install the packages
cd PromptCoder
pip install -e .
cd ../ToolEmu
pip install -e .
```

### Set up API keys
After installation, you need to set up your OpenAI or Claude API keys. You can do this by creating a file named `.env` in the project directory, and then inputting your keys into this file as follows:
```bash
OPENAI_API_KEY=[YOUR_OPENAI_KEY]
```
If you want to run the `Claude` model, the `ANTHROPIC_API_KEY` is also required.

## Quick Start
### Running specific test cases in emulation
[[Run in notebook]](./notebooks/emulate.ipynb)

To begin, we offer a [notebook](./notebooks/emulate.ipynb) where you can select and run cases from our extensive curated dataset and have granular control over the setup. 
Detailed instructions are provided within.


### Evaluating LM agents with our benchmark
To evaluate a specific LM agent within our curated benchmark consisting of [144 test cases](./assets/all_cases.json) and 36 toolkits in the [`assets/`](./assets/all_toolkits.json) folder, run the following command:
```bash
python scripts/run.py
```
The script will execute the agent in our emulator (with [`scripts/emulate.py`](scripts/emulate.py)), and then evaluate the emulated trajectories (with [`scripts/evaluate.py`](scripts/evaluate.py)). 
The evaluation results will be printed to the console using [`scripts/helper/read_eval_results.py`](scripts/helper/read_eval_results.py).
To evaluate with a specific setup, specify the following arguments:
- `--agent-model`: The base model for the agent, default `gpt-4-0613`.
- `--agent-temperature`: The temperature of the agent, default 0.
- `--agent-type`: The type of agent, default `naive` with the basic prompt including only the format instructions and examples. Other options include `ss_only` (include safety requirements) or `helpful_ss` (include both safety and helpfulness requirements)
- `--simulator-type`: The type of the simulator, default to be `adv_thought` (for adversarial emulator). Another option is `std_thought` (for standard emulator).
- `--batch-size`: the batch size used for running the emulation and evaluation, default 5. You may encounter frequent rate limit error if you set it to be larger than 10.

Note that the cost for running and evaluating a test case is about **$1.2**, totalling ~$170 for running the entire dataset. 
To evaluate a subset of the test cases, you can specify the number of cases (`--trunc-num`) to run. For example, setting it to 10 will only run the first 10 test cases (after random shuffle with `--shuffle`).

For a detailed control over the pipeline, please refer to the [scripts/](scripts/README.md) folder.
