# Scripts
This folder contains scripts for executing and evaluating the LM agent on the test cases in our emulator.

## Pipeline
The complete pipeline for testing the safety and helpfulness of the LM agent in our emulator consists of the following steps:
1. Generation [optional]: Generate a set of toolkit specifications and test cases.
2. Execution: Execute the LM agent on the test cases in our emulator.
3. Evaluation: Evaluate the resulting execution trajectories using our automatic evaluators.

## Setup
You need to set up your `OPENAI_API_KEY` in a `.env` file as follows:
```bash
OPENAI_API_KEY=[YOUR_OPENAI_KEY]
```
You can get your API key from [OpenAI](https://platform.openai.com/account/api-keys).
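
If you want to confirm that the key is in place before starting a long run, a small check like the one below may help. This is only a sketch: it assumes the `.env` file uses plain `KEY=VALUE` lines, and the helper names `read_env_file` and `has_openai_key` are hypothetical (they are not part of this repository).

```python
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blank lines and comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def has_openai_key(path=".env"):
    """Return True if OPENAI_API_KEY is set to something other than the placeholder."""
    key = read_env_file(path).get("OPENAI_API_KEY", "")
    return bool(key) and key != "[YOUR_OPENAI_KEY]"
```

Calling `has_openai_key()` from the directory that holds your `.env` file returns `False` if the file is missing or still contains the placeholder value.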

## Execution
You can use the [`emulate.py`](emulate.py) script to execute the LM agent on the test cases. You may specify the following arguments:
- `--agent-model`: The base model for the agent, default `gpt-4-0613`.
- `--agent-temperature`: The temperature of the agent, default 0.
- `--agent-type`: The type of agent, default `naive` (a basic prompt that includes only the format instructions and examples). Other options include `ss_only` (adds safety requirements) and `helpful_ss` (adds both safety and helpfulness requirements).
- `--simulator-type`: The type of simulator, default `adv_thought` (the adversarial emulator). Another option is `std_thought` (the standard emulator).
- `--batch-size`: The batch size used for running the emulation and evaluation, default 5. You may encounter frequent rate limit errors if you set it larger than 10.
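
For reference, the options above and their defaults can be mirrored with a small `argparse` parser. This is a sketch of the documented interface only, not the parser that `emulate.py` actually uses; the function name `build_parser` and the argument types are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options documented above (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of the documented emulate.py options")
    parser.add_argument("--agent-model", default="gpt-4-0613",
                        help="base model for the agent")
    parser.add_argument("--agent-temperature", type=float, default=0.0,
                        help="temperature of the agent")
    parser.add_argument("--agent-type", default="naive",
                        help="e.g. naive, ss_only, helpful_ss")
    parser.add_argument("--simulator-type", default="adv_thought",
                        help="e.g. adv_thought, std_thought")
    parser.add_argument("--batch-size", type=int, default=5,
                        help="values above 10 may trigger rate limit errors")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Parsing an empty argument list yields the defaults listed above, and any flag passed on the command line overrides its default.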

Moreover, if you want to run the agent on a subset of test cases, you may specify the following arguments:
- `--start-index`: The start index of the test cases to run, default 0.
- `--trunc-num`: The maximum number of test cases to run, counted from the start index. Defaults to the total number of test cases.
- `--shuffle`: Whether to shuffle the test cases before running, default `False`.
- `--selected-indexes`: A list of indexes of test cases to run, default `None`.
- `--removed-indexes`: A list of indexes of test cases to exclude, default `None`.

See the following examples:
```bash
python scripts/emulate.py --start-index 15 --trunc-num 10 # run 10 test cases from index 15 to 24 (inclusive)
python scripts/emulate.py --selected-indexes 1 3 5 # run test cases with indexes 1, 3, and 5
python scripts/emulate.py --shuffle --trunc-num 10 # run 10 random test cases
```
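
The sketch below shows one plausible way these subset options could combine. The order of operations (shuffle, then select, then remove, then slice) is an assumption made for illustration, and `select_indexes` and its `seed` parameter are hypothetical; check `emulate.py` for the actual behavior.

```python
import random


def select_indexes(total, start_index=0, trunc_num=None, shuffle=False,
                   selected_indexes=None, removed_indexes=None, seed=None):
    """Return the test case indexes to run (assumed semantics, not the real script)."""
    indexes = list(range(total))
    if shuffle:
        # A seeded generator keeps the shuffle reproducible in this sketch.
        random.Random(seed).shuffle(indexes)
    if selected_indexes is not None:
        chosen = set(selected_indexes)
        indexes = [i for i in indexes if i in chosen]
    if removed_indexes is not None:
        removed = set(removed_indexes)
        indexes = [i for i in indexes if i not in removed]
    # Keep at most trunc_num cases, counted from start_index.
    end = None if trunc_num is None else start_index + trunc_num
    return indexes[start_index:end]
```

Under these assumptions, `select_indexes(100, start_index=15, trunc_num=10)` gives the indexes 15 through 24, which matches the first example above.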

The path of the output file will be printed out after the execution. Let's assume the output file is `dumps/output.jsonl` for this example.
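
If you want to inspect the trajectories yourself, the output can be loaded with a few lines of Python. This sketch assumes only that the `.jsonl` file holds one JSON value per line; the fields inside each record are not described here, so none are assumed. The helper name `load_jsonl` is hypothetical.

```python
import json


def load_jsonl(path):
    """Load a JSON Lines file into a list, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

For example, `len(load_jsonl("dumps/output.jsonl"))` tells you how many records were written.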

## Evaluation
You can use the [`evaluate.py`](evaluate.py) script to evaluate the resulting execution trajectories. You may specify the following arguments:
- `--eval-type`: The type of evaluation, default `agent_safe` (the automatic safety evaluator, which assesses whether the LM agent has undertaken any risky actions). Other options include `agent_help` (the automatic helpfulness evaluator, which assesses how closely the LM agent's actions align with the ideal behavior for safely carrying out the user instruction).
- `--input-path`: The path to the input file containing the execution trajectories. For example, `dumps/output.jsonl` from the previous step.

Then, you can use the [`read_eval_results.py`](helper/read_eval_results.py) script to read the evaluation results as follows:
```bash
python scripts/helper/read_eval_results.py dumps/output # provide the prefix and the script will read all evaluation results matched
```
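
To see which files a given prefix would match, you can list them yourself. The sketch below assumes that evaluation result files are written next to the trajectory file and share its prefix; the exact naming scheme is not documented here, and `find_eval_results` is a hypothetical helper rather than the logic used by `read_eval_results.py`.

```python
import glob


def find_eval_results(prefix):
    """List files that start with the prefix, excluding the trajectory file itself."""
    trajectory_file = prefix + ".jsonl"
    # glob.escape guards against special characters such as [ or * in the prefix.
    matches = glob.glob(glob.escape(prefix) + "*")
    return sorted(p for p in matches if p != trajectory_file)
```

For example, `find_eval_results("dumps/output")` returns every file in `dumps/` whose name starts with `output`, apart from `dumps/output.jsonl`.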

## Run Both Execution and Evaluation
We have provided a script `run.py` to run the pipeline. The `run.py` script will execute the agent in our emulator (with [`scripts/emulate.py`](scripts/emulate.py)), and then evaluate the emulated trajectories (with [`scripts/evaluate.py`](scripts/evaluate.py)). The evaluation results will be printed to the console using [`scripts/helper/read_eval_results.py`](scripts/helper/read_eval_results.py).
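
If you have already run `emulate.py` and only want to chain the remaining steps, the commands can be assembled as in the sketch below. It uses only the flags documented above and assumes you run it from the repository root. The helper `build_evaluation_commands` is hypothetical and is not how `run.py` is implemented.

```python
def build_evaluation_commands(output_path, eval_types=("agent_safe", "agent_help")):
    """Build the evaluation and result-reading commands for one trajectory file."""
    commands = []
    for eval_type in eval_types:
        commands.append([
            "python", "scripts/evaluate.py",
            "--input-path", output_path,
            "--eval-type", eval_type,
        ])
    # The result reader takes the path without the .jsonl extension as its prefix.
    suffix = ".jsonl"
    prefix = output_path[:-len(suffix)] if output_path.endswith(suffix) else output_path
    commands.append(["python", "scripts/helper/read_eval_results.py", prefix])
    return commands


if __name__ == "__main__":
    for command in build_evaluation_commands("dumps/output.jsonl"):
        print(" ".join(command))
```

Each returned command is a list of arguments, so you can pass it to `subprocess.run(command, check=True)` to execute the steps in order.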

## Generation (optional)
We will soon provide scripts that help you curate your own toolkit specifications and test cases.
