# Code for the submission "TaskBench: Benchmarking Large Language Models for Task Automation"

All the datasets and evaluation scripts used in the paper. With this repository, you can reproduce the experiments from our paper and also evaluate the task automation capabilities of other large language models.
## Setup

```bash
pip install -r requirements.txt
```

> There may still be some dependencies missing. Please install them dynamically according to the message displayed during execution.

Additionally, if you wish to evaluate open-source large language models, you will also need to deploy the LLMs locally using an **OpenAI-compatible API**. We recommend using the `fastchat` tool to deploy the service to the `localhost:8000` endpoint.

```bash
pip install fastchat
pip install vllm
pip install "fastapi[all]"

python3 -m fastchat.serve.controller
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.3
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```

## Generate the Dataset

We provide datasets for three domains: Hugging Face Tools (`data_huggingface`), Multimedia Tools (`data_multimedia`), and Daily Life APIs (`data_dailylifeapis`). If you want to generate your own dataset, first build the tool library and use the script generate_graph.py to generate the tool graph. Then, run:

```bash
# specify the graph and tool description file
python data_engine.py \
    --graph_desc data_multimedia/graph_desc.json \
    --tool_desc data_multimedia/tool_desc.json \
    --llm gpt-4 \
    --temperature 1.0 \
    --top_p 1.0 \
    --ignore_tool_type false \
    --save_figure false \
    --api_addr localhost \
    --api_port 8000 \
    --check true \
    --use_async true \
    --multiworker 5
```

Some samples in the dataset generated by GPT-4 have formatting errors. Therefore, we need to perform additional formatting to filter or transform the incorrectly formatted samples.

```bash
python formulate.py \
    --data_dir data_multimedia \
    --ignore_tool_type false
```

## Inference

For convenience, it is recommended to deploy all LLMs to the same endpoint, such as `localhost:8000`. To generate the prediction file on TaskBench, specify the name of the LLM using the following command:

```bash
python inference.py \
    --llm gpt-4 \
    --data_dir data_multimedia \
    --temperature 0.2 \
    --top_p 0.1 \
    --api_addr localhost \
    --api_port 8000 \
    --multiworker 5 \
    --use_demos 0 \
    --reformat true \
    --reformat_by self \
    --log_first_detail true \
    --use_demos 2 \
    --ignore_tool_type false \
    --tag true
```

## Evaluation

With the predictions in place, you can now evaluate the large language model. The predictions file is saved by default in the dataset's folder under the name `predictions`. Execute the following command to calculate the evaluation metrics (saved in the `metrics` folder):

```bash
python evaluate.py \
    --data_dir data_multimedia \
    --prediction_dir $prediction_dir \
    --llm gpt-4 \
    --splits all \
    --n_tools all \
    --mode add \
    --ignore_tool_type false \
    -m all
```