## TaskBench Dataset

### Structure

The TaskBench dataset covers three domains: HuggingFace Tools, Multimedia Tools, and Dailylife APIs. Each dataset directory includes three types of data files:

- `data_formulated.json` is the raw dataset generated by GPT-4. This is the file we submitted in the previous version.
- `data_critics.json` is the dataset obtained after checking and filtering by rule-based and LLM-based critics. It is merged from the `data_formulated.json` and `alignment_ids.json` files, which were submitted in the previous version.
- `data_human.json` is the latest, human-verified version. We invited a dozen human annotators to carefully check and fix the samples to ensure the quality of the dataset.

```
│  README.md
│
├─data_dailylifeapis
│      data_critics.json
│      data_critics_format.json
│      data_formulated.json
│      data_human.json
│      data_human_format.json
│
├─data_huggingface
│      data_critics.json
│      data_critics_format.json
│      data_formulated.json
│      data_human.json
│      data_human_format.json
│
└─data_multimedia
        data_critics.json
        data_critics_format.json
        data_formulated.json
        data_human.json
        data_human_format.json
```
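
The layout above can also be checked programmatically. The following is a minimal sketch, assuming the dataset has been downloaded to a local directory. The `root` argument and the helper names `expected_files` and `missing_files` are hypothetical and not part of TaskBench. The code only builds paths and reports which ones are absent; it makes no assumption about the contents of the JSON files.

```python
from pathlib import Path

# Directory and file names copied from the tree above.
DATASET_DIRS = ["data_dailylifeapis", "data_huggingface", "data_multimedia"]
FILE_NAMES = [
    "data_critics.json",
    "data_critics_format.json",
    "data_formulated.json",
    "data_human.json",
    "data_human_format.json",
]


def expected_files(root):
    """Return the paths implied by the tree above, under a local root.

    `root` is a hypothetical local download location, not a TaskBench setting.
    """
    root = Path(root)
    return [root / d / f for d in DATASET_DIRS for f in FILE_NAMES]


def missing_files(root):
    """Return the expected paths that do not exist under `root`."""
    return [p for p in expected_files(root) if not p.is_file()]


if __name__ == "__main__":
    # Report any file from the tree that is absent in the current directory.
    for path in missing_files("."):
        print(f"missing: {path}")
```

Running the script from the dataset root prints one line per absent file and nothing if the layout is complete.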

### Processing Statistics

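The following tables report statistics for each stage of dataset processing. Percentages in the Overview and critics tables are relative to the total number of samples in each dataset; percentages in the Human Verification table are relative to the number of samples checked by critics.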

#### Overview

| Dataset | #Samples | #Samples Checked by Critics (%) | #Samples Verified by Humans (%) |
| :-----: | :------: | :----------------: | :--------------: |
| HuggingFace Tools | 12,217 | 8,457 (69.22%) | 7,546 (61.76%) |
| Multimedia Tools | 8,904 | 6,281 (70.54%) | 5,584 (62.71%) |
| Dailylife APIs | 7,150 | 5,432 (75.97%) | 4,320 (60.42%) |
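
Each percentage in the Overview table is the corresponding count divided by the total number of samples for that dataset. The sketch below recomputes them from the counts in the table; the dictionary layout and function names are illustrative only. Recomputed values can differ from the published figures in the last decimal place, depending on whether a value was rounded or truncated.

```python
# Counts copied from the Overview table:
# (total samples, checked by critics, verified by humans).
OVERVIEW = {
    "HuggingFace Tools": (12217, 8457, 7546),
    "Multimedia Tools": (8904, 6281, 5584),
    "Dailylife APIs": (7150, 5432, 4320),
}


def percent(part, total):
    """Share of `part` in `total`, expressed as a percentage."""
    return 100.0 * part / total


def overview_percentages(counts=OVERVIEW):
    """Map each dataset to (critics %, human-verified %), both relative to #Samples."""
    return {
        name: (percent(critics, total), percent(human, total))
        for name, (total, critics, human) in counts.items()
    }


if __name__ == "__main__":
    for name, (critics_pct, human_pct) in overview_percentages().items():
        print(f"{name}: critics {critics_pct:.2f}%, humans {human_pct:.2f}%")
```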

#### LLM-based and Rule-based Critics

| Dataset | #Samples | #Checked by LLM-based Critics (%) | #Checked by Rule-based Critics (%) | #Checked by Both Critics (%) |
| :-----: | :------: | :-----------------------------: | :------------------------------: | :-------------------------: |
| HuggingFace Tools | 12,217 | 9,042 (74.01%) | 10,289 (84.22%) | 8,457 (69.22%) |
| Multimedia Tools | 8,904 | 6,959 (78.16%) | 7,363 (82.69%) | 6,281 (70.54%) |
| Dailylife APIs | 7,150 | 5,694 (79.63%) | 6,271 (87.70%) | 5,432 (75.97%) |

#### Human Verification

| Dataset | #Samples Checked by Critics | #Correct Samples (%) | #Discarded (%) | #Fixed for Syntax (%) | #Fixed for Instructions (%) | #Fixed for Tool Invocation Graph (%) |
| :-----: | :-------------------------: | :-------------------: | :-------------------: | :---------------------------: | :-----------------------------------: | :------------: |
| HuggingFace Tools | 8,457 | 6,974 (82.46%) | 911 (10.77%) | 27 (0.32%) | 328 (3.87%) | 843 (9.96%) |
| Multimedia Tools | 6,281 | 5,262 (83.77%) | 697 (11.09%) | 11 (0.17%) | 107 (1.70%) | 526 (9.96%) |
| Dailylife APIs | 5,432 | 4,307 (79.29%) | 714 (13.14%) | 6 (0.11%) | 92 (1.68%) | 332 (6.11%) |