# Project Name

This software project accompanies the anonymized submission to ICLR 2026.

We introduce a novel method for improving the tool-calling accuracy of LLMs. Our approach uses template-based generation in place of the existing schema-constrained generation. Experiments across several datasets and LLMs show that our method improves F1 scores for tool names and parameters on most tests.

## Getting Started

To reproduce the results in the paper, follow these steps.

### Setup

The project is developed and tested with Python 3.10 in a venv.

```commandline
python3.10 -m venv venv
. ./venv/bin/activate
pip install -r requirements.txt
```

The generated data is already provided, so you can skip to [Test on GPT](#test-on-gpt) if you don't want to regenerate it.

### Get original data

Follow the README file of each dataset to obtain the original data files and place them in the corresponding locations:
[API-Bank](./data/api_bank/README.md), [ToolACE](./data/tool_ace/README.md), [When2Call](./data/when2call/README.md)

### Convert the original data

Convert the data from different datasets to a unified format.
```commandline
scripts/convert_api_bank_data_files.sh
scripts/convert_tool_ace_data_files.sh
scripts/convert_when2call_data_files.sh
```
For ToolACE, only the first two turns are used, because the later turns involve conversation for thinking, which is beyond the scope of the paper.

### Test on GPT

Results from GPT-4o and GPT-5 are already provided under the `data/${DATASET}/generated` folder, so you can generate the metrics and check the reports directly.
```commandline
# The key is not used here; any placeholder value works.
export OPENAI_API_KEY="NOT_USED"

scripts/eval_hosted_llm.sh apibank gpt4 src \
  data/api_bank/generated/test-data data/api_bank/generated/test-data level-1-api.json,level-2-api.json
scripts/eval_hosted_llm.sh toolace gpt4 src \
  data/tool_ace/generated data/tool_ace/generated data.json.test
scripts/eval_hosted_llm.sh when2call gpt4 src \
  data/when2call/generated data/when2call/generated when2call_test_mcq.jsonl
  
scripts/eval_hosted_llm.sh apibank gpt5 src \
  data/api_bank/generated/test-data data/api_bank/generated/test-data level-1-api.json,level-2-api.json
scripts/eval_hosted_llm.sh toolace gpt5 src \
  data/tool_ace/generated data/tool_ace/generated data.json.test
scripts/eval_hosted_llm.sh when2call gpt5 src \
  data/when2call/generated data/when2call/generated when2call_test_mcq.jsonl
```

You can also regenerate the results by adding `--reset` to the command line and setting a valid API key.
This removes the existing result files and then queries the models again.
```commandline
# A valid key is required here.
export OPENAI_API_KEY="YOUR_OPENAI_KEY"

scripts/eval_hosted_llm.sh apibank gpt4 src \
  data/api_bank/generated/test-data data/api_bank/generated/test-data level-1-api.json,level-2-api.json --reset
scripts/eval_hosted_llm.sh toolace gpt4 src \
  data/tool_ace/generated data/tool_ace/generated data.json.test --reset
scripts/eval_hosted_llm.sh when2call gpt4 src \
  data/when2call/generated data/when2call/generated when2call_test_mcq.jsonl --reset

scripts/eval_hosted_llm.sh apibank gpt5 src \
  data/api_bank/generated/test-data data/api_bank/generated/test-data level-1-api.json,level-2-api.json --reset
scripts/eval_hosted_llm.sh toolace gpt5 src \
  data/tool_ace/generated data/tool_ace/generated data.json.test --reset
scripts/eval_hosted_llm.sh when2call gpt5 src \
  data/when2call/generated data/when2call/generated when2call_test_mcq.jsonl --reset
```

Each command runs the test on the given model and dataset for both schema-constrained generation (control) and template-based generation (treatment).
You should see a report like the following. `marco_averaged_F1_api` and `marco_averaged_F1_param` (the macro-averaged F1 scores, spelled as printed by the script) are the values reported in the paper.
```
                             Control    Treatment      Delta  P-value
golden_api_num           1295.000000  1295.000000   0.000000      NaN
golden_param_num         2970.000000  2970.000000   0.000000      NaN
predicted_api_num        1600.000000  1671.000000  71.000000      NaN
predicted_param_num      3964.000000  3963.000000  -1.000000      NaN
correct_api_num          1094.000000  1126.000000  32.000000      NaN
correct_param_num        2392.000000  2408.000000  16.000000      NaN
prediction_amount           0.438116     0.457558   0.019441      NaN
micro_averaged_P_api        0.683750     0.673848  -0.009902      NaN
micro_averaged_R_api        0.844788     0.869498   0.024710      NaN
micro_averaged_F1_api       0.755786     0.759272   0.003486      NaN
micro_averaged_P_param      0.603431     0.607620   0.004190      NaN
micro_averaged_R_param      0.805387     0.810774   0.005387      NaN
micro_averaged_F1_param     0.689934     0.694649   0.004715      NaN
marco_averaged_P_api        0.299562     0.308324   0.008762   0.0062
marco_averaged_R_api        0.299562     0.308324   0.008762   0.0044
marco_averaged_F1_api       0.299562     0.308324   0.008762   0.0062
marco_averaged_P_param      0.259890     0.264128   0.004238   0.1606
marco_averaged_R_param      0.271791     0.274287   0.002496   0.4132
marco_averaged_F1_param     0.263507     0.267040   0.003533   0.2402
```
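As a sanity check on how to read the report, the micro-averaged rows appear to follow directly from the count rows at the top: precision is `correct / predicted`, recall is `correct / golden`, and F1 is their harmonic mean. The sketch below is not taken from the project's code; the function name `micro_prf` is hypothetical, and the formulas are an assumption that happens to match the sample numbers above. The macro-averaged rows need per-example predictions and are not covered here.

```python
def micro_prf(correct: int, predicted: int, golden: int) -> tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 from corpus-level counts.

    Hypothetical helper: assumes the standard definitions, which agree
    with the sample report but are not copied from the project source.
    """
    precision = correct / predicted if predicted else 0.0
    recall = correct / golden if golden else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts copied from the Control column of the sample report:
# correct_api_num, predicted_api_num, golden_api_num.
p, r, f1 = micro_prf(correct=1094, predicted=1600, golden=1295)
print(f"{p:.6f} {r:.6f} {f1:.6f}")  # 0.683750 0.844788 0.755786
```

The printed values match the `micro_averaged_P_api`, `micro_averaged_R_api` and `micro_averaged_F1_api` entries in the Control column.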

To test GPT-5 with a different reasoning effort, set the environment variable `REASON` to one of the following values:
`minimal`, `low`, `medium`, `high`. The default value is `minimal`.
```commandline
REASON=high scripts/eval_hosted_llm.sh apibank gpt5 src \
  data/api_bank/generated/test-data data/api_bank/generated/test-data level-1-api.json,level-2-api.json
```
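The evaluation commands in this section differ only in the dataset, the model, the optional `--reset` flag and the optional `REASON` value. If you want to script a sweep, a small generator like the following may help. It is a sketch, not part of the repository: `build_command` and `DATASETS` are hypothetical names, and the argument order is simply copied from the commands shown in this README. It only prints command lines and does not run anything.

```python
import shlex

# Per-dataset arguments copied from the commands in this README:
# the data directory (passed twice in those commands) and the test file list.
DATASETS = {
    "apibank": ("data/api_bank/generated/test-data", "level-1-api.json,level-2-api.json"),
    "toolace": ("data/tool_ace/generated", "data.json.test"),
    "when2call": ("data/when2call/generated", "when2call_test_mcq.jsonl"),
}
REASON_VALUES = ("minimal", "low", "medium", "high")


def build_command(dataset, model, reset=False, reason=None):
    """Return one eval_hosted_llm.sh command line as a string (hypothetical helper)."""
    data_dir, files = DATASETS[dataset]
    args = ["scripts/eval_hosted_llm.sh", dataset, model, "src", data_dir, data_dir, files]
    if reset:
        args.append("--reset")
    cmd = shlex.join(args)
    if reason is not None:
        if reason not in REASON_VALUES:
            raise ValueError(f"REASON must be one of {REASON_VALUES}")
        cmd = f"REASON={reason} {cmd}"
    return cmd


# Print a GPT-5 sweep over all datasets and reasoning efforts.
for dataset in DATASETS:
    for reason in REASON_VALUES:
        print(build_command(dataset, "gpt5", reason=reason))
```

The printed lines can be pasted into a shell or written to a script file; add `reset=True` only when you intend to discard the existing result files.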

### Test on Mistral and DeepSeek-Coder

#### Fine-tune the models

First, fine-tune Mistral and DeepSeek-Coder on the datasets with both approaches: schema-constrained generation and template-based generation.
The models are fine-tuned on a machine with 8 A100 GPUs using FSDP. Training runs for up to 5 epochs with early stopping.
The per-device batch size is 4 and the number of gradient accumulation steps is 8, so the effective batch size is 8 × 4 × 8 = 256.
To stay consistent across models and datasets, we did not use chat templates, because:
1. Mistral's chat template restricts the order of prompts, which does not fit datasets such as API-Bank.
2. DeepSeek-Coder models fine-tuned with the chat template produce empty results on datasets such as When2Call.
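The effective batch size quoted above is the product of the number of GPUs, the per-device batch size and the gradient accumulation steps. If you train on different hardware, the following sketch shows how the accumulation steps could be adjusted to keep that product at 256. The helper names are hypothetical and not part of the repository.

```python
def effective_batch_size(num_devices: int, per_device_batch_size: int, grad_accum_steps: int) -> int:
    """Number of samples contributing to one optimizer step."""
    return num_devices * per_device_batch_size * grad_accum_steps


def grad_accum_steps_for(target: int, num_devices: int, per_device_batch_size: int) -> int:
    """Accumulation steps needed to reach `target` samples per optimizer step."""
    per_step = num_devices * per_device_batch_size
    if target % per_step != 0:
        raise ValueError(f"{target} is not a multiple of {per_step}")
    return target // per_step


# Setup described in this README: 8 GPUs, batch size 4 per device, 8 accumulation steps.
print(effective_batch_size(8, 4, 8))    # 256
# Hypothetical 4-GPU machine with the same per-device batch size.
print(grad_accum_steps_for(256, 4, 4))  # 16
```

Matching the effective batch size does not guarantee identical results on other hardware; it only keeps this one hyperparameter aligned with the setup described here.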

The command looks like the following:
```commandline
TF32_CUDA_ALLOW=1 accelerate launch \
  --mixed_precision bf16 \
  --num_processes 8 \
  --num_machines 1 \
  --use_fsdp \
  --fsdp_sharding_strategy FULL_SHARD \
  --fsdp_auto_wrap_policy TRANSFORMER_BASED_WRAP \
  --fsdp_transformer_layer_cls_to_wrap ${FSDP_LAYER} \
  --fsdp_offload_params True \
  --fsdp_activation_checkpointing True \
  src/trainer.py \
  --model_name_or_path ${MODEL_NAME} \
  --truth_paths ${TRUTH_PATH} \
  --data_gen_mode ${DATA_GEN_MODE} \
  --output_dir ${OUTPUT_DIR} \
  --log_dir ${LOG_DIR} \
  --num_train_epochs 5 \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 4 \
  --eval_steps 4 \
  --gradient_accumulation_steps 8 \
  --model_max_length 3000 \
  --save_total_limit 3 \
  --learning_rate 2e-5 \
  --weight_decay 0. \
  --warmup_ratio 0.03 \
  --lr_scheduler_type cosine \
  --early_stopping_threshold 0.001 \
  --early_stopping_patience 3 \
  --logging_steps 1
```
For Mistral, `FSDP_LAYER` is `MistralDecoderLayer` and `MODEL_NAME` is `mistralai/Mistral-7B-Instruct-v0.3`.
For DeepSeek-Coder, `FSDP_LAYER` is `LlamaDecoderLayer` and `MODEL_NAME` is `deepseek-ai/deepseek-coder-7b-instruct-v1.5`.

To fine-tune the schema-constrained generation model, set `DATA_GEN_MODE` to 1. To fine-tune the template-based generation model, set `DATA_GEN_MODE` to 2.

The `TRUTH_PATH` for each dataset is as follows. These files are generated in the previous [Convert the original data](#convert-the-original-data) step.
* API-Bank: `data/api_bank/generated/training-data/lv1-api-train.json data/api_bank/generated/training-data/lv2-api-train.json`
* ToolACE: `data/tool_ace/generated/data.json.train`
* When2Call: `data/when2call/generated/when2call_train_pref.jsonl`
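Since each fine-tuning run is determined by a model, a dataset and an approach, it can help to keep the variable values listed above in one place. The sketch below collects them and prints `export` lines for one run. It is an illustration only: `training_env` is a hypothetical helper, and the `OUTPUT_DIR` and `LOG_DIR` layout under `outputs/` is an assumption, since this README does not prescribe those paths.

```python
import shlex

# Values copied from this README.
MODELS = {
    "mistral": ("mistralai/Mistral-7B-Instruct-v0.3", "MistralDecoderLayer"),
    "deepseek-coder": ("deepseek-ai/deepseek-coder-7b-instruct-v1.5", "LlamaDecoderLayer"),
}
DATA_GEN_MODES = {"schema-constrained": "1", "template-based": "2"}
TRUTH_PATHS = {
    "api_bank": (
        "data/api_bank/generated/training-data/lv1-api-train.json "
        "data/api_bank/generated/training-data/lv2-api-train.json"
    ),
    "tool_ace": "data/tool_ace/generated/data.json.train",
    "when2call": "data/when2call/generated/when2call_train_pref.jsonl",
}


def training_env(model: str, dataset: str, approach: str, root: str = "outputs") -> dict[str, str]:
    """Collect the variables used by the training command (hypothetical helper)."""
    model_name, fsdp_layer = MODELS[model]
    run = f"{root}/{dataset}/{model}/{approach}"  # assumed layout, not from the README
    return {
        "MODEL_NAME": model_name,
        "FSDP_LAYER": fsdp_layer,
        "DATA_GEN_MODE": DATA_GEN_MODES[approach],
        "TRUTH_PATH": TRUTH_PATHS[dataset],
        "OUTPUT_DIR": f"{run}/model",
        "LOG_DIR": f"{run}/logs",
    }


env = training_env("mistral", "api_bank", "template-based")
for key, value in env.items():
    print(f"export {key}={shlex.quote(value)}")
```

Note that the API-Bank `TRUTH_PATH` holds two space-separated files, so it must stay unquoted where the training command expands `${TRUTH_PATH}` for both paths to be passed as separate arguments.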

#### Run tests
For each dataset (API-Bank/ToolACE/When2Call) and each model type (Mistral/DeepSeek-Coder),
you should now have two models, one fine-tuned with each approach.
Suppose Model A is tuned for schema-constrained generation and stored at `DIR_TO_MODELS/MODEL_A`,
and Model B is tuned for template-based generation and stored at `DIR_TO_MODELS/MODEL_B`.
To evaluate the results, use the following commands:

```commandline
# API-Bank
scripts/eval_local_hf_llm.sh DIR_TO_MODELS MODEL_A MODEL_B general data/api_bank/generated/test-data data/api_bank/generated/test-data level-1-api.json,level-2-api.json cuda 1

# ToolACE
scripts/eval_local_hf_llm.sh DIR_TO_MODELS MODEL_A MODEL_B general data/tool_ace/generated data/tool_ace/generated data.json.test cuda 1

# When2Call
scripts/eval_local_hf_llm.sh DIR_TO_MODELS MODEL_A MODEL_B general data/when2call/generated data/when2call/generated when2call_test_mcq.jsonl cuda 1
```
