Keywords: Disambiguation, Tool-calling Agents, Enterprise Tools, Synthetic multi-turn dialogues, Dynamic evaluation, Risk mitigation, Agentic function calling
Abstract: Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce **DiaFORGE** (**Dia**logue **F**ramework for **O**rganic **R**esponse **G**eneration & **E**valuation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning, with reasoning traces, of open-source models spanning 3B to 70B parameters, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by **27 pp over GPT-4o** and by **49 pp over Claude-3.5-Sonnet**, both under optimized prompting. To spur further research, we release an open corpus of **5,000 production-grade enterprise API** specifications paired with rigorously validated, disambiguation-focused dialogues, offering a practical blueprint for building reliable, enterprise-ready tool-calling agents.
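To make the dynamic evaluation protocol of stage (iii) concrete, below is a minimal Python sketch of a live agentic loop that scores end-to-end goal completion: the model may ask clarifying questions of a simulated user, and an episode counts as a success only if it eventually issues the target tool call with the required arguments filled. All identifiers here (`Tool`, `run_episode`, the toy `agent` and `user_sim`) are hypothetical illustrations under assumed interfaces, not the paper's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    """A simplified enterprise API specification."""
    name: str
    description: str
    required_args: tuple[str, ...]


def run_episode(agent, user_sim, tools, goal, max_turns=8):
    """Redeploy a model in a live loop and report end-to-end success.

    `agent(history, tools)` returns either a clarifying message or a
    tool call; `user_sim(history, goal)` answers clarifying questions
    from the hidden goal specification. The episode succeeds only if
    the agent invokes the goal's tool with exactly the goal's arguments.
    """
    history = [{"role": "user", "content": goal["initial_query"]}]
    for _ in range(max_turns):
        action = agent(history, tools)
        if action["type"] == "tool_call":
            return (action["name"] == goal["tool"]
                    and action["args"] == goal["args"])
        # The agent asked a clarifying question; the simulated user replies.
        history.append({"role": "assistant", "content": action["content"]})
        history.append({"role": "user", "content": user_sim(history, goal)})
    return False  # Ran out of turns without a correct invocation.


if __name__ == "__main__":
    # Toy demo with two near-duplicate tools competing for one intent.
    tools = [
        Tool("create_invoice", "Create and send a final invoice", ("customer_id",)),
        Tool("create_invoice_draft", "Create an unsent draft invoice", ("customer_id",)),
    ]
    goal = {
        "initial_query": "Bill customer 42.",
        "tool": "create_invoice",
        "args": {"customer_id": "42"},
    }

    def agent(history, tools):
        # Stand-in policy: disambiguate once, then call the right tool.
        if len(history) == 1:
            return {"type": "message",
                    "content": "Should this be a final invoice or a draft?"}
        return {"type": "tool_call", "name": "create_invoice",
                "args": {"customer_id": "42"}}

    def user_sim(history, goal):
        return "A final invoice, please."

    print(run_episode(agent, user_sim, tools, goal))  # True
```

In this framing, conventional static metrics would compare a single predicted call against a reference, whereas the loop above credits the model for recovering missing arguments and resolving near-duplicate tools through dialogue before invoking anything.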
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: LLM/AI agents, fine-tuning, retrieval-augmented generation, robustness, applications, evaluation and metrics, tool-use
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 102