TRACEBench: Personalized Function Calling Benchmark Based on Real-World Human Interaction

15 Sept 2025 (modified: 01 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Personalization, Tool Use, LLM Benchmarks, Large Language Models
Abstract: Function calling has emerged as a central paradigm for extending the capabilities of Large Language Models (LLMs), enabling them to overcome inherent limitations such as lack of access to real-time information. Personalized tool use, in which a model adaptively selects and invokes tools based on an individual user's profile and interaction history, is essential for effective LLM agents. However, most current benchmarks rely on LLMs to simulate user interaction histories rather than using real-world interaction data, and these histories are typically short, offering limited evaluation of a model's long-context understanding. In this work, we introduce TRACEBench, a benchmark that evaluates LLMs' function-calling capabilities along three axes of personalization: tool, parameter, and temporal context. A key difference from prior work is our data sourcing strategy: TRACEBench is built upon authentic user interaction histories collected from human volunteers, which provides a realistic foundation of user behavior and has been anonymized to protect user privacy. We further construct long tool-use records to support evaluating and improving the long-context understanding of tool-augmented models. User instructions are reverse-generated from target tool calls, with varying levels of instruction specificity to simulate different degrees of personalization. Extensive experiments offer insights into improving personalized LLM agents. Our code is available at https://anonymous.4open.science/r/TRACEBench-5CC4.
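To make the reverse-generation idea concrete, the following is a minimal, hypothetical Python sketch of a personalized tool-use record and of instruction generation at two specificity levels. The `ToolCall`/`UserRecord` schema, the `reverse_generate` helper, and the `explicit`/`personalized` levels are illustrative assumptions for this sketch, not TRACEBench's actual data format or pipeline.

```python
# Hypothetical sketch: a personalized tool-use record and "reverse
# generation" of user instructions at varying specificity. All names
# and fields here are assumptions, not the benchmark's real schema.
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:
    name: str                  # tool to invoke, e.g. "book_restaurant"
    arguments: dict[str, Any]  # fully specified argument values


@dataclass
class UserRecord:
    history: list[ToolCall]    # long, anonymized interaction history
    target: ToolCall           # gold tool call to be reproduced


def reverse_generate(record: UserRecord, specificity: str) -> str:
    """Turn a target tool call back into a user instruction.

    'explicit'     -> every argument value is stated in the instruction.
    'personalized' -> values already present in the user's history are
                      omitted, so the model must infer them from past
                      behavior (the personalized setting).
    """
    args = record.target.arguments
    if specificity == "explicit":
        detail = ", ".join(f"{k}={v!r}" for k, v in args.items())
        return f"Please call {record.target.name} with {detail}."
    # Drop any value the user has already used in a past tool call.
    seen = {v for call in record.history for v in call.arguments.values()}
    detail = ", ".join(f"{k}={v!r}" for k, v in args.items() if v not in seen)
    suffix = f" with {detail}." if detail else " the way I usually do."
    return f"Please call {record.target.name}{suffix}"


if __name__ == "__main__":
    history = [ToolCall("book_restaurant", {"cuisine": "thai", "party_size": 2})]
    target = ToolCall("book_restaurant",
                      {"cuisine": "thai", "party_size": 2, "time": "19:00"})
    record = UserRecord(history, target)
    print(reverse_generate(record, "explicit"))      # all arguments stated
    print(reverse_generate(record, "personalized"))  # cuisine/size inferred
```

In this toy setup, the personalized instruction omits the cuisine and party size because they recur in the user's history, leaving the model to recover them from long-context interaction records, which mirrors the evaluation pressure the abstract describes.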
Primary Area: datasets and benchmarks
Submission Number: 5788