The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models
Abstract: Function calling, also called tool use, refers to an LLM's ability to invoke external functions, APIs, or user-defined tools in response to user queries, an essential capability for agentic LLM applications. Despite its prominence, no standard benchmark existed for evaluating function-calling ability, for two reasons: the difficulty of judging when a function call is valid, and the difficulty of acquiring diverse, real-world functions. We present the Berkeley Function Calling Leaderboard (BFCL), a comprehensive benchmark designed to evaluate function-calling capabilities across a wide range of real-world settings. BFCL evaluates serial and parallel function calls across multiple programming languages using a novel Abstract Syntax Tree (AST) evaluation method that scales easily to thousands of functions. We construct the benchmark from a combination of expert-curated and user-contributed functions and their associated prompts. Finally, BFCL evaluates models' ability to abstain and to reason in stateful, multi-step agentic settings. Evaluating a wide range of models, we observe that while state-of-the-art LLMs excel at single-turn calls, memory, dynamic decision-making, and long-horizon reasoning remain open challenges. Since its preview, BFCL has become the de facto standard for evaluating function calling, and it can be accessed at gorilla.cs.berkeley.edu/leaderboard.html.
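To give a feel for AST-based evaluation, the sketch below is a simplified, hypothetical illustration rather than the leaderboard's actual matcher: it parses a model-generated call with Python's `ast` module and checks the function name and argument values against an expected specification instead of executing the call. The `check_call` helper, the `get_weather` function, and its parameters are made up for this example.

```python
import ast

def check_call(model_output: str, expected_name: str, expected_args: dict) -> bool:
    """Hypothetical AST-style check (simplified illustration, not BFCL's code):
    does the generated call name the expected function, and does every expected
    parameter take one of its acceptable values?"""
    try:
        tree = ast.parse(model_output.strip(), mode="eval")
    except SyntaxError:
        return False  # not syntactically valid, so the call cannot be correct

    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        return False
    if call.func.id != expected_name:
        return False  # wrong function invoked

    # Recover keyword arguments as literal Python values without executing anything.
    try:
        supplied = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    except ValueError:
        return False  # non-literal argument; cannot compare structurally

    # Every expected parameter must be present with an acceptable value.
    return all(
        param in supplied and supplied[param] in acceptable
        for param, acceptable in expected_args.items()
    )

# Example usage (made-up function and parameters):
print(check_call(
    'get_weather(city="Berkeley", unit="celsius")',
    expected_name="get_weather",
    expected_args={"city": ["Berkeley", "Berkeley, CA"], "unit": ["celsius"]},
))  # -> True
```

Because the check compares structure rather than running code, it avoids side effects and scales to thousands of functions without needing executable backends for each one.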
Lay Summary: Imagine a helpful AI that can look up the weather, book a flight, or crunch numbers for you simply by “calling” the right online tool or app—much like a human who opens the correct website or piece of software on command. Today’s large language models (LLMs) are starting to do exactly that, but the community lacked a fair, comprehensive, and real-world evaluation to measure how well they perform these tool-using skills.
Our work introduces the Berkeley Function Calling Leaderboard (BFCL), a public “obstacle course” that puts AIs through thousands of real-world tasks. Some tests are short and simple (one-off questions), others mimic back-and-forth conversations, and the most demanding ones ask the AI to juggle memory, reasoning and multiple steps—just as an autonomous assistant would in daily life. To grade the answers quickly and reliably, we created a novel checking method that looks at the structure of each tool call rather than running every tool for real, letting the benchmark scale to thousands of functions.
Early results reveal a split personality: top AIs ace the one-shot questions but still stumble when they must remember context, manage long conversations, or decide when not to act. By spotlighting these gaps, the BFCL gives researchers and companies a clear target for the next generation of more capable, trustworthy AI assistants. Anyone can explore the live leaderboard online and watch the field improve in real time.
Link To Code: https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard
Primary Area: Deep Learning->Large Language Models
Keywords: Function Calling Evaluation, Tool use, Agentic Evaluation, Large Language Models
Submission Number: 9959