TELLER-Bench: Tool-Enhanced LLM Agent Evaluation for Real-World Banking

ACL ARR 2026 January Submission 8746 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large Language Models, Banking Workflows, Benchmark, Function Calling, Multi-step Planning, Agent Evaluation, API Dependency
Abstract: Effective evaluation of multi-step tool invocation in complex workflows is critical for analyzing LLMs' planning, reasoning, and execution capabilities in real-world applications. However, progress has been limited by a lack of faithful benchmarks capturing the detailed logic of safety-critical domains such as banking. To fill this gap, we present TELLER, a benchmark containing 1,033 test instances across five banking scenarios, designed for thorough evaluation of LLMs in complex workflows. TELLER ensures realistic Standard Operating Procedure (SOP) constraints, complex API dependencies, and verifiable results through a two-stage framework comprising dependency graph reconstruction and end-to-end execution. We evaluate 14 LLMs across five model families (Claude, Gemini, GPT, DeepSeek, Qwen), revealing significant challenges: the leading model, Gemini-3-Pro, achieves only 38% execution accuracy, while open-source models below 32B parameters fall below 11%. Further studies reveal weaknesses in understanding tool dependencies and in precise invocation, providing insights for future optimization. By establishing a high-quality benchmark for diverse banking workflows, TELLER lays the groundwork for advancing LLM agent deployment in the real-world financial industry.
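
To make the abstract's "API dependency" and "verifiable results" ideas concrete, here is a minimal sketch of how a test instance might encode a banking workflow's API dependency graph and check whether a model's predicted call sequence respects it. This is an illustration only: the API names (get_customer, check_kyc, open_account, issue_card) and the exact verification logic are hypothetical assumptions, not the paper's actual schema or evaluation code.

```python
# Hypothetical sketch of dependency-aware verification for a
# TELLER-style banking workflow. All API names are invented.
from graphlib import TopologicalSorter

# Each API maps to the set of APIs whose outputs it depends on.
dependencies = {
    "get_customer": set(),
    "check_kyc": {"get_customer"},
    "open_account": {"check_kyc"},
    "issue_card": {"open_account"},
}

def respects_dependencies(call_sequence, deps):
    """Return True if every API is invoked only after all of its prerequisites."""
    seen = set()
    for api in call_sequence:
        if not deps.get(api, set()) <= seen:
            return False  # a prerequisite has not been called yet
        seen.add(api)
    return True

# A valid reference ordering exists whenever the graph is acyclic.
reference_order = list(TopologicalSorter(dependencies).static_order())
print(reference_order)
# ['get_customer', 'check_kyc', 'open_account', 'issue_card']

# Check a model-predicted invocation order against the graph.
predicted = ["get_customer", "open_account", "check_kyc", "issue_card"]
print(respects_dependencies(predicted, dependencies))
# False: open_account was called before its prerequisite check_kyc
```

Under these assumptions, a dependency-graph check of this kind would catch ordering errors (the abstract's "weaknesses in understanding tool dependencies") even before end-to-end execution verifies the final result.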
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: LLM agents, tool use, function calling, agent evaluation, planning in agents
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 8746