Top of the CLASS: Benchmarking LLM Agents on Real-World Enterprise Tasks

Published: 05 Mar 2025, Last Modified: 16 Apr 2025
Venue: BuildingTrust
License: CC BY 4.0
Track: Tiny Paper Track (between 2 and 4 pages)
Keywords: agents, llms, benchmarks, enterprise, conversational, workflows
TL;DR: A benchmark for evaluating LLM agents on real-world enterprise chatbot conversations across 7 diverse use cases.
Abstract: Enterprises are increasingly adopting AI agents based on large language models (LLMs) for mission-critical workflows. However, most existing benchmarks use synthetic or consumer-oriented data and do not holistically evaluate agents on operational concerns beyond accuracy (e.g., cost and security). To address these gaps, we propose CLASSIC, a novel benchmark containing 2,133 real-world user-chatbot conversations and 423 workflows across 7 enterprise domains, including IT, HR, banking, and healthcare. We evaluate LLMs across five key metrics -- Cost, Latency, Accuracy, Stability, and Security -- on a multiclass classification task that requires the model to select the proper workflow to trigger in response to a user message. Our dataset of real-world conversations is challenging, with the best LLM achieving an overall accuracy of only 76.1%. Across all five metrics, we find significant variation in performance: for example, Gemini 1.5 Pro refuses only 78.5% of our jailbreak prompts compared to Claude 3.5 Sonnet's 99.8%, while GPT-4o costs 5.4x more than the most affordable model we evaluate. We hope that our benchmark helps increase trust in LLM applications by better grounding evaluations in real-world enterprise data. We open-source our code and data and welcome contributions from the community.
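As a rough illustration of the evaluation setup described in the abstract (not the authors' actual harness), the sketch below shows how a workflow-selection task of this kind can be scored: the model is shown a user message and a list of candidate workflows, asked to name exactly one, and graded against the labeled workflow. The function `call_llm`, the example workflow names, and the sample messages are hypothetical placeholders introduced only for this sketch.

```python
# Minimal sketch of workflow selection as multiclass classification.
# call_llm and the example workflows/messages below are illustrative
# placeholders, not part of the CLASSIC benchmark itself.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call (e.g., an OpenAI or Anthropic client)."""
    return "reset_password"  # stubbed response so the sketch runs offline

def classify_workflow(user_message: str, workflows: list[str]) -> str:
    options = "\n".join(f"- {w}" for w in workflows)
    prompt = (
        "You are an enterprise chatbot router. Choose the single workflow that "
        "best handles the user's message. Reply with the workflow name only.\n\n"
        f"Workflows:\n{options}\n\nUser message: {user_message}\nWorkflow:"
    )
    answer = call_llm(prompt).strip()
    # Fall back to a simple containment match if the model adds extra text.
    return next((w for w in workflows if w in answer), answer)

if __name__ == "__main__":
    examples = [
        ("I forgot my login password, can you help?", "reset_password"),
        ("How many vacation days do I have left?", "check_pto_balance"),
    ]
    workflows = ["reset_password", "check_pto_balance", "open_it_ticket"]
    correct = sum(classify_workflow(msg, workflows) == label for msg, label in examples)
    print(f"Accuracy: {correct / len(examples):.1%}")
```

In a full evaluation, the same loop would also record token cost, latency, run-to-run stability, and refusal behavior on adversarial prompts, corresponding to the five metrics the benchmark reports.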
Submission Number: 109