CXMArena: Unified Dataset to benchmark performance in realistic CXM Scenarios

Raghav Garg; Karan Gupta; Kapil Sharma

19 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0

Keywords: Benchmark Dataset, Customer Experience Management, CXM, Large Language Models, Synthetic Data Generation, Contact Center AI, Retrieval Augmented Generation, RAG, Intent Prediction, LLM, CXMArena

TL;DR: We introduce a large-scale synthetic benchmark dataset to evaluate AI performance on critical operational tasks in realistic Customer Experience Management (CXM) scenarios.

Abstract: Large Language Models (LLMs) hold immense potential for revolutionizing Customer Experience Management (CXM), particularly in contact center operations. However, evaluating their practical utility in complex operational environments is hindered by data scarcity (due to privacy concerns) and by the limitations of current benchmarks. Existing benchmarks often lack realism: they fail to incorporate deep knowledge base (KB) integration, real-world noise, or critical operational tasks beyond conversational fluency. To bridge this gap, we introduce CXMArena, a novel, large-scale synthetic benchmark dataset specifically designed for evaluating AI in operational CXM contexts. Given the diversity of possible contact center features, we have developed a scalable LLM-powered pipeline that simulates a brand's CXM entities, which form the foundation of our datasets, such as knowledge articles (including product specifications), issue taxonomies, and contact center conversations. These entities closely represent real-world distributions because of controlled noise injection (informed by domain experts) and rigorous automated validation. Building on this, we release CXMArena, which provides dedicated benchmarks targeting five important operational tasks: Knowledge Base Refinement, Intent Prediction, Agent Quality Adherence, Article Search, and Multi-turn RAG with Integrated Tools. Our baseline experiments underscore the benchmark's difficulty: even state-of-the-art embedding and generation models achieve only 68% accuracy on article search, while standard embedding methods yield a low F1 score of 0.3 for knowledge base refinement. These results highlight significant challenges for current models and indicate that the tasks require complex pipelines and solutions rather than conventional techniques.

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 16376
