Keywords: LLM evaluation, Conceptual knowledge boundaries, Multi-agent game
TL;DR: A benchmark framework based on the UNDERCOVER game that evaluates LLMs' ability to recognize knowledge boundaries and perform strategic reasoning.
Abstract: Concepts are generalized abstractions that allow humans to categorize and reason efficiently. Whether Large Language Models (LLMs) possess a similar understanding of conceptual relationships, however, is not yet well established. Existing benchmarks primarily focus on factual recall or narrow tasks (\textit{e.g.}, multiple-choice question answering or knowledge quizzes), offering limited insight into whether models understand conceptual relationships and subtle distinctions (\textit{e.g.}, poetry \textit{vs.} prose). Many also rely on static datasets that risk overfitting. To address this gap, we introduce CK-Arena, a multi-agent interaction benchmark inspired by the Undercover game, designed to evaluate LLMs' mastery of conceptual feature knowledge. In CK-Arena, models must describe, differentiate, and infer distinguishing features of concepts from partial information, testing their ability to reason about both commonalities and differences across concept boundaries. The benchmark offers scalable datasets, rigorous evaluation protocols, and flexible extension methods, enabling comprehensive assessment of LLMs' conceptual understanding across multiple dimensions. Experimental results show that LLMs' understanding of conceptual knowledge varies significantly across categories and is not strictly aligned with general model capabilities. The code is made publicly available at: https://anonymous.4open.science/r/CK-Arena/readme.md.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 15009