Potemkin Understanding in Large Language Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We find that large language models exhibit a category of failure in conceptual comprehension called potemkin understanding.
Abstract: Large language models (LLMs) are regularly evaluated using benchmark datasets. But what justifies making inferences about an LLM's capabilities based on its answers to a curated set of questions? This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs---such as AP exams---are also those used to test people. However, this raises an implication: such benchmarks are only valid tests if LLMs misunderstand concepts in ways that mirror human misunderstandings. Otherwise, success on benchmarks only demonstrates **potemkin understanding**: the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept. We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower bound on their prevalence. We find that potemkins are ubiquitous across models, tasks, and domains. We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations.
Lay Summary: Large language models (LLMs), like ChatGPT, are typically assessed using benchmarks—standardized tests similar to those used for humans. Our research reveals an implicit assumption in this approach: that LLMs that score well possess the same capabilities as humans who do. If this isn't true, good benchmark performance reflects what we term "potemkin understanding," where models correctly answer benchmark questions but fail simpler tasks any human with true conceptual understanding would handle easily. We developed two methods to detect potemkin understanding. The first method involves creating a dataset of potemkins in three areas: literary techniques, game theory, and psychological biases. The second method automatically identifies potemkins without needing human-labeled data. Applying these methods to various models, we found potemkin understanding to be widespread. Identifying potemkin understanding challenges the validity of current AI evaluations, helping distinguish superficial correct answers from true conceptual understanding and guiding the development of more reliable, intelligent models.
Link To Code: https://github.com/MarinaMancoridis/PotemkinBenchmark
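The linked repository contains the authors' benchmark and evaluation code. As a rough, hypothetical illustration only of the define-versus-use gap described in the abstract and lay summary, the minimal Python sketch below probes a model for one such inconsistency: the `query_model` callable and the `potemkin_probe` helper are names introduced here for illustration and are not part of the paper's code.

```python
# Minimal sketch (not the paper's implementation): probe for a potemkin-style
# inconsistency by asking a model to define a concept, then to classify a
# concrete instance of it. `query_model` is a hypothetical callable that sends
# a prompt to an LLM and returns its text response; plug in any model client.

from typing import Callable


def potemkin_probe(query_model: Callable[[str], str],
                   concept: str,
                   instance: str,
                   is_instance: bool) -> dict:
    """Return the model's definition of `concept` and whether it classified
    `instance` correctly. A plausible definition paired with a wrong
    classification is the kind of incoherence the paper calls a potemkin."""
    definition = query_model(f"Define the concept: {concept}.")
    verdict = query_model(
        f"Is the following an example of {concept}? "
        f"Answer YES or NO only.\n\n{instance}"
    )
    predicted = verdict.strip().upper().startswith("YES")
    return {
        "definition": definition,
        "classified_correctly": predicted == is_instance,
    }


# Example usage with a concept in the spirit of the literary-techniques domain:
# result = potemkin_probe(
#     query_model,
#     concept="haiku",
#     instance="An old silent pond / A frog jumps into the pond / Splash! Silence again.",
#     is_instance=True,
# )
```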
Primary Area: Deep Learning->Large Language Models
Keywords: evaluation, benchmark, large language models, potemkin understanding
Submission Number: 12975