Keywords: LLMs, Error Analysis, Taxonomy
Abstract: Large Language Model (LLM) benchmarks tell us when models fail, but not why they fail. A wrong answer on a reasoning dataset, for instance, may not reflect weak reasoning at all, but instead a formatting slip, a calculation error, or dataset noise. Without disentangling such causes, benchmarks give an incomplete picture and cannot reliably guide model improvement. We introduce ErrorMAp, the first method to systematically chart the sources of LLM failure. ErrorMAp provides tools to extract a model's unique "failure signature", uncover what benchmarks actually measure in practice, and broaden the scope of identified model errors to reduce blind spots. This enables developers to debug models more effectively and helps benchmark creators align dataset goals with actual outcomes. It also supports benchmark consumers in identifying which models best suit their specific needs. ErrorMAp works flexibly with any model and dataset, making it adaptable to evolving architectures and emerging data sources without requiring changes to its logic. We apply our method across 21 datasets and 73 models to automatically generate ErrorAtlas, a taxonomy of model errors that reveals recurring failure patterns in current language models. ErrorAtlas highlights error types that are currently underexplored in LLM research, such as omission of required details from the output and misinterpretation of the question. By shifting focus from where models succeed to why they fail, ErrorMAp and ErrorAtlas lay the foundation for next-generation evaluation: one that exposes hidden weaknesses and directs meaningful progress. Unlike success, which is typically measured with task- or dataset-level metrics, our approach introduces a deeper layer of evaluation that applies globally across models and tasks, offering richer insights into model behavior and limitations.
We make the taxonomy and method code publicly available, with plans to update ErrorAtlas as new benchmarks emerge.
Primary Area: interpretability and explainable AI
Submission Number: 11857