Keywords: data science agents, data science, data wrangling, data analysis, data management, reasoning, agentic systems
TL;DR: KramaBench is a benchmark of 104 real-world data science pipelines showing that while LLM systems can generate code and rough plans, they fall short of building data science pipelines that work on real-world data.
Abstract: Discovering insights from a real-world data lake that may contain unclean, semi-structured, and unstructured data requires a variety of data processing tasks, ranging from extraction and cleaning to integration, analysis, and modeling. This process often also demands domain knowledge and project-specific insight. While AI models have shown remarkable results in reasoning and code generation, their ability to design and execute complex pipelines that solve these data-lake-to-insight challenges remains unclear. We introduce KramaBench, which consists of 104 manually curated and solved challenges spanning 1,700 files, 24 data sources, and 6 domains. KramaBench focuses on testing the end-to-end capability of AI systems to solve challenges that require automated orchestration of different data tasks. KramaBench also features a comprehensive evaluation framework that assesses both the pipeline-design and the individual data-task-implementation abilities of AI systems. Evaluating 8 LLMs with our single-agent reference framework, DS-Guru, alongside open- and closed-source agentic systems, we find that while current single-agent systems can handle isolated data science tasks and generate plausible draft pipelines, they struggle to produce working end-to-end pipelines. On KramaBench, the best system reaches only 50% end-to-end accuracy in the full data-lake setting; even with perfect retrieval, accuracy tops out at 59%. Leading LLMs can identify up to 42% of the important data tasks but can fully implement only 20% of individual data tasks.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 19634