Keywords: Data Agent, Code Generation, LLM Benchmark
TL;DR: We introduce DAComp, the first benchmark for the full data intelligence lifecycle.
Abstract: Real-world enterprise data intelligence workflows encompass data engineering, which turns raw sources into analysis-ready tables, and data analysis, which converts those tables into decision-oriented insights.
We introduce DAComp, a benchmark of 236 tasks that mirrors these complex workflows.
Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and extending existing systems to meet evolving requirements.
Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations.
Engineering tasks are scored through execution-based, multi-metric evaluation.
Open-ended tasks are assessed by a reliable, experimentally validated LLM judge guided by hierarchical, meticulously crafted rubrics.
Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 10\%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40\%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities.
By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings.
Our data and code are available at \url{https://anonymous.4open.science/r/DAComp-397A}.
Primary Area: datasets and benchmarks
Submission Number: 3303