GovBench: From Natural Language to Executable Pipelines, A New Benchmark for Data Governance Automation
Keywords: benchmarks, agent, data governance, large language models
TL;DR: A benchmark and agentic system for automating data governance: from natural language instructions to executable data pipelines.
Abstract: Data governance is essential for scaling modern AI development. To automate it, numerous tools and models have emerged that translate user intent into executable governance code, yet their effectiveness remains largely unverified: evaluation is severely hampered by the lack of a realistic, standardized, and quantifiable benchmark. This gap obstructs systematic assessment of utility and impedes further innovation in the field. To bridge it, we introduce GovBench, a benchmark featuring a diverse set of tasks with targeted noise to simulate real-world scenarios and standardized scoring scripts for reproducible evaluation. Our analysis reveals that current data governance tools and models struggle with complex, multi-step workflows and lack robust error-correction mechanisms. We therefore propose DataGovAgent, a novel framework for end-to-end data governance built on a Planner-Executor-Evaluator architecture that combines contract-guided planning, retrieval from a reliable operator library, and sandboxed meta-cognitive debugging. Experimental results validate our approach: DataGovAgent significantly boosts the Average Task Score (ATS) on complex Directed Acyclic Graph (DAG) tasks from 39.7 to 54.9 and reduces debugging iterations by over 77.9% compared to general-purpose agent frameworks, a step toward more reliable automation of data governance. Code is available at https://anonymous.4open.science/r/GovBench-F6C6.
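To make the Planner-Executor-Evaluator architecture concrete, the following is a minimal Python sketch of such a loop: a planner decomposes an instruction into steps, an executor produces code for each step, and an evaluator runs that code in a subprocess sandbox and surfaces feedback for debugging retries. All names and interfaces here are illustrative assumptions for exposition, not the paper's actual API.

```python
# Hypothetical Planner-Executor-Evaluator sketch; function names and
# interfaces are assumptions, not DataGovAgent's real implementation.
import subprocess
import sys
import tempfile


def plan(instruction: str) -> list[str]:
    """Decompose a natural-language governance request into ordered steps.

    A real planner would call an LLM and check the plan against a
    contract (expected input/output schemas); here we return a fixed plan.
    """
    return [
        "load the raw table",
        "drop rows with null primary keys",
        "mask PII columns",
        "write the governed table",
    ]


def execute(step: str) -> str:
    """Map a plan step to executable code.

    A real executor would retrieve a vetted operator from an operator
    library; this stub just emits a print statement per step.
    """
    return f"print('executing: {step}')"


def evaluate(code: str) -> tuple[bool, str]:
    """Run generated code in a subprocess sandbox, returning (ok, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=30
    )
    return result.returncode == 0, result.stderr


def run(instruction: str, max_debug_iters: int = 3) -> None:
    for step in plan(instruction):
        code = execute(step)
        for attempt in range(max_debug_iters):
            ok, feedback = evaluate(code)
            if ok:
                break
            # A real system would feed `feedback` back to the model to
            # repair the code; this sketch only reports the failure.
            print(f"attempt {attempt + 1} failed: {feedback}", file=sys.stderr)


if __name__ == "__main__":
    run("Clean and anonymize the customer table")
```

The subprocess sandbox keeps generated code from corrupting the agent's own process, and the bounded retry loop is where an error-correction mechanism of the kind the abstract describes would plug in.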
Primary Area: datasets and benchmarks
Submission Number: 3795