Keywords: software engineering, generative AI, AI agents, configuration
TL;DR: We present a large-scale public dataset of configuration artifacts for agentic AI coding tools, collected from more than 4,000 open-source repositories to enable research on context engineering, AI tool usage, and human-AI collaboration.
Abstract: Developers increasingly rely on agentic AI coding tools such as Claude Code and OpenAI Codex, which autonomously plan, execute, and iterate on coding tasks. To steer these tools, developers create repository-level configuration artifacts (e.g., Markdown files) that implement configuration mechanisms such as Context files, Skills, Rules, and Hooks. However, no curated dataset captures these configurations at scale. We address this gap by presenting a dataset of agentic AI coding tool configurations collected from open-source GitHub repositories. We selected 40,585 actively maintained repositories through metadata filtering, classified them using GPT-5.2 to identify 36,710 engineered software projects, and systematically detected configuration artifacts in these projects. The resulting dataset covers 4,741 repositories across five tools (Claude Code, GitHub Copilot, OpenAI Codex, Cursor, Gemini) and eight configuration mechanisms. It comprises 15,612 configuration artifacts, the full content of 45,126 configuration files associated with those artifacts, and 148,551 AI-co-authored commits. The dataset and the complete construction pipeline are publicly available on Zenodo under CC BY 4.0, and an interactive website allows researchers to browse and explore the data. The dataset supports research on context engineering, tool adoption patterns, and human-AI collaboration.
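The artifact detection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that Context files are recognized by well-known file names; the mapping `CONTEXT_FILE_PATTERNS` and the function `detect_context_files` are hypothetical names introduced here for illustration, and the dataset's actual detection rules may differ and cover further mechanisms (Skills, Rules, Hooks).

```python
from pathlib import PurePosixPath

# Hypothetical mapping from tool to commonly used Context file names.
# This is an assumption for illustration, not the pipeline's real rule set.
# Note that AGENTS.md is read by several tools; it is attributed to
# OpenAI Codex here only to keep the example simple.
CONTEXT_FILE_PATTERNS = {
    "Claude Code": ["CLAUDE.md"],
    "OpenAI Codex": ["AGENTS.md"],
    "GitHub Copilot": [".github/copilot-instructions.md"],
    "Cursor": [".cursorrules"],
    "Gemini": ["GEMINI.md"],
}


def detect_context_files(paths):
    """Group repository-relative file paths by the tool they appear to configure."""
    found = {}
    for raw in paths:
        path = PurePosixPath(raw)
        for tool, patterns in CONTEXT_FILE_PATTERNS.items():
            for pattern in patterns:
                if "/" in pattern:
                    # Patterns with a directory part must match the full path.
                    matched = path.as_posix() == pattern
                else:
                    # Bare file names may appear in any directory.
                    matched = path.name == pattern
                if matched:
                    found.setdefault(tool, []).append(raw)
    return found


repo_files = [
    "CLAUDE.md",
    "src/main.py",
    "docs/AGENTS.md",
    ".github/copilot-instructions.md",
]
print(detect_context_files(repo_files))
```

Matching bare file names in any directory is a deliberate choice in this sketch, since Context files can also be placed in subdirectories; a stricter variant would accept them only at the repository root.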
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 28