Cloning the Unshareable: Agentic AI for Synthesizing Open, Production-Faithful Datacenter Benchmarks

Published: 09 Mar 2026, Last Modified: 09 Mar 2026 | Architecture 2.0 2026 | Everyone | CC BY 4.0
Keywords: Agentic AI, Benchmarking, Datacenters
TL;DR: Designing new datacenter hardware is hard because representative benchmarks are lacking. In this work, we propose an agentic AI framework that clones production-grade workloads without exposing their proprietary information.
Abstract: Modern processors are designed and validated using benchmarks, yet the workloads that matter most in datacenters are rarely available outside the companies that run them. Confidentiality prevents operators from open-sourcing production services at full fidelity, while public benchmark suites and hand-written proxies fail to reproduce the performance-critical properties of real deployments. This benchmark gap distorts architectural conclusions and can misguide the design of next-generation CPUs. This paper argues that agentic AI enables a new path forward: automated, privacy-preserving workload cloning that turns production telemetry into public benchmarks that are faithful to the original workload but do not reveal proprietary code or data. We treat benchmark generation as a first-class problem and argue that it requires an end-to-end, closed-loop synthesis workflow rather than static proxy design. In our envisioned approach, AI agents (i) ingest high-level workload descriptions along with performance-counter traces, system-call logs, and resource-usage time series, (ii) infer an interpretable workload model that captures the dominant phase structure and bottlenecks, (iii) synthesize benchmark code and inputs that realize this model, and (iv) iteratively validate and refine the result until it matches target signatures under controlled measurement. We outline key research challenges and opportunities for agentic benchmark cloning, including defining fidelity metrics that correlate with architectural decisions, ensuring robustness across compiler and microarchitecture changes, and providing explicit privacy guarantees. If successful, this would let operators safely share representative benchmarks and let architects evaluate designs on realistic, reproducible, and continuously changing workloads, closing the loop between production behavior and processor design.
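As a rough illustration of step (iv), the closed-loop validation and refinement idea can be sketched in a few lines of Python. This is only a toy under stated assumptions, not the authors' framework: the metric names, the knob names (`working_set_kb`, `branch_entropy`), the `measure` model, and the multiplicative update rule are all hypothetical stand-ins. In practice, `measure` would run a candidate benchmark under real performance counters, and an agent would choose the adjustments.

```python
import math

# Hypothetical target signature distilled from production telemetry.
TARGET = {"l1d_mpki": 10.0, "branch_mpki": 6.0}

# Assumed one-to-one mapping from each metric to the benchmark knob that steers it.
KNOB_FOR = {"l1d_mpki": "working_set_kb", "branch_mpki": "branch_entropy"}


def measure(knobs):
    """Toy stand-in for running the candidate benchmark under performance counters."""
    return {
        "l1d_mpki": 2.0 * math.sqrt(knobs["working_set_kb"]),
        "branch_mpki": 20.0 * knobs["branch_entropy"],
    }


def fidelity_error(measured, target):
    """Worst-case relative error across all target metrics."""
    return max(abs(measured[m] - target[m]) / target[m] for m in target)


def refine(knobs, target, tolerance=0.02, max_iters=20):
    """Measure, compare, and adjust knobs until the signature matches or the budget runs out."""
    knobs = dict(knobs)  # work on a copy so the caller's dict is untouched
    for iteration in range(max_iters):
        measured = measure(knobs)
        error = fidelity_error(measured, target)
        if error <= tolerance:
            return knobs, error, iteration
        # Scale each knob by the ratio of target to measured value for its metric.
        for metric, knob in KNOB_FOR.items():
            knobs[knob] *= target[metric] / measured[metric]
    return knobs, fidelity_error(measure(knobs), target), max_iters


if __name__ == "__main__":
    start = {"working_set_kb": 64.0, "branch_entropy": 0.5}
    knobs, error, iters = refine(start, TARGET)
    print(f"knobs={knobs} error={error:.4f} iterations={iters}")
```

The loop stops as soon as the worst-case relative error falls within the tolerance, which mirrors the "refine until it matches target signatures" criterion. A real fidelity metric would likely weight counters by how strongly they influence architectural decisions, one of the open challenges the abstract names.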
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 6