Keywords: Large Language Models, Benchmark, Hardware Design Automation, Hardware Code Generation, Hardware Design Verification, AI Agent
TL;DR: The proposed benchmark introduces 783 expert-crafted RTL design and verification tasks, surpassing previous benchmarks like VerilogEval and RTLLM in complexity and scope.
Abstract: We present the XYZ benchmark [note to reviewers: name withheld in accordance with ICLR double-blind policy], a new dataset and infrastructure to advance LLM and agent research in hardware design and verification. XYZ includes 783 problems across 13 task categories, covering RTL generation, verification, debugging, specification alignment, and technical Q&A, all authored by experienced hardware engineers. Problems are offered in both non-agentic and agentic formats. The benchmark introduces more realistic and challenging contexts than prior work, with state-of-the-art models achieving no more than 34% pass@1 on code generation. Agentic tasks, especially those involving RTL reuse and verification, are particularly difficult. Evaluation relies on open-source tools and model-scoring infrastructure, with comprehension tasks assessed via BLEU and LLM-based judging. XYZ reveals substantial gaps in current model capabilities, underscoring the need for continued research toward robust, real-world hardware design automation.
Primary Area: datasets and benchmarks
Submission Number: 13630