Keywords: Large Language Models, Benchmark, Hardware Design Automation, Hardware Code Generation, Hardware Design Verification, AI Agent
TL;DR: The proposed benchmark introduces 783 expert-crafted RTL design and verification tasks, surpassing previous benchmarks like VerilogEval and RTLLM in complexity and scope.
Abstract: We present the XYZ benchmark [note to reviewers: name withheld in accordance with ICLR double-blind policy], a new dataset and infrastructure to advance LLM and agent research in hardware design and verification. XYZ includes 783 problems across 13 task categories, covering RTL generation, verification, debugging, specification alignment, and technical Q&A, all authored by experienced hardware engineers. Problems are offered in both non-agentic and agentic formats. The benchmark introduces more realistic and challenging contexts than prior work, with state-of-the-art models achieving no more than 34% pass@1 on code generation. Agentic tasks, especially those involving RTL reuse and verification, are particularly difficult. Evaluation relies on open-source tools and model-scoring infrastructure, with comprehension tasks assessed via BLEU and LLM-based judging. XYZ reveals substantial gaps in current model capabilities, underscoring the need for continued research toward robust, real-world hardware design automation.
Primary Area: datasets and benchmarks
Submission Number: 13630