Keywords: prompt engineering, LLM assisted design, RTL design, benchmark
TL;DR: LLM-refined specs help LLM-based RTL design agents achieve high pass@5 rates (up to 96%) on the RTLLM and VerilogEval benchmarks.
Abstract: Large language models (LLMs) are increasingly employed to assist agile register-transfer-level (RTL) hardware design. RTL design is a labor-intensive stage in developing FPGA-based acceleration services or prototyping ASICs, and successful automation can greatly shorten the development cycle. However, benchmarks report a relatively low functional correctness rate (sometimes called accuracy) even when generating simple modules of fewer than 100 lines of Verilog code (LOC), calling into question the practicality of current LLMs for real-world designs. This paper argues that the low accuracy stems from low-quality descriptions used as prompts in both training datasets and benchmarks. First, the natural language descriptions (NLDs) do not contain all the semantics constrained by the testbenches (TBs), causing false negatives during verification. Second, existing automatically generated NLDs are usually too implementation-specific, making them suitable for neither training nor benchmarking. We designed tools to quantify the clarity and simplicity of benchmark cases, improve the quality of existing and future LLM-for-RTL datasets, and assist agile RTL designers in creating qualified specifications (specs, i.e., formatted and complete NLDs). We show experimentally that LLMs can create high-quality specs at low cost. Moreover, when equipped with these specs, general-purpose LLMs achieve a high pass@5 rate (up to 89% on RTLLM and 96% on VerilogEval-Human) without requiring expensive fine-tuning or post-generation self-fixing.
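The abstract reports pass@5 rates. For context, pass@k is commonly computed with the standard unbiased estimator (the abstract does not specify the estimator, so this is an assumption): given n generations per problem of which c pass the testbench, pass@k = 1 - C(n-c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n generations passes,
    given that c of the n generations passed verification."""
    if n - c < k:
        # Fewer than k failing generations: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per problem, 5 pass the testbench.
print(pass_at_k(10, 5, 5))  # 1 - C(5,5)/C(10,5) = 1 - 1/252
```

The benchmark-level score is then the mean of this quantity over all problems.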
Primary Area: datasets and benchmarks
Submission Number: 8428