HDL-FixBench: A Verifiable Repository-Level Benchmark for Hardware Bug Repair

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Hardware Engineering, Large Language Model, Electronic Design Automation, Benchmark
TL;DR: We introduce HDL-FixBench, the first benchmark for evaluating LLMs on repository-level hardware bug repair tasks
Abstract: Existing benchmarks for hardware design primarily assess Large Language Models (LLMs) on isolated, component-level Hardware Description Language (HDL) code generation from specifications, overlooking the critical challenge of repository-scale bug repair. To address this gap, we introduce HDL-FixBench, the first benchmark for repository-level hardware bug repair. It comprises 57 high-fidelity instances curated from three industry-standard open-source hardware projects: OpenTitan, CVA6, and Ibex. Each instance is curated through a rigorous methodology, combining a novel agent-based filtering pipeline with meticulous manual verification, and is accompanied by a fully reproducible, containerized EDA environment to ensure task quality and relevance. Evaluating seven state-of-the-art LLMs with two prominent agent frameworks (SWE-Agent and OpenHands) on HDL-FixBench, we find that even the most advanced models perform significantly worse than on SWE-bench Verified, with the top-performing model resolving only 40.3% of tasks. This finding highlights the unique complexities of hardware engineering and establishes HDL-FixBench as a challenging and crucial benchmark for advancing the next generation of automated hardware design and verification tools.
Primary Area: datasets and benchmarks
Submission Number: 24163