hdl2v: A Code Translation Dataset for Enhanced LLM Verilog Generation

Charles Hong; Brendan Roberts; Huijae An; Alex Um; Advay Ratan; Sophia Shao

Published: 21 May 2025, Last Modified: 21 Jun 2025. Venue: MLArchSys 2025 Oral. License: CC BY 4.0

Presentation: In-Person

Keywords: Large language models, Verilog generation, Code translation

Presenter Full Name: Charles Hong

TL;DR: We translate VHDL, Chisel, and PyMTL3 code to Verilog to produce novel Verilog data for LLM fine-tuning, and investigate what makes one dataset better than another.

Presenter Email: charleshong@berkeley.edu

Abstract: Large language models (LLMs) play an increasingly large role in code generation, including hardware code generation, where Verilog is the key language. However, the amount of publicly available Verilog code pales in comparison to the amount of code available for software languages like Python. In this work, we present hdl2v ("HDL-to-Verilog"), a dataset that seeks to increase the amount of available human-written Verilog data by translating or compiling three other hardware description languages (VHDL, Chisel, and PyMTL3) to Verilog. Furthermore, we demonstrate the value of hdl2v in enhancing LLM Verilog generation by improving the performance of a 32-billion-parameter open-weight model by up to 23% (pass@10) on VerilogEvalV2, without using any data augmentation or knowledge distillation from larger models. We also show that hdl2v can boost the performance of a data augmentation-based fine-tuning approach by 63%. Finally, we characterize and analyze our dataset to better understand which characteristics of HDL-to-Verilog datasets can be expanded upon in future work for even better performance.
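
For readers unfamiliar with the pass@10 metric quoted in the abstract, the sketch below shows one common way such a number is computed. It assumes the widely used unbiased pass@k estimator for code generation benchmarks; the abstract does not state how pass@10 was computed for hdl2v, so treat this as illustrative only. The function names and the `(n, c)` input layout are hypothetical and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass the tests
    k: number of samples a user is assumed to draw
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k failing samples, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs (assumed layout)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, `mean_pass_at_k([(20, 0), (20, 20)], 10)` averages one problem that never passes with one that always passes.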

Presenter Bio: Charles is a rising fourth-year PhD student at the University of California, Berkeley, advised by Professor Sophia Shao. He is interested in the intersection of machine learning and computer architecture: both using hardware to accelerate machine learning and using machine learning (particularly LLMs) to accelerate hardware development.

Paper Checklist Guidelines: I certify that all co-authors have validated the presented results and conclusions, and have read and commit to adhering to the Paper Checklist Guidelines, Call for Papers and Publication Ethics.

YouTube Link: N/A

YouTube Link Poster: N/A

Dataset Release: I certify that all co-authors commit to release the dataset and necessary scripts to reproduce the presented results.

Google Slides: https://docs.google.com/presentation/d/18ELax9ouLoUT4i4u0digyzAUFrbdStWJwFOiQOQi9NE/edit?usp=sharing

Poster: Yes

Workshop Registration: Yes, the presenter has registered for the workshop.

YouTube Link Short: n/a

Submission Number: 11
