Keywords: large language models, memorization, retrieval-augmented generation, compositional reasoning, FPGA timing closure, electronic design automation, deep generative models, chain-of-thought prompting, regime attribution, static timing analysis
TL;DR: We show that LLMs that "solve" FPGA timing violations mostly pattern-match from training data rather than reason, and we build a diagnostic framework that identifies which mode (memorization, retrieval, or compositional reasoning) drives each answer.
Abstract: Large language models (LLMs) have recently been applied to electronic design automation (EDA), yet a fundamental question remains open: when an LLM successfully diagnoses a timing violation or proposes a constraint fix, is it memorizing a previously seen pattern, retrieving externally grounded knowledge, or reasoning compositionally over novel problem structure? The answer determines how reliable, generalizable, and trustworthy such models are in safety-critical hardware design flows. We present a systematic empirical study that uses TimingLLM, an LLM-plus-RAG framework for FPGA static timing analysis and automated timing closure, as a controlled testbed for probing these three regimes. Through controlled ablations across 658 timing violations spanning 12 industrial-scale FPGA designs, we find that: (1) memorization accounts for approximately 68% of correct diagnoses on high-prevalence violation types under our attribution metric; (2) retrieval-augmented grounding is essential for rare but consequential violations, recovering the 29 F1 points lost when retrieval is ablated; and (3) compositional reasoning emerges only in multi-constraint scenarios, where chain-of-thought prompting improves fix success rate by 31% over retrieval alone at depth k=4. We introduce the Timing Reasoning Spectrum (TRS), a formal taxonomy and evaluation benchmark for characterizing LLM reasoning depth in structured, domain-specific workflows, and propose it as a standard diagnostic tool for future work on LLMs in scientific and engineering discovery.
Submission Number: 4