October 2, 2024October 14, 2024 by Bernard Murphy

Is AI-Based RTL Generation Ready for Prime Time?

by Bernard Murphy on 10-02-2024 at 6:00 am
Categories: AI, EDA

In semiconductor design there has been much fascination around the idea of using large language models (LLMs) for RTL generation; Copilot provides one example. Based on a Google Scholar scan, a little over 100 papers were published in 2023, jumping to 310 papers in 2024. This is not surprising. If it works, automating design creation could be a powerful way to make designers more productive (not to replace them, as some would claim). But we know that AI claims have a tendency to run ahead of reality in some areas. Where does RTL generation sit on this spectrum?

Benchmarking

The field has moved beyond the early enthusiasm of existence proofs (“look at the RTL my generator built”) to somewhat more robust analysis. A good example is a paper published very recently on arXiv: Revisiting VerilogEval: Newer LLMs, In-Context Learning, and Specification-to-RTL Tasks, with a majority of authors from Nvidia and one author from Cornell. A pretty authoritative source.

The authors have extended a benchmark (VerilogEval) they built in 2023 to evaluate LLM-based Verilog generators. The original work studied code completion tasks; in this paper they go further to include generating block RTL from natural language specifications. They also describe a mechanism for prompt tuning through in-context learning (additional guidance in the prompt). Importantly, for both completion and spec-to-RTL tasks they provide a method to classify failures by type, which I think could be helpful to guide prompt tuning.

Although there is no mention of simulation testbenches, the authors clearly used a simulator (Icarus Verilog) and talk about Verilog compile-time and run-time errors, so I assume the benchmark suite contains human-developed testbenches for each test.
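
To make the simulate-and-classify idea concrete, here is a minimal sketch of how one test might be run with Icarus Verilog and sorted into a failure bucket. This is not the paper's harness. The file names, the `PASS` marker printed by the testbench, and the four bucket names are assumptions made for illustration; only the `iverilog -o` compile step and the `vvp` run step are standard Icarus Verilog usage.

```python
import subprocess

def iverilog_commands(dut, testbench, out="sim.vvp"):
    """Build the compile and run commands for Icarus Verilog."""
    return [["iverilog", "-o", out, dut, testbench], ["vvp", out]]

def classify(compile_rc, sim_rc, sim_output, pass_marker="PASS"):
    """Sort one test result into a coarse failure bucket.

    The bucket names and the pass_marker convention are hypothetical;
    a real harness would likely inspect the tool messages in more detail.
    """
    if compile_rc != 0:
        return "compile_error"
    if sim_rc != 0:
        return "runtime_error"
    return "pass" if pass_marker in sim_output else "functional_mismatch"

def run_test(dut, testbench):
    """Compile and simulate one generated design against its testbench."""
    compile_cmd, sim_cmd = iverilog_commands(dut, testbench)
    compiled = subprocess.run(compile_cmd, capture_output=True, text=True)
    if compiled.returncode != 0:
        return classify(compiled.returncode, 0, "")
    sim = subprocess.run(sim_cmd, capture_output=True, text=True)
    return classify(0, sim.returncode, sim.stdout)
```

Calling `run_test("generated.v", "tb.v")` requires Icarus Verilog to be installed; `classify` can be exercised on its own with recorded return codes and output.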

Analysis

The authors compare performance across a wide range of LLMs, from GPT-4 models to Mistral, Llama, CodeGemma, DeepSeek Coder and RTLCoder DeepSeek. Small point of initial confusion for this engineer/physicist: they talk about temperature settings in a few places. This is a randomization factor for LLMs, nothing to do with physical temperature.
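
For readers who share that confusion, a small sketch may help. It assumes the usual formulation in which temperature divides the model's raw token scores before a softmax. The scores below are made up, and the paper's actual sampling setup is not described here.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)  # nearly always picks the top token
hot = softmax_with_temperature(logits, 2.0)   # spreads probability more evenly
```

At low temperature the top-scoring token dominates, so repeated runs give nearly identical output; at higher temperature the alternatives are sampled more often.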

First, a little background on scoring generated code. The usual method to measure machine-generated text is a score called BLEU (Bilingual Evaluation Understudy), intended to correlate with human-judged measures of quality/similarity. While appropriate for natural language translations, BLEU is not ideal for measuring code generation, since two pieces of code can share most of their text and still behave differently. Functional correctness, as measured in simulation, is a better starting point.
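
A toy illustration of the problem, using clipped n-gram precision, which is the building block of BLEU rather than the full metric (no brevity penalty, no averaging across n-gram orders). The two Verilog lines are invented for this example.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of candidate tokens against reference tokens."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Count each candidate n-gram at most as often as it appears in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

reference = "assign y = a & b ;".split()  # AND gate
candidate = "assign y = a | b ;".split()  # OR gate: functionally wrong
unigram = ngram_precision(candidate, reference, 1)  # 6 of 7 tokens match
bigram = ngram_precision(candidate, reference, 2)   # 4 of 6 bigrams match
```

The candidate scores well on text overlap even though it implements a different function, which a simulation against a testbench would catch immediately.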

The graphs/tables in the paper measure pass rate against a benchmark suite of tests, allowing one RTL generation attempt per test (pass@1), so no allowance for iterated improvement except in 1-shot refinement over 0-shot. 0-shot measures generation from an initial prompt and 1-shot measures generation from the initial prompt augmented with further guidance. The parameter ‘n’ in the tables is a wrinkle to manage variance in this estimate – higher n, lower variance.
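
The role of n is easier to see in code. The sketch below assumes the unbiased pass@k estimator commonly used for code-generation benchmarks, where n samples are generated per problem and c of them pass; the function names and the sample counts are illustrative and not taken from the paper.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k samples passes,
    given that c of n generated samples passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws with failures only
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_rate(results, k=1):
    """Average pass@k over problems; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Three hypothetical problems, 20 samples each, with 5, 0 and 20 passing.
rate = benchmark_pass_rate([(20, 5), (20, 0), (20, 20)])
```

For k = 1 the estimator reduces to c/n, the fraction of samples that pass, so a larger n simply averages over more attempts per problem and gives a steadier estimate.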

Quality, measured through test pass rates within the benchmark suite, ranges from below 10% to as high as 60% in some cases. Unsurprisingly, smaller LLMs don’t do as well as bigger models. Best rates are for GPT-4 Turbo with ~1T parameters and Llama 3.1 with 405B parameters. Within any given model, success rates for code completion and spec-to-RTL tests are roughly comparable. In many cases in-context learning/refined prompts improve quality, though for GPT-4 Turbo spec-to-RTL and Llama3 70B prompt engineering actually degrades quality.

Takeaways

Whether for code completion or spec-to-RTL, these accuracy rates suggest that RTL code generation is still a work in progress. I would be curious to know how an entry-level RTL designer would perform against these standards.

Also, in this paper I see no mention of tests for synthesizability or PPA. (A different though smaller benchmark, RTLLM, also looks at these factors, where I think PPA is determined in physical synthesis; again, it is short on details.)

More generally, I also wonder about readability and debuggability. Perhaps here some modified version of the BLEU metric versus expert-generated code might be useful as a supplement to these scores.

Nevertheless, interesting to see how this area is progressing.


