Learning from In-House Datasets
by Bernard Murphy on 10-22-2025 at 6:00 am

At a DAC Accellera panel this year there was some discussion of cross-company collaboration in training. The theory is that more collaboration would mean a larger training set and therefore higher accuracy in GenAI (for example in RTL generation). But semiconductor companies are very protective of their data, and reports of copyrighted text being hacked out of chatbots do nothing to allay their concerns. Also, does the evidence support the premise that more mass training leads to more effective GenAI? GPT5 is estimated to have been trained on 70 trillion tokens versus 13 trillion for GPT4, yet GPT5 is generally viewed as unimpressive, certainly not a major advance on the previous generation. Maybe we need a different approach.

More training or better focused training?

A view gathering considerable momentum is that while LLMs do an excellent job in understanding natural language, domain-specific expertise is better learned from in-house data. While this data is obviously relevant, clearly there’s a lot less of it than in the datasets used to train big GenAI models. A more thoughtful approach is necessary to learn effectively from this constrained dataset.

Most/all approaches start with a pre-trained model (the “P” in GPT) since that already provides natural language understanding and a base of general knowledge. New methods add to this base through fine-tuning. Here I’ll touch on labeling and federated learning methods.
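
As a concrete (if simplified) illustration of that pattern, here is a minimal Python sketch using the Hugging Face transformers library: load a pre-trained checkpoint and continue training it on an in-house text corpus. The base model name, file path, and training settings are placeholders, not recommendations.

# Minimal sketch: continue training a pre-trained causal LM on an in-house corpus.
# The base checkpoint ("gpt2") and the data file path are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "gpt2"                                    # stand-in for any pre-trained model
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# In-house design notes, one example per line (hypothetical file)
ds = load_dataset("text", data_files={"train": "inhouse_design_notes.txt"})["train"]
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
            remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()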

Learning through labels

Labeling harks back to the early days of neural nets, where you provided training pictures of dogs labeled “dog” or perhaps with the breed of dog. The same intent applies here, except you are training on design data examples which you want a GenAI model to recognize/classify. Since manually labeling large design datasets would not be practical, recent innovation centers on semi-automated labeling assisted by LLMs.

Some large enterprises outsource this task to value-added service providers like Scale.com, who deploy large teams of experts using their internal tools to develop labeled training data, applying supervised fine-tuning (SFT) augmented by reinforcement learning from human feedback (RLHF). Something important to understand here is that labeling is GenAI-centric. You shouldn’t think of labels as tags on design data features but rather as fine-tuning inputs to the GenAI model (attention weights, etc.) generated from training question/answer (Q/A) pairs expressed in natural language, where answers include supporting explanations, perhaps augmented by content for retrieval-augmented generation (RAG).
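
To make that concrete, here is a minimal sketch of how an expert Q/A pair with a supporting explanation might be serialized into a chat-style SFT record; the question, answer, field names and file name are all illustrative.

# Minimal sketch: serialize expert Q/A pairs into chat-style SFT records (JSONL).
# The example question, answer, and file name are illustrative, not real data.
import json

qa_pairs = [{
    "question": "Why do the FIFO pointers crossing between clock domains use gray coding?",
    "answer": ("Gray coding limits each pointer update to a single bit change, so a "
               "synchronizer sampling mid-transition cannot capture an inconsistent "
               "multi-bit value."),
}]

with open("sft_train.jsonl", "w") as f:
    for qa in qa_pairs:
        record = {"messages": [
            {"role": "user", "content": qa["question"]},
            {"role": "assistant", "content": qa["answer"]},
        ]}
        f.write(json.dumps(record) + "\n")

Records like these then feed the same kind of fine-tuning loop sketched earlier, with RLHF layered on top as a separate preference-tuning stage.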

In EDA this is a very new field as far as I can tell. The topic comes up in some of the papers from the first International Conference on LLM-Aided Design (LAD) held this year at Stanford. One such paper works around the challenge of getting enough expert-generated Q/A pairs by generating synthetic pairs through LLM analysis of unlabeled but topic-appropriate documents (for example on clock domain crossings). This they augment with few-shot learning based on whatever human expert Q/A pairs they can gather.
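
A rough sketch of that synthetic generation step, assuming access to a general-purpose LLM API (here the OpenAI Python client with a placeholder model name; the prompt wording is invented and the parsing assumes the model returns well-formed JSON):

# Rough sketch: generate synthetic Q/A pairs from an unlabeled, topic-appropriate
# document. Model name, prompt wording, and JSON parsing are assumptions.
import json
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

def synthesize_qa(doc_text):
    prompt = ("You are a digital design expert. From the document below, write three "
              "question/answer pairs about clock domain crossings, each answer with a "
              "short supporting explanation. Return a JSON list of objects with "
              "'question' and 'answer' fields.\n\nDocument:\n" + doc_text)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)

# Synthetic pairs would then be mixed with the smaller set of expert-written
# (few-shot) pairs before fine-tuning.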

You could imagine using similar methods for labeling around other topics in design expertise: low-power design, secure design methods, optimizing synthesis, floorplanning methods and so on. While attention in the papers I have read tends to focus on using this added training to improve RTL generation, I can see more immediate value in verification, especially in static verification and automated design reviews.

Federated Learning

Maybe beyond some threshold more training data isn’t necessarily better, but the design data available within any single enterprise probably doesn’t yet suffer from that problem; more data could still help, if we could figure out how to combine learning from multiple enterprises without jeopardizing the security of each proprietary dataset. This is a common need across many domains where webcrawling for training data is not permitted (medical and defense data are two obvious examples).

Instead of bringing the data to the model for training, Federated Learning sends an initial model from a central site (the aggregator) to individual clients, each of which fine-tunes it in the conventional manner within its own secure environment. When training is complete, only the trained parameters are sent back to the aggregator, which harmonizes the inputs from all clients and then sends the refined model back to the clients. This process iterates, terminating when the central model converges.
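
A minimal FedAvg-style sketch of one such round is below; the local_finetune call and the client objects are hypothetical stand-ins, and the point is simply that only model parameters, never raw design data, travel back to the aggregator.

# Minimal FedAvg-style sketch of one federated round. `local_finetune` and the
# client objects are hypothetical; only model parameters leave each client site.
import copy

def aggregate(client_states, weights):
    # Weighted average of the client parameter dicts (the aggregator step).
    total = sum(weights)
    avg = copy.deepcopy(client_states[0])
    for name in avg:
        avg[name] = sum(w * s[name] for s, w in zip(client_states, weights)) / total
    return avg

def federated_round(global_model, clients):
    states, weights = [], []
    for client in clients:
        local = copy.deepcopy(global_model)           # aggregator sends the model out
        local_finetune(local, client.private_data)    # training stays on-site (hypothetical)
        states.append(local.state_dict())             # only parameters come back
        weights.append(len(client.private_data))
    global_model.load_state_dict(aggregate(states, weights))   # harmonize and redistribute
    return global_model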

There are commercial platforms for Federated Learning, as well as open-source options from some big names: TensorFlow Federated from Google and NVIDIA FLARE are two examples. Google Cloud and IBM Cloud offer Federated Learning support, while Microsoft supports open-source Federated Learning options within Azure.

This method could be quite effective in the semiconductor space if a central AI platform or consortium could be organized to manage the process. And if a critical mass of semiconductor vendors is prepared to buy in 😀.

Perhaps the way forward for learning in industries like ours will be through a combination of these methods – federated learning as a base layer to handle undifferentiated expertise and labeled learning for continued differentiation in more challenging aspects of design expertise. Definitely an area to watch!

Also Read:

PDF Solutions Calls for a Revolution in Semiconductor Collaboration at SEMICON West

The AI PC: A New Category Poised to Reignite the PC Market

Webinar – The Path to Smaller, Denser, and Faster with CPX, Samtec’s Co-Packaged Copper and Optics
