Learning from In-House Datasets
by Bernard Murphy on 10-22-2025 at 6:00 am

At a DAC Accellera panel this year there was some discussion of cross-company collaboration in training. The theory is that more collaboration would mean a larger training set and therefore higher accuracy in GenAI (for example in RTL generation). But semiconductor companies are very protective of their data, and reports of copyrighted text being hacked out of chatbots do nothing to allay their concerns. Also, does the evidence support the premise that more mass training leads to more effective GenAI? GPT5 is estimated to have been trained on 70 trillion tokens versus 13 trillion for GPT4, yet GPT5 is generally viewed as unimpressive, certainly not a major advance on the previous generation. Maybe we need a different approach.

More training or better focused training?

A view gathering considerable momentum is that while LLMs do an excellent job in understanding natural language, domain-specific expertise is better learned from in-house data. While this data is obviously relevant, clearly there’s a lot less of it than in the datasets used to train big GenAI models. A more thoughtful approach is necessary to learn effectively from this constrained dataset.

Most/all approaches start with a pre-trained model (the “P” in GPT) since that already provides natural language understanding and a base of general knowledge. New methods add to this base through fine-tuning. Here I’ll touch on labeling and federated learning methods.
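
As a concrete (if simplified) illustration of that pattern, here is a minimal Python sketch using the Hugging Face transformers library: load a pre-trained checkpoint and continue training it on an in-house text corpus. The base model name, file path, and training settings are placeholders, not recommendations.

# Minimal sketch: continue training a pre-trained causal LM on an in-house corpus.
# The base checkpoint ("gpt2") and the data file path are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "gpt2"                                    # stand-in for any pre-trained model
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# In-house design notes, one example per line (hypothetical file)
ds = load_dataset("text", data_files={"train": "inhouse_design_notes.txt"})["train"]
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
            remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()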

Learning through labels

Labeling harks back to the early days of neural nets, where you provided training pictures of dogs labeled “dog” or perhaps with the breed of dog. The same intent applies here, except you are training on design data examples which you want a GenAI model to recognize/classify. Since manually labeling large design datasets would not be practical, recent innovation centers on semi-automated labeling assisted by LLMs.

Some large enterprises outsource this task to value-added service providers like Scale.com, who deploy large teams of experts using their internal tools to develop labeled training data, applying supervised fine-tuning (SFT) augmented by reinforcement learning from human feedback (RLHF). Something important to understand here is that labeling is GenAI-centric. You shouldn’t think of labels as tags on design data features but rather as fine-tuning inputs to the GenAI model (attention weights, etc.) generated from training question/answer (Q/A) pairs expressed in natural language, where answers include supporting explanations, perhaps augmented by content for retrieval-augmented generation (RAG).
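
To make that concrete, here is a minimal sketch of how an expert Q/A pair with a supporting explanation might be serialized into a chat-style SFT record; the question, answer, field names and file name are all illustrative.

# Minimal sketch: serialize expert Q/A pairs into chat-style SFT records (JSONL).
# The example question, answer, and file name are illustrative, not real data.
import json

qa_pairs = [{
    "question": "Why do the FIFO pointers crossing between clock domains use gray coding?",
    "answer": ("Gray coding limits each pointer update to a single bit change, so a "
               "synchronizer sampling mid-transition cannot capture an inconsistent "
               "multi-bit value."),
}]

with open("sft_train.jsonl", "w") as f:
    for qa in qa_pairs:
        record = {"messages": [
            {"role": "user", "content": qa["question"]},
            {"role": "assistant", "content": qa["answer"]},
        ]}
        f.write(json.dumps(record) + "\n")

Records like these then feed the same kind of fine-tuning loop sketched earlier, with RLHF layered on top as a separate preference-tuning stage.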

In EDA this is a very new field as far as I can tell. The topic comes up in some of the papers from the first International Conference on LLM-Aided Design (LAD) held this year at Stanford. One such paper works around the challenge of getting enough expert-generated Q/A pairs by generating synthetic pairs through LLM analysis of unlabeled but topic-appropriate documents (for example on clock domain crossings). This they augment with few-shot learning based on whatever human expert Q/A pairs they can gather.
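
A rough sketch of that synthetic generation step, assuming access to a general-purpose LLM API (here the OpenAI Python client with a placeholder model name; the prompt wording is invented and the parsing assumes the model returns well-formed JSON):

# Rough sketch: generate synthetic Q/A pairs from an unlabeled, topic-appropriate
# document. Model name, prompt wording, and JSON parsing are assumptions.
import json
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

def synthesize_qa(doc_text):
    prompt = ("You are a digital design expert. From the document below, write three "
              "question/answer pairs about clock domain crossings, each answer with a "
              "short supporting explanation. Return a JSON list of objects with "
              "'question' and 'answer' fields.\n\nDocument:\n" + doc_text)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)

# Synthetic pairs would then be mixed with the smaller set of expert-written
# (few-shot) pairs before fine-tuning.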

You could imagine using similar methods for labeling around other topics in design expertise: low-power design, secure design methods, optimizing synthesis, floorplanning methods and so on. While attention in the papers I have read tends to focus on using this added training to improve RTL generation, I can see more immediate value in verification, especially in static verification and automated design reviews.

Federated Learning

Maybe beyond some threshold more training data isn’t necessarily better, but the design data available within any single enterprise probably doesn’t yet suffer from that problem; more data could still help, if we could figure out how to combine learning from multiple enterprises without jeopardizing the security of each proprietary dataset. This is a common need across many domains where webcrawling for training data is not permitted (medical and defense data are two obvious examples).

Instead of bringing the data to the model for training, Federated Learning sends an initial model from a central site (the aggregator) to individual clients, each of which fine-tunes it in the conventional manner within its own secure environment. When training is complete, only the trained parameters are sent back to the aggregator, which harmonizes the inputs from all clients and then sends the refined model back to the clients. This process iterates, terminating when the central model converges.
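
A minimal FedAvg-style sketch of one such round is below; the local_finetune call and the client objects are hypothetical stand-ins, and the point is simply that only model parameters, never raw design data, travel back to the aggregator.

# Minimal FedAvg-style sketch of one federated round. `local_finetune` and the
# client objects are hypothetical; only model parameters leave each client site.
import copy

def aggregate(client_states, weights):
    # Weighted average of the client parameter dicts (the aggregator step).
    total = sum(weights)
    avg = copy.deepcopy(client_states[0])
    for name in avg:
        avg[name] = sum(w * s[name] for s, w in zip(client_states, weights)) / total
    return avg

def federated_round(global_model, clients):
    states, weights = [], []
    for client in clients:
        local = copy.deepcopy(global_model)           # aggregator sends the model out
        local_finetune(local, client.private_data)    # training stays on-site (hypothetical)
        states.append(local.state_dict())             # only parameters come back
        weights.append(len(client.private_data))
    global_model.load_state_dict(aggregate(states, weights))   # harmonize and redistribute
    return global_model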

There are commercial platforms for Federated Learning, as well as open-source options from some big names: TensorFlow Federated from Google and NVIDIA FLARE are two examples. Google Cloud and IBM Cloud offer Federated Learning support, while Microsoft supports open-source Federated Learning options within Azure.

This method could be quite effective in the semiconductor space if a central AI platform or consortium could be organized to manage the process. And if a critical mass of semiconductor vendors is prepared to buy in 😀.

Perhaps the way forward for learning in industries like ours will be through a combination of these methods – federated learning as a base layer to handle undifferentiated expertise and labeled learning for continued differentiation in more challenging aspects of design expertise. Definitely an area to watch!

Also Read:

PDF Solutions Calls for a Revolution in Semiconductor Collaboration at SEMICON West

The AI PC: A New Category Poised to Reignite the PC Market

Webinar – The Path to Smaller, Denser, and Faster with CPX, Samtec’s Co-Packaged Copper and Optics
