I wrote earlier about how deep expertise, say for high-quality RTL design or verification, must be extracted from in-house know-how and datasets. In general, such methods start with one of many possible pre-trained models (GPT, Llama, Gemini, etc.). To this base, consultants or in-house teams add fine-tuning, initially through supervised fine-tuning (SFT), refined through reinforcement learning from human feedback (RLHF), and subsequently enhanced and maintained through iterative refinement. ChatGPT claims this is the dominant flow (I'm inclined to think it is pretty accurate in its own domain). Supervision is through labeling (question/answer pairs). In most cases relying on human labeling alone is too expensive, so we must learn how to automate this step.
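To make the labeling step concrete, here is a minimal sketch, purely my own illustration, of a couple of question/answer labels written out in the chat-style JSONL format many fine-tuning services accept. The example questions, file name, and field layout are assumptions for illustration, not anything from the paper discussed below.

```python
# Minimal sketch: question/answer labels serialized as chat-style JSONL for SFT.
# File name, field names, and example Q/A content are illustrative assumptions.
import json

qa_pairs = [
    {
        "question": "What does the always_ff construct indicate in SystemVerilog?",
        "answer": "That the block models sequential (flip-flop) logic.",
    },
    {
        "question": "What is the purpose of an assertion in a verification testbench?",
        "answer": "It checks that a specified design property holds during simulation or formal analysis.",
    },
]

with open("sft_train.jsonl", "w") as f:
    for pair in qa_pairs:
        record = {
            "messages": [
                {"role": "user", "content": pair["question"]},
                {"role": "assistant", "content": pair["answer"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```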

A nice example of SFT from Microsoft
This Microsoft paper studies two different methods to fine-tune a pre-trained model (GPT-4), adding expertise on recent sporting events. The emphasis in the paper is on the SFT step rather than the steps that follow. Before you stop reading because this isn't directly relevant to your interests, consider that I can find no industry-authored papers on fine-tuning for EDA. I know from a comment at a recent conference that Microsoft hardware groups are labeling design data, so I suspect topics like this may be a safe proxy for publishing research in areas relevant to internal proprietary work.
Given the topic tested in the study, the authors chose to fine-tune with data sources (here Wikipedia articles) added after the training cutoff for the pre-trained model, in this case September 2021. They looked at two approaches to fine-tuning on this corpus, one token-based and one fact-based.
The token-based method for label generation is very simple and, per the paper, mirrors standard practice. Here the authors seed with a manually generated label based on the article's overview section and prompt the model to generate a bounded set of labels from the article. The second method (which they call fact-based) is similar, except that it prompts the model to break complex sentences down where needed into multiple atomic labels. The authors also allowed for some filtering in this case to remove facts irrelevant to the purpose of the study. Here also the model was asked to generate multiple unique labels.
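To show the mechanics, here is a rough sketch of how the two label-generation strategies might look as prompt templates. The wording is my paraphrase of the paper's description, not the authors' actual prompts, and the build_prompt helper is purely illustrative; the resulting prompt would be sent to whatever chat-completion client you already use.

```python
# Sketch of the two label-generation strategies described in the paper.
# Prompt wording is a paraphrase, not the authors' actual prompts.

TOKEN_BASED_PROMPT = """You are generating fine-tuning labels.
Using the seed example as a style guide, produce up to {n} question/answer
pairs covering the content of the article below.

Seed example (from the article overview):
Q: {seed_question}
A: {seed_answer}

Article:
{article}
"""

FACT_BASED_PROMPT = """You are generating fine-tuning labels.
Step 1: Break the article below into atomic facts, splitting complex
sentences into multiple simple statements where needed.
Step 2: Discard facts irrelevant to {topic}.
Step 3: For each remaining fact, write one unique question/answer pair.

Article:
{article}
"""

def build_prompt(article: str, mode: str = "fact", topic: str = "recent sporting events",
                 seed_question: str = "", seed_answer: str = "", n: int = 20) -> str:
    """Assemble a label-generation prompt for one article (token- or fact-based)."""
    if mode == "fact":
        return FACT_BASED_PROMPT.format(article=article, topic=topic)
    return TOKEN_BASED_PROMPT.format(
        article=article, n=n, seed_question=seed_question, seed_answer=seed_answer
    )
```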
The paper describes training trials, run in each case on the full set of generated labels and also on subsets to gauge sensitivity to training sample size. Answers are validated using the same model running a test prompt (like a test for a student), allowing only pass/fail responses.
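A pass/fail grading step along these lines could look something like the sketch below. The grader prompt wording and the pass_rate helper are my illustration; the judge model is passed in as a plain callable, so no particular API is assumed.

```python
# Sketch of pass/fail validation: a judge model grades each answer against
# the reference label and we report the pass rate. Prompt wording is illustrative.

GRADER_PROMPT = """You are grading a student's answer.
Question: {question}
Reference answer: {reference}
Student answer: {candidate}
Reply with exactly one word: PASS or FAIL."""

def pass_rate(responses, judge):
    """Fraction of responses the judge marks PASS.

    Each response is a dict with 'question', 'reference', and 'candidate' keys;
    `judge` is any callable that takes a prompt string and returns the reply text.
    """
    passed = sum(
        judge(GRADER_PROMPT.format(**r)).strip().upper().startswith("PASS")
        for r in responses
    )
    return passed / len(responses)
```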
The authors compare the accuracy of results across a variety of categories for the untuned pre-trained model, their range of scaled fine-tuned options, and RAG over the same sections used in fine-tuning, implemented with Azure OpenAI hybrid search. They conclude that while token-based training does increase accuracy over the untuned model, it is not as uniform in coverage as fact-based training.
Overall they find that SFT significantly improves performance over the base pre-trained model within the domain of the added training. In this study RAG outperforms both methods, though SFT comes close to RAG performance.
I don’t find these conclusions entirely surprising. Breaking complex sentences down into individual atomic labels feels like it should increase coverage compared with learning from the complex sentences directly. And neither method should be quite as good as vector-based search (more global similarity measures), which could catch inferences spanning multiple statements.
Caveats and takeaway
Fine-tuning is clearly still a very dynamic field, judging by recommended papers from Deep Research in Gemini and ChatGPT, complemented by my own traditional research (Google Scholar for example, where I found this paper). There is discussion of synthetic labeling, though also concern that this method can lead to significant errors without detailed human review.
One paper discusses how adding a relatively small set (around 1,000) of carefully curated human-generated labels can be much more effective for performance than large quantities of unlabeled or synthetically labeled training data.
There is also concern that under some circumstances fine-tuning could break capabilities in the pre-trained model (this is known as catastrophic forgetting).
My takeaway is that it is possible to enhance a pre-trained model with domain training data and modest training prompts and get significantly better response accuracy than the pre-trained model alone could provide. However, expert review is important to build confidence in the enhanced model, and it is clear that 100% model accuracy is still an aspirational goal.
Also Read:
Lessons from the DeepChip Wars: What a Decade-old Debate Teaches Us About Tech Evolution
TSMC Kumamoto: Pioneering Japan’s Semiconductor Revival
AI RTL Generation versus AI RTL Verification