Harry Foster waxes philosophical in a recent white paper from Siemens EDA, in this case on the origins of bugs and the best way to avoid them. Spoiler alert, the answer is not to make them in the first place or at least to flush them out very quickly. I’m not being cynical – that really is the answer though practice often falls short of ideal. Harry suggests we need to get back to basics in RTL design quality, and what better place to start than W. Edwards Deming, a founding father of Total Quality Management.
Quality must be designed in
This seems trite, but it’s often the simple mistakes that bite us, like an out-of-range indexing error. At best they slow down system-level testing; at worst they make it through to silicon. It’s easy to believe that we are mostly infallible and that what few mistakes we make will be caught in verification. But survey after survey shows that trivial mistakes still slip through, and we should know why: we left the mirage of exhaustive testing behind a long time ago.
Proving that intent is met
Since the method connects design and intent, the second step aims to prove in design that the intent is met. Harry’s suggestion here is to leverage static and formal verification tools in particular. We are designing quality in, so this is a task for RTL designers, who already have access to a wide range of apps to simplify this analysis. These apps can find FSM deadlocks, arithmetic overflow possibilities and potential indexing errors. For possible domain crossing bugs, they can find metastability potential and other domain crossing errors which in many cases cannot be detected at all in simulation. Another possible source of errors is X optimism and pessimism. The former may at least waste valuable time in system-level verification, and the latter can create mismatches between RTL and gate-level sims which even equivalence checking may not find.
Your system verification team will thank you. Or they may curse you if they find problems you could have fixed before you checked in your code.
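To make the FSM deadlock check mentioned above a little more concrete, here is a minimal sketch of the underlying idea in Python. It is not how any particular Siemens EDA app works, and it assumes a much simpler model than a real formal tool: the FSM is given as a hypothetical dictionary from each state to its possible next states, and a "deadlock" means only a state reachable from reset that can never move to a different state. The state names and the `EXAMPLE_FSM` table are invented for illustration.

```python
from collections import deque


def reachable_states(transitions, reset_state):
    """Breadth-first search over the state graph, starting from reset."""
    seen = {reset_state}
    queue = deque([reset_state])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def find_deadlock_states(transitions, reset_state):
    """Return reachable states that have no transition to any other state.

    Simplification (an assumption of this sketch): only single-state traps
    are reported, not larger groups of states that cannot be escaped.
    """
    deadlocks = set()
    for state in reachable_states(transitions, reset_state):
        successors = set(transitions.get(state, ())) - {state}
        if not successors:
            deadlocks.add(state)
    return deadlocks


# Hypothetical FSM, invented for this example.
EXAMPLE_FSM = {
    "IDLE": ["IDLE", "REQ"],
    "REQ": ["WAIT", "ERR"],
    "WAIT": ["WAIT", "DONE"],
    "DONE": ["IDLE"],
    "ERR": ["ERR"],     # no way out once entered
    "DEBUG": ["IDLE"],  # never reached from reset, so never reported
}

print(find_deadlock_states(EXAMPLE_FSM, "IDLE"))  # -> {'ERR'}
```

The point of the sketch is that the check explores the structure of the design rather than relying on a testbench happening to drive the FSM into `ERR`, which is why this class of bug is a natural fit for static and formal analysis at the designer’s desk.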
The third pillar requires that intent be protected through the rest of the design lifecycle by continued testing. Harry’s suggestion is to adopt a continuous integration (CI) flow here. We simply reuse the static and formal tests we developed and proved in design. These tests are fast and largely hands-free, so they should quickly flag check-in mistakes (we all make them).
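As a rough illustration of what such a CI gate might look like, the Python sketch below runs a list of named checks and produces a non-zero exit code if any of them fails. Everything in it is an assumption made for the example: the `run_checks` and `ci_exit_code` helpers, the check names, and the toy `index_always_in_range` check all stand in for whatever static and formal jobs a real flow would launch.

```python
import sys


def run_checks(checks):
    """Run each (name, check) pair and return the names that failed.

    A check is a zero-argument callable returning True on pass. An
    exception counts as a failure, so one broken check cannot stop the
    remaining checks from running.
    """
    failed = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


def ci_exit_code(checks):
    """Report failures on stderr and return 0 (pass) or 1 (fail)."""
    failed = run_checks(checks)
    for name in failed:
        print(f"FAIL: {name}", file=sys.stderr)
    return 1 if failed else 0


def index_always_in_range(index_width, depth):
    """Toy static check: can a `index_width`-bit index ever exceed `depth` entries?"""
    return 2 ** index_width <= depth


# Hypothetical check list; a real flow would wrap tool invocations instead.
EXAMPLE_CHECKS = [
    ("fifo_index_in_range", lambda: index_always_in_range(4, 16)),
    ("table_index_in_range", lambda: index_always_in_range(4, 12)),
]

# In a CI job the script would typically end with:
#     sys.exit(ci_exit_code(EXAMPLE_CHECKS))
```

With these invented numbers the second check fails, because a 4-bit index can reach 15 while the table has only 12 entries. That is the kind of out-of-range indexing slip a check-in gate is meant to catch before it reaches system-level testing.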
A final (blogger) thought
This is a worthy addition to the canon. We all nod wisely but we still trip up sometimes. With tools like CI we should be able to flush out more of these problems early on.
That said, there are some system-level problems which remain challenging, and which can’t be fixed (I think) at the unit level. Cache coherence problems, emerging only after billions of cycles, are one good example. Power bugs are difficult to cover fully in designs with very complex power and voltage switching. Security problems around speculative execution are another example. It would be great to find some kind of “unit test” methodologies around these system-level “IP”.
You can access the white paper HERE.