In verification there is an ever-popular question, “When can we stop verifying?” The intent behind the question is “when will we have found all the important bugs?” but the reality is that you stop verifying when you run out of time. Any unresolved bugs appear in errata lists delivered with the product (some running to 100 or more pages). Each bug is accompanied by a suggested application-software workaround and, commonly, a note that there is no plan for a fix. Workarounds may be acceptable for some applications but crippling for others, narrowing the market appeal and ROI of the product.
A better question, then, is how to catch more bug escapes (those bugs that make it into silicon and are documented as errata) in the time allocated for verification.
I’m not going to attempt an exhaustive list, but among deterministic digital root causes three stand out: bugs that are almost but not quite unreachable, bugs that are simply missed, and bugs resulting from incompleteness in the spec. The best defense against the first class is formal verification. Ross Dickson and Lance Tamura (both Product Management Directors at Cadence) suggest that defenses against the second and third classes are greatly strengthened by running more software testing against the pre-silicon design model.
Industry surveys indicate that 3 out of 4 designs today require at least one silicon respin. Running software on real silicon (even if not good silicon) seems like an easy way to run extensive software suites to catch most errata, right? That’s a slippery slope; respins are very expensive and reserved for what absolutely cannot be caught pre-silicon. And early silicon is not guaranteed to be functional enough to run all the software required to expose errata bugs. It is more practical and cost-effective to trap potential errata pre-silicon.
You might think that errata would commonly result from complex sequence problems. In fact many are surprisingly mundane, arising from simple two-factor or exception conditions. Here are a few cases I picked at random from openly available errata:
- When a receive FIFO overruns it becomes unrecoverable
- A bus interface hangs after master reset
- After disabling an FPU another related function remains in high power mode
I’m sure the verification teams for these products were diligent and tested everything they could imagine in the time they had available, but still these bugs escaped.
The spec class of issues is especially problematic. Specs are written in natural language, augmented by software test cases and more detailed requirements, all of which attempt to be as exact as possible but still have holes and ambiguities.
Software and hardware development teams work in parallel from the same specs but in different domains. The software team uses an ideal virtual model of the hardware to develop software which will eventually run on the target hardware. At the same time, the hardware team builds a hardware model to implement the spec. Both faithfully design and test against their reference. When they hit a place where the spec is incomplete, they must make a choice. Assume A or assume B, or both or neither, or maybe it doesn’t matter?
Unsurprisingly this doesn’t always work out well. Best case, the hardware and software teams have a meeting, ideally with architects, to make a joint decision. Sometimes, when the choice seems inconsequential and schedule pressure is high, a choice is made locally without wider consultation. A good example here is bit settings in configuration registers. Maybe a couple of these are presumed to be mutually exclusive, but the spec doesn’t define what should happen if they are both set or both cleared. The hardware team chooses a behavior which doesn’t match the software team’s expectation. Neither team catches the inconsistency when running against their own reference models. A problem emerges only when the real software runs on the real hardware.
For ChatGPT fans, ChatGPT and equivalents are not a solution. Natural language is inherently ambiguous and incomplete. To the extent that ChatGPT can identify a hole, it must still make a choice to plug that hole. Absent documented evidence, that choice will be random (best case) and may be inconsistent with customer requirements and use cases.
Granting that no solution can catch all bug-escapes, software-based testing is a pretty good augment to hardware-based testing. As an example, driver and firmware teams develop and debug their code against the virtual model to the point they can find no errors. When a hardware prototype becomes available, they run against that model to confirm that a task can be completed in X cycles, as expected. Routinely, they also catch bugs that escaped hardware verification.
OK, so the verification team adds more tests to catch those cases but there are two larger lessons here. Software testing is inherently complementary to hardware testing. The software team has no interest in debugging the hardware; they just want to make sure their own code is working. They test with a different mindset. This style of testing can and does often expose spec inconsistencies.
Escapes due to unexpected conditions involving two or more factors, or to exceptions, are more likely to be caught through extensive stress testing. Software-based testing is top-down, making it the easiest place to experiment with stress tweaks (such as throwing in a reset or overloading a buffer). This might require some hardware/software team collaboration, but nothing onerous.
More software-based testing needs hardware assist
You’re not going to run software-based testing on a hardware model without acceleration, and the more of it the better if you really want to minimize errata. You need prototyping to run as fast as possible, closely coupled with emulation to debug problems, both extended with virtual models for context. You should check out Cadence Protium, Palladium, and Helium. Cadence also offers demand-based cloud access to Palladium.