As an example of the need for real-world reliability metrics, consider a modern automobile. We can already buy a car with parking assistance, collision avoidance, autonomous braking and adaptive cruise control. These features depend on video image processing, which requires high-performance SoC components that will almost certainly contain multiple clock domains. How will the synchronizers used in the clock domain crossings (CDCs) be qualified for this safety-critical service? Will it be by MTBF (Mean Time Between Failures) and, if so, what is the minimum MTBF required for an SoC that is installed in a million automobiles? If not MTBF, exactly what metric is appropriate to ensure safety?
As we have seen in a previous blog post, data or control signals cannot be passed between two clock domains with unrelated phase or frequency in a way that completely avoids synchronizer failures. Thus, it is essential to determine the probability of failure of a system and confirm that it is low enough to be an acceptable risk to the user. MTBF does not explicitly take into account the number of units in the field or the length of time a unit is in service. Another common reliability metric, FIT (Failures In Time), is the inverse of MTBF (when both have consistent units). Hence FIT, a metric that is usually expressed as failures per billion hours, is also inadequate.
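To make the "consistent units" remark concrete, here is a minimal Python sketch of the conversion, assuming MTBF is expressed in hours and FIT in failures per billion device-hours. The function names are hypothetical and chosen only for illustration.

```python
# Assumption: MTBF in hours, FIT in failures per 1e9 device-hours.
HOURS_PER_FIT_WINDOW = 1e9


def mtbf_hours_to_fit(mtbf_hours: float) -> float:
    """Convert an MTBF in hours to a FIT rate."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return HOURS_PER_FIT_WINDOW / mtbf_hours


def fit_to_mtbf_hours(fit: float) -> float:
    """Convert a FIT rate back to an MTBF in hours."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return HOURS_PER_FIT_WINDOW / fit
```

Neither function takes the number of units or the service life as an input, which is exactly the limitation described above: both metrics describe a failure rate, not the risk carried by a whole fleet.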
A measure that determines the probability that all units in the field perform safely throughout their lifetimes seems more prudent for safety-critical applications. Such a metric can be called Pr(safe). In the case of N units in the field with an average lifetime of L, and assuming failures occur independently at a constant rate of 1/MTBF, we can say

$$\Pr(\text{safe}) = e^{-N L / \mathrm{MTBF}}$$
where the MTBF is calculated for all CDCs in the SoC. This seems simple enough and is an improvement over MTBF or FIT.
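As a rough illustration, the sketch below evaluates Pr(safe) under the assumption that failures are independent and occur at a constant rate of 1/MTBF, so that Pr(safe) = exp(-N·L/MTBF). It also inverts that relation to estimate the MTBF needed for a target Pr(safe). The function names, the fleet size, the operating lifetime and the target value are all assumptions made for the example, not figures from this post.

```python
import math


def pr_safe(mtbf_hours: float, units: float, lifetime_hours: float) -> float:
    """Probability that no unit in the fleet fails during its lifetime.

    Assumes independent failures at a constant rate of 1/MTBF.
    """
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return math.exp(-units * lifetime_hours / mtbf_hours)


def required_mtbf_hours(target: float, units: float, lifetime_hours: float) -> float:
    """MTBF needed to reach a target Pr(safe) under the same assumptions."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return -units * lifetime_hours / math.log(target)


if __name__ == "__main__":
    # Hypothetical fleet: one million cars, 5,000 operating hours each.
    fleet_units = 1_000_000
    operating_hours = 5_000.0
    mtbf = required_mtbf_hours(0.999, fleet_units, operating_hours)
    print(f"Required MTBF: {mtbf:.3e} hours")
    print(f"Check Pr(safe): {pr_safe(mtbf, fleet_units, operating_hours):.6f}")
```

With these illustrative inputs the required MTBF comes out at roughly 5e12 hours, which suggests how demanding a fleet-level target can be compared with a per-unit MTBF that merely sounds large.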
There are, however, other considerations that must be discussed for high-volume consumer products such as automobiles. Does the variability in settling time-constants, both within a chip and between chips, need to be considered? What about the variation in the lifetime of an automobile? To first order, the mean settling time-constant and mean lifetime produce good estimates of Pr(safe) so long as the number of units is large. In contrast, for low-volume products, some safety margin must be included to allow for unfortunate sampling of the random distribution of transistor threshold voltages, an important determinant of synchronizer reliability. It is important to remember that CDC errors due to metastability can occur at any time, whether immediately after fab or not until the end of a product’s life. Simulation is the only way to anticipate these rare occurrences.
Since it is not possible to eliminate all synchronizer failures, it is essential to mitigate the effects of such failures. There are well-known techniques to accommodate the one-clock-cycle uncertainty that accompanies synchronization in a CDC. On the other hand, the uncertainty that occurs when an invalid logic level is delivered to more than one destination is much more problematic. Such a situation can lead to invalid sequences or states with unpredictable consequences. As the fan-out of the invalid logic level grows, the complexity of anticipating all possible cases grows exponentially. Thus, it is best to reduce CDC errors first by careful synchronizer design. Doing so requires accurate characterization of the synchronizers to understand how well, or how poorly, they perform. Once good synchronizer design is complete, one can then work on ways to mitigate the effects of the CDC errors that are known and understood. The goal is that, on those extremely rare occasions when these errors do occur, there will be no risk of bodily harm.
During verification and before sign-off, the reliability metric Pr(safe) can be revisited to estimate how well the final design will perform in real-world production circumstances.