How should we assess the risk of harmful metastability in a clock domain crossing (CDC) when the semiconductor process has significant parameter variability? One possibility is to determine the MTBF of a synchronizer at the worst-case corner of the CDC. But that approach has some conflicting complications:
- Synchronizer failures can occur at any time before or after the MTBF.
- Most chips in a wafer perform better than they do at the worst-case corner.
- High-volume, safety-critical products should be held to a high standard.
- The worst-case environment for a CDC may be rare in actual use.
An alternative that has received some recent attention is a Monte Carlo simulation, one that randomly varies PVT conditions over the range expected for the product. Instead of an estimate of MTBF, this approach leads to an estimate of the probability of a metastability-induced synchronizer failure, assessed over the expected distribution of parameters and conditions. However, in the early stages of design, careful investigation of even a few alternatives this way requires an impractical level of effort.
These thoughts led me to investigate how variability in the transistor threshold voltage affects the metastability settling time-constant. This approach bypasses the extensive burden of Monte Carlo simulation, but still provides an increase in understanding of the effects of parameter variability.
I chose to study the variation in threshold voltage because it has a major effect on the settling time-constant and is a classic example of a Gaussian distribution. This investigation led me to wonder what happens when the cross-tied transistors’ threshold voltages are at an extreme value of that distribution and in the vicinity of the metastability voltage V[SUB]m[/SUB]. If the simple theoretical model of a strongly inverted transistor holds, the settling time-constant would be infinite; not a good thing for a synchronizer’s MTBF. But that’s theory. Can reality be different?
To anticipate reality, we used a benchmark circuit, PublicSync, to simulate synchronizer behavior. As you can see in the above figure, the settling time-constant grows significantly as the transistor threshold moves toward and above the metastability voltage V[SUB]m[/SUB], and it does so smoothly, without a singularity at V[SUB]m[/SUB]. These results were obtained by the analysis tool, MetaACE, using an automated scan of V[SUB]t[/SUB]. By fitting a curve to the data points it was possible to calculate the probability of a synchronizer failure, Pr(fail), given the mean and standard deviation of V[SUB]t[/SUB]. The details of this calculation can be found here.
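As a rough illustration of that kind of calculation, here is a minimal Python sketch that averages a normalized failure probability, exp(-t[SUB]s[/SUB]/τ), over a Gaussian distribution of V[SUB]t[/SUB]. The function `tau_of_vt` and all of its parameter values are hypothetical placeholders, not the curve fitted to the MetaACE data, and the names `p_fail_at` and `p_fail_over_vt` are my own. Treat it as a sketch under those assumptions rather than the calculation actually used for the results discussed here.

```python
import math

def tau_of_vt(vt, tau0=20e-12, vt_ref=0.30, scale=0.10):
    """Hypothetical smooth fit: tau (seconds) grows as Vt (volts) rises.

    The exponential form and every default value are assumptions made
    for illustration; they are not measured or simulated data.
    """
    return tau0 * math.exp((vt - vt_ref) / scale)

def p_fail_at(vt, ts):
    """Normalized failure probability exp(-ts/tau) at a single Vt."""
    return math.exp(-ts / tau_of_vt(vt))

def gaussian_pdf(x, mean, sigma):
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def p_fail_over_vt(mean, sigma, ts, span=6.0, steps=2000):
    """Average p_fail_at over Vt ~ N(mean, sigma) with the trapezoid rule.

    The integral is truncated at +/- span standard deviations.
    """
    lo = mean - span * sigma
    dx = 2.0 * span * sigma / steps
    total = 0.0
    for i in range(steps + 1):
        vt = lo + i * dx
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * gaussian_pdf(vt, mean, sigma) * p_fail_at(vt, ts)
    return total * dx

# Hypothetical operating point: mean Vt 0.30 V, sigma 20 mV, ts = 1 ns.
ts = 1e-9
nominal = p_fail_at(0.30, ts)
averaged = p_fail_over_vt(mean=0.30, sigma=0.02, ts=ts)
print(f"at the mean Vt:          {nominal:.3e}")
print(f"averaged over Vt spread: {averaged:.3e}")
```

With this placeholder fit, the average is dominated by the high-V[SUB]t[/SUB] tail of the distribution, which is why the shape of the fitted curve and the assumed standard deviation matter so much to the result.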
The table shows the ratio of two normalized calculations of the probability of failure:
- p[SUB]wc[/SUB](fail, t[SUB]s[/SUB]): calculated assuming worst-case (wc) conditions for V[SUB]t[/SUB] and with an allowed settling time t[SUB]s[/SUB]
- p[SUB]vt[/SUB](fail, t[SUB]s[/SUB]): calculated assuming a distribution of V[SUB]t[/SUB] and an allowed settling time t[SUB]s[/SUB]
This ratio of probabilities shows how failures are overestimated by the worst-case (wc) measure as compared with the varying threshold (vt) measure. The wider the distribution of threshold voltage and the longer the allowed settling time (t[SUB]s[/SUB]), the more this discrepancy grows. For a 500 MHz clock, the right-hand column would correspond to a two-stage synchronizer and a latency of 4 ns. For example, for some unsurprising safety-critical product conditions and a standard deviation of 20 mV, the wc measure suggests an extra but unnecessary synchronizer stage, with its accompanying added latency.
So the takeaway message for me is: calculate the probability of failure, Pr(fail), and not the worst-case MTBF. Such a probability-based measure of risk should include the number of units in use, the unit lifetime, and the distribution of semiconductor parameters such as transistor threshold voltage. Pr(fail) also avoids the misleading tendency to associate the MTBF with a failure-free period, a mistake that many make and one that masks the real possibility of failures occurring at any time during a product’s lifetime.
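To make the link between MTBF, fleet size, and lifetime concrete, here is a small Python sketch. It assumes failures arrive at a constant rate of 1/MTBF per unit (a Poisson model) and that units fail independently; both are modeling assumptions on my part, and the function name `pr_fail` and the numbers fed to it are purely illustrative.

```python
import math

def pr_fail(units, lifetime_years, mtbf_years):
    """Probability of at least one synchronizer failure across a fleet.

    Assumes a constant failure rate of 1/MTBF per unit and independent
    units, so the expected number of failures is units * lifetime / MTBF.
    """
    expected_failures = units * lifetime_years / mtbf_years
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for small x.
    return -math.expm1(-expected_failures)

# Hypothetical numbers for illustration only.
single = pr_fail(units=1, lifetime_years=1000.0, mtbf_years=1000.0)
fleet = pr_fail(units=1_000_000, lifetime_years=10.0, mtbf_years=1e9)
print(f"one unit, lifetime equal to its MTBF: {single:.3f}")
print(f"1e6 units, 10 years, MTBF of 1e9 years: {fleet:.4f}")
```

Under these assumptions, a single unit operated for a time equal to its MTBF has roughly a 63% chance of having failed at least once, which is why the MTBF should not be read as a failure-free period.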