Yesterday I talked to Shaker Sarwary, the senior product director for Atrenta’s clock-domain crossing (CDC) product SpyGlass-CDC. I asked him how it came about. The product was originally started nearly 8 years ago, around the time Atrenta itself got going. Shaker got involved about 5 years ago.
Originally this was a small, insignificant area of timing analysis. Back then few chips failed from CDC problems, for two reasons. First, chips had few clock domains (many chips had only one, so CDC problems were impossible) and second, the chips were not that large. So CDC analysis was done by running static timing analysis (typically PrimeTime, of course), which would throw up the CDC paths as areas that are ignored by timing analysis. They could then be checked manually to make sure that they were correctly synchronized.
But like so many areas of EDA, a few process generations later the numbers all moved. The number of clocks soared, the size of chips soared, and more and more chips were failing due to CDC problems. To make things worse, CDC failures are typically intermittent: a glitch that occasionally gets through a synchronizer, for example. But there were no tools to deal with this issue in any sort of automated way.
Atrenta started by creating a tool that could extract the CDC paths and look for rudimentary synchronizers such as double flops (to guard against metastability). This structural analysis became more and more sophisticated, looking for FIFOs, handshakes and other approaches to synchronizing across clock domain boundaries.
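To make the structural idea concrete, here is a minimal sketch of that kind of check on a toy netlist. It is not how SpyGlass-CDC works internally; the `Flop` record, the flop-to-flop-only netlist and the `find_cdc_crossings` helper are all assumptions made up for illustration. It finds every path that changes clock domain and reports whether the receiving flop looks like the first stage of a double-flop synchronizer, meaning its only load is one more flop on the same clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flop:
    name: str   # instance name (hypothetical)
    clock: str  # clock domain driving this flop
    d: str      # name of whatever drives the D input (toy model: no logic in between)


def fanout(flops, name):
    """All flops whose D input is driven directly by `name`."""
    return [f for f in flops.values() if f.d == name]


def find_cdc_crossings(flops):
    """Return (source, destination, synchronized) for each domain-crossing path."""
    results = []
    for dst in flops.values():
        src = flops.get(dst.d)
        if src is None or src.clock == dst.clock:
            continue  # driven by a primary input, or no domain change
        loads = fanout(flops, dst.name)
        # Double-flop pattern: the first stage feeds exactly one flop, same clock.
        synced = len(loads) == 1 and loads[0].clock == dst.clock
        results.append((src.name, dst.name, synced))
    return results


netlist = {f.name: f for f in [
    Flop("tx_ctrl", "clk_a", "in0"),
    Flop("sync1", "clk_b", "tx_ctrl"),   # first synchronizer stage
    Flop("sync2", "clk_b", "sync1"),     # second synchronizer stage
    Flop("tx_data", "clk_a", "in1"),
    Flop("rx_data", "clk_b", "tx_data"), # crossing with no second stage
    Flop("use1", "clk_b", "rx_data"),
    Flop("use2", "clk_b", "rx_data"),
]}

crossings = find_cdc_crossings(netlist)
for src, dst, ok in crossings:
    print(f"{src} -> {dst}: {'double-flop found' if ok else 'NOT synchronized'}")
```

In this toy netlist the `tx_ctrl` crossing is reported as synchronized and the `tx_data` crossing is flagged. A real tool also has to see through combinational logic and recognize FIFOs and handshakes, which this sketch ignores.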
Eventually this purely structural approach was not enough on its own and a functional approach needed to be added. This uses Atrenta’s static formal verification engine to check that various properties remain true under all circumstances. For example, consider the simple case of a data-bus crossing a clock domain along with a control signal; for this to be safe the data-bus signals must be stable when the control signal indicates the data is ready. Or consider the case of using a FIFO to create some slack between the two domains (so data can be repeatedly generated by the transmitter domain and stored until the receiver domain can accept it). The FIFO pointers need to be Gray coded (so that only one signal changes when the pointer is incremented or decremented; a normal counter generates all sorts of intermediate values when carries propagate) and again, proving this cannot be done simply by static analysis.
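The Gray-code property is easy to see with a few lines of Python. This is only an illustration of the encoding, not of the formal proof a tool would perform; the helper names and the 8-entry pointer range are assumptions chosen for the example. It uses the standard binary-to-Gray conversion `n ^ (n >> 1)` and counts how many bits flip between consecutive pointer values, including the wrap-around from the last value back to zero.

```python
def to_gray(n: int) -> int:
    """Standard binary-reflected Gray code of n."""
    return n ^ (n >> 1)


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def max_bits_changed(codes) -> int:
    """Worst-case number of bits that flip between consecutive codes (with wrap-around)."""
    return max(
        bits_changed(codes[i], codes[(i + 1) % len(codes)])
        for i in range(len(codes))
    )


DEPTH = 8  # hypothetical FIFO depth, a power of two
binary_ptrs = list(range(DEPTH))
gray_ptrs = [to_gray(n) for n in binary_ptrs]

print("binary worst case:", max_bits_changed(binary_ptrs))  # 3 bits (e.g. 011 -> 100)
print("gray worst case:  ", max_bits_changed(gray_ptrs))    # 1 bit
```

With a plain binary pointer, a step such as 3 to 4 flips three bits, so a receiver sampling mid-transition could see a value that was never written. With the Gray-coded pointer only one bit ever changes, so the sampled value is either the old pointer or the new one.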
When CDC errors escape into the wild it can be very hard to determine what is going on. One company in the US, for example, had a multi-million gate chip connecting via USB ports. It would work most of the time but freeze every couple of hours. It took 3 months of work to narrow it down to the serial interfaces. After a further long investigation it turned out that it was a CDC problem generating intermittent glitches. There were synchronizers, but glitches must either be gated off (for data) or not generated (for control signals).
Another, even more subtle, case was a company in Europe that had an intermittent problem that, again, took months to analyze. It turned out that the RTL was safe (properly synchronized) but the synthesis tool modified the synchronizer, replacing a mux (glitch-free) with a complex AOI gate (not glitch-free).
In a big chip, which may have millions of CDC paths, the only approach is to automate the guarantees of correctness. If not, you are doomed to spend months working out what is wrong instead of ramping your design to volume. More information on SpyGlass-CDC here.