We’ve introduced the concepts behind triple modular redundancy (TMR) before, using the built-in capability in Synopsys Synplify Premier to synthesize TMR circuitry into FPGAs automatically. A recent white paper authored by Angela Sutton revisits the subject with some important new insights.
As more systems become critical, high reliability is an increasing concern for FPGA designers. Radiation tolerance was once the territory of teams working on applications in space, defense, nuclear medicine, and nuclear power plants. FPGAs have changed drastically over the last several years and now find use in automotive systems, media broadcast centers, and a wider range of medical and industrial applications. The result?
Soft errors become more likely. As process geometries shrink and switching speeds increase, the critical charge required to create a significant glitch decreases, and the probability that the glitch is captured and retained increases.
Radiation-induced soft errors in FPGAs also carry higher stakes now that FPGAs often handle large volumes of data. FPGAs are popular vehicles for algorithm acceleration in deep learning, fintech, IoT, and other analytics uses. A single upset can corrupt an entire data stream or analytics result set, and the effects can cascade quickly across a network, a scenario that is very hard to detect and correct downstream.
TMR is a time-proven way to defeat single-bit errors. The odds of radiation striking any single circuit are fairly good, but the odds of radiation simultaneously striking three copies of that circuit (perhaps distributed physically across the FPGA) are much lower. TMR executes the logic three times in parallel and feeds the three results to a voter, which selects the unanimous result, or the majority result if one of the three paths has encountered a glitch.
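The voting step can be sketched in a few lines of Python. This is a behavioral model only, not synthesized hardware: the helper names (majority_vote, flip_bit, tmr) and the 8-bit example function are hypothetical. The voter is the standard bitwise 2-of-3 majority, so each output bit takes whatever value at least two of the three copies agree on.

```python
from typing import Callable, Optional


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (b & c) | (a & c)


def flip_bit(value: int, bit: int) -> int:
    """Model a single-event upset by inverting one bit."""
    return value ^ (1 << bit)


def tmr(logic: Callable[[int], int], x: int,
        upset_copy: Optional[int] = None, upset_bit: int = 0) -> int:
    """Run `logic` three times in parallel and vote; optionally corrupt one copy's result."""
    results = [logic(x) for _ in range(3)]
    if upset_copy is not None:
        results[upset_copy] = flip_bit(results[upset_copy], upset_bit)
    return majority_vote(*results)


def add_seven(x: int) -> int:
    """Hypothetical 8-bit logic function being protected."""
    return (x + 7) & 0xFF


if __name__ == "__main__":
    print(tmr(add_seven, 10))                             # 17: no upset
    print(tmr(add_seven, 10, upset_copy=1, upset_bit=3))  # 17: the corrupted copy is outvoted
```

The same model also shows the limit of TMR: if two copies are upset in the same bit, the majority is wrong, which is why physically separating the three copies matters.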
Sutton covers the basics of the three flavors of TMR: local, distributed, and block. We’ve discussed before which approach to use based on what is being protected. Fortunately, once those design decisions are made, Synplify Premier automatically handles the heavy lifting of triplicating logic and adding the voting circuitry.
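A simplified behavioral model helps show why "what is being protected" drives the choice. It assumes the terms as they are commonly used: local TMR triplicates only the registers, so one shared copy of the combinational logic feeds all three; distributed TMR also triplicates the combinational logic. Block TMR, which triplicates whole modules and votes at their outputs, is not modeled. The function names and the example logic are hypothetical, and a glitch is modeled as an XOR mask on the combinational result.

```python
from typing import Callable, Optional


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (b & c) | (a & c)


def local_tmr_step(x: int, comb: Callable[[int], int], glitch: Optional[int] = None) -> int:
    """Simplified local TMR: one shared copy of the combinational logic feeds three registers."""
    value = comb(x)
    if glitch is not None:
        value ^= glitch  # a transient on the shared path reaches every register
    registers = [value, value, value]
    return majority_vote(*registers)


def distributed_tmr_step(x: int, comb: Callable[[int], int], glitch: Optional[int] = None) -> int:
    """Simplified distributed TMR: each register has its own copy of the combinational logic."""
    registers = [comb(x) for _ in range(3)]
    if glitch is not None:
        registers[0] ^= glitch  # a transient hits only one of the three paths
    return majority_vote(*registers)


def times_three(x: int) -> int:
    """Hypothetical 8-bit combinational function."""
    return (x * 3) & 0xFF


if __name__ == "__main__":
    print(local_tmr_step(5, times_three, glitch=0b100))        # 11: the glitch is captured
    print(distributed_tmr_step(5, times_three, glitch=0b100))  # 15: the glitch is outvoted
```

In this model, local TMR protects the stored state but not a transient in the shared logic ahead of the registers, while distributed TMR masks it at the cost of roughly triplicating that logic as well.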
Discussions of TMR implementation usually end about there; the proof is left to the reader. However, this paper contains a new section titled:
Susceptibility of FPGA building blocks to SEUs depends on the type of FPGA device fabric
Yes, there are major differences in how antifuse (Microsemi), flash (Altera), and SRAM-based (Xilinx and newer Altera parts) FPGAs behave when hit with radiation. Again, this is a soft-error discussion; hard radiation failures are another subject entirely. The differences in how soft errors manifest arise from how each fabric handles registers, memories, configuration bits, interconnect routing, and other details, and how susceptible each of those elements is to upset.
One key observation is that while single-bit errors can be corrected transparently, flagging the error condition is often just as important. Additional mitigation in hardware or software can be kicked off when an error flag appears, preventing the error from propagating. The paper also discusses synchronous feedback and flushing, another area where errors can creep through.
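Both points can be illustrated with a small behavioral sketch, under stated assumptions: the voter also reports a mismatch flag, and a hypothetical 8-bit counter stands in for any synchronous feedback loop. This is one reading of the feedback and flushing issue, not the paper's own implementation; TmrCounter, vote_with_flag, and vote_in_feedback are illustrative names. When the voted value drives every copy's next state, an upset copy is overwritten on the next clock; when each copy feeds back only its own value, the error persists.

```python
from typing import List, Tuple


def vote_with_flag(a: int, b: int, c: int) -> Tuple[int, bool]:
    """Return the bitwise 2-of-3 majority and a flag that is True if any copy disagrees."""
    voted = (a & b) | (b & c) | (a & c)
    mismatch = not (a == b == c)
    return voted, mismatch


class TmrCounter:
    """Hypothetical 8-bit counter held in three redundant registers."""

    def __init__(self, vote_in_feedback: bool = True) -> None:
        self.regs: List[int] = [0, 0, 0]
        self.vote_in_feedback = vote_in_feedback

    def upset(self, copy: int, bit: int) -> None:
        """Model a single-event upset by flipping one bit in one register copy."""
        self.regs[copy] ^= 1 << bit

    def clock(self) -> Tuple[int, bool]:
        """Advance one cycle; return this cycle's voted output and error flag."""
        voted, flag = vote_with_flag(*self.regs)
        if self.vote_in_feedback:
            # The voted value feeds every copy's next-state logic, so a bad copy is flushed.
            self.regs = [(voted + 1) & 0xFF for _ in range(3)]
        else:
            # Each copy feeds back only its own value, so an upset copy stays wrong.
            self.regs = [(r + 1) & 0xFF for r in self.regs]
        return voted, flag


if __name__ == "__main__":
    for feedback in (True, False):
        counter = TmrCounter(vote_in_feedback=feedback)
        counter.clock()
        counter.clock()
        counter.upset(copy=2, bit=4)
        flags = [counter.clock()[1] for _ in range(3)]
        print(feedback, flags, counter.regs)
```

With the voted value in the feedback path, the flag fires for a single cycle and all three copies agree again afterward. Without it, the output stays correct for now, but the flag stays raised and the bad copy keeps diverging, so a later upset in a second copy could defeat the voter.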
Synopsys has invested considerable R&D effort in Synplify Premier to understand TMR mitigation and how the approaches differ between FPGA families. Designers could certainly invest the effort to learn and master those nuances, but there is tremendous value in a tool that accounts for many radiation-protection scenarios and handles FPGA synthesis accordingly, without extensive manual labor.