I wrote recently on ANSYS and TSMC’s joint work on thermal reliability workflows, which are becoming much more important in advanced processes and packaging. Xilinx provided their own perspective on thermal reliability analysis for their unquestionably large systems (SoC, memory, SERDES and high-speed I/O) stacked within a package. This was presented at the ANSYS Innovation Conference in Santa Clara recently. They put special emphasis on applications in datacenters and automotive, two areas where FPGAs are playing important roles because they can be upgraded in the field to meet new demands.
I’ve talked before about AI functions in the datacenter. Die sizes of leading-edge 7nm AI chips are already reaching reticle limits, and these chips consume hundreds of watts of power. Automotive electronics, on the other hand, operate in very harsh environments for extended periods of time. They must be highly reliable and safe, with zero field failures over a life span of 10 to 15 years. The smallest failure in a safety-critical system could potentially cause a fatality, which is unacceptable. Both applications create new demands to ensure thermal reliability.
Thermal reliability is a big deal in these advanced designs for multiple reasons. First, FinFET transistors are prone to something called self-heating: they heat up more quickly than traditional planar transistors. Second, interconnects have some resistance and generate heat when current flows (Joule heating). Third, heat dissipates very slowly on electronic switching time scales. And fourth, all this heating is compounded when you stack chips on top of each other. That’s a problem for reliability because increased temperatures accelerate (among other things) electromigration (EM), thermally induced mechanical stress and solder joint fatigue, leading ultimately to functional failures.
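To get a feel for why a few degrees matter for EM, here is a minimal sketch based on the temperature term in Black's equation, the standard textbook EM lifetime model, in which lifetime scales as exp(Ea / kT). This is not something Xilinx or ANSYS presented; the function name and the 0.9 eV activation energy are illustrative assumptions, and real values are process-specific.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def em_lifetime_ratio(temp_c, delta_t, activation_energy_ev=0.9):
    """Ratio of EM lifetime at (temp_c + delta_t) to lifetime at temp_c.

    Uses only the Arrhenius term of Black's equation, so current density
    is assumed unchanged. The default activation energy is an assumed,
    illustrative value, not foundry data.
    """
    t_cool = temp_c + 273.15   # baseline temperature in kelvin
    t_hot = t_cool + delta_t   # heated temperature in kelvin
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_hot - 1.0 / t_cool)
    return math.exp(exponent)


# Hypothetical case: a wire at 105 C that heats by a further 5 degrees.
ratio = em_lifetime_ratio(105.0, 5.0)
```

A ratio below 1 means the heated wire is expected to fail sooner; because the dependence is exponential, even a small ΔT error feeds noticeably into the lifetime estimate.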
What I found interesting about the Xilinx story, because I’m a math and physics nerd, is that Xilinx and ANSYS shared a bit more on how this flow handles modeling for the heat diffusion problem in chip/package structures.
ANSYS RedHawk (or ANSYS Totem for analog blocks) computes, based on detailed knowledge of layout and structures together with simulation, a ΔT (temperature rise above nominal) for each wire. This comes from self-heating and Joule heating. The tool does this for all wires and then, per wire, looks at the impact of heating in neighboring wires. The closer a neighbor is to the wire of interest, the higher the impact it will have. These coupling contributions are calibrated to the process in the tool. Add together all meaningful contributions from neighboring wires (superposition) and you get the total heating in the current wire.
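The superposition step can be sketched roughly as follows. This is only an illustration of the idea, not the RedHawk implementation: the exponential distance weighting, the `coupling_length` and `cutoff` parameters and the wire data layout are all assumptions standing in for the process-calibrated coupling the tool actually uses.

```python
import math


def superposed_delta_t(wires, coupling_length=1.0, cutoff=5.0):
    """Total ΔT per wire: own heating plus distance-weighted neighbor heating.

    wires: list of dicts with 'x', 'y' (position, in um) and 'self_dt'
    (the wire's own ΔT from self-heating and Joule heating).
    The exp(-d / coupling_length) kernel is a hypothetical stand-in for
    the calibrated coupling model; neighbors beyond `cutoff` are ignored.
    """
    totals = []
    for i, wire in enumerate(wires):
        total = wire["self_dt"]
        for j, neighbor in enumerate(wires):
            if i == j:
                continue
            distance = math.hypot(wire["x"] - neighbor["x"], wire["y"] - neighbor["y"])
            if distance > cutoff:
                continue  # too far away to contribute meaningfully
            # Closer neighbors get a weight nearer to 1, so they matter more.
            total += neighbor["self_dt"] * math.exp(-distance / coupling_length)
        totals.append(total)
    return totals


# Hypothetical layout: two adjacent wires and one isolated wire.
example_wires = [
    {"x": 0.0, "y": 0.0, "self_dt": 1.0},
    {"x": 1.0, "y": 0.0, "self_dt": 2.0},
    {"x": 100.0, "y": 0.0, "self_dt": 3.0},
]
example_totals = superposed_delta_t(example_wires)
```

In this toy layout the two adjacent wires heat each other, while the isolated wire keeps only its own ΔT.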
It turns out this can overestimate heating in some cases. For example, foundry estimates might show no more than a 5° ΔT in areas of dense heating, whereas a superposition calculation can exceed that limit. Xilinx and ANSYS figured out a way to compensate for this effect by applying a ΔT clamping approach which bounds the over-estimate. The approach also puts heating for current flow isolated to a single wire at more like 1.25°, well below the nominal 5° ΔT, correlating well with foundry estimates. Based on these calculations, local EM failure rates can be calculated quite accurately and can show, especially in those isolated-wire heating instances, less pessimistic estimates than global approaches.
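In its simplest form, clamping just caps each superposed value at the foundry bound. The sketch below assumes a single global limit and a hypothetical function name; the actual Xilinx/ANSYS clamping scheme was not described in detail and may well be more refined.

```python
def clamp_delta_t(raw_delta_ts, dt_limit=5.0):
    """Cap each superposed ΔT at dt_limit; values below the limit pass through.

    dt_limit defaults to the 5-degree figure quoted above, used here
    purely as an illustrative bound.
    """
    return [min(dt, dt_limit) for dt in raw_delta_ts]


# An isolated wire (1.25), a moderately heated wire (4.0) and a dense-area
# wire whose superposed estimate (7.5) exceeds the bound.
clamped = clamp_delta_t([1.25, 4.0, 7.5])
```

Only the over-estimated value is pulled back to the limit; lightly heated wires keep their lower, less pessimistic ΔT.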
Xilinx next talked about temperature gradients across the chip. Traditionally you require that worst-case transistor junction temperatures be held below some maximum allowable level across the design. Heating from any of the above sources adds to this problem, leaving you with few options: spread out any places that get hot (wasting area), go for a bigger device, or run the clock slower until the chip cools down. But a more granular approach may show that trying to design your way out of the problem is over-compensating.
Here the Xilinx approach gets interesting. They calculate a cumulative failure rate (CFR) for the chip in a compositional fashion, where each block has its own CFR budget. For blocks where T is low in this temperature distribution, there is no concern. For a block where T is high, they re-examine the CFR budget for that block to determine if it can be adjusted while still ensuring an acceptable lifetime for the whole device. They don’t explain how they do this, but they do provide a couple of references that the more determined among you may find relevant.
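Since the actual method isn't disclosed, the following is purely a guess at what a budget re-examination could look like. It assumes block CFRs simply add up to the device CFR (a common approximation when failure probabilities are small), and every name and number in it is hypothetical.

```python
def review_cfr_budgets(blocks, device_budget):
    """Check whether hot blocks that exceed their budget still fit the device budget.

    blocks: dict mapping block name -> (estimated_cfr, budgeted_cfr).
    Assumes device CFR is the plain sum of block CFRs, which is an
    assumption of this sketch and not a documented Xilinx method.
    """
    total_estimated = sum(estimated for estimated, _ in blocks.values())
    # Blocks whose estimated CFR exceeds their own budget, with the overshoot.
    over_budget = {
        name: estimated - budget
        for name, (estimated, budget) in blocks.items()
        if estimated > budget
    }
    # Unused budget in the cooler blocks, available for redistribution.
    slack = sum(budget - estimated for estimated, budget in blocks.values() if estimated <= budget)
    return {
        "device_ok": total_estimated <= device_budget,
        "total_estimated": total_estimated,
        "over_budget": over_budget,
        "slack": slack,
    }


# Hypothetical numbers: the hot SERDES block overshoots its own budget,
# but the cooler blocks leave enough slack for the device as a whole.
example_blocks = {
    "serdes": (0.003, 0.002),
    "fabric": (0.001, 0.004),
    "io": (0.001, 0.002),
}
review = review_cfr_budgets(example_blocks, device_budget=0.008)
```

The point of the sketch is the shape of the argument: a locally hot block need not force a redesign if the device-level total still meets its lifetime target.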
It’s an interesting study, and you can learn more by registering to watch the webinar.