Whenever we push the bounds of reliability in any domain, we run into new potential sources of error. Perhaps not completely new, but rather concerns new to that domain. That’s the case for Single Event Upsets (SEUs) which are radiation-triggered bit-flips, and Single Event Transients (SETs) which are radiation-triggered pulses propagating in a circuit. These used to be important primarily for space-based electronics and devices operated close to nuclear reactors, but as circuit sizes shrink and expectations rise, they have also become a concern in safety-critical auto electronics.
SEUs and SETs can be triggered in multiple ways, through nuclear events and external electromagnetic events. Years ago, we worried about ionization caused by alpha-decay from isotopes in the lead in packaging (alpha particles have very short range so any effect has to originate very close to the die). That source seems to be less of a concern now, either thanks to isotopic refinement or use of other materials. Aside from transients induced by lightning, the rest of the problem comes from cosmic rays. There is evidence that a significant percentage of these start in, or are accelerated through, supernovae, though details are still in debate. The bulk of the flux incident on the earth is protons. These interact quickly in the atmosphere, in part converting to neutrons through electron capture or other mechanisms. Since neutrons have a low interaction cross section, they can make it to ground level quite easily, where they can potentially disrupt electronics.
Neutrons have no charge so they can’t directly disrupt an electrical circuit, but they can collide elastically or inelastically with a nucleus. In an elastic collision, the target nucleus is knocked out of the lattice, ionizes in the disruption and can trigger an electrical event. In the inelastic case, the neutron can trigger fission which then has electrical consequences. One significant source starts with [SUP]10[/SUP]B (boron, used in doping), which with the added neutron splits into [SUP]7[/SUP]Li (lithium) and an alpha particle. And then there are multiple inelastic reactions with silicon producing ionizing secondaries:
[INDENT=2][SUP]28[/SUP]Si + n → [SUP]28[/SUP]Al + n
[SUP]28[/SUP]Si + n → [SUP]27[/SUP]Al + d
[SUP]28[/SUP]Si + n → [SUP]25[/SUP]Mg + α
[SUP]28[/SUP]Si + n → [SUP]28[/SUP]Al + p
[SUP]28[/SUP]Si + n → [SUP]27[/SUP]Al + p + n
[SUP]28[/SUP]Si + n → [SUP]24[/SUP]Mg + n + α
Since these events are triggered in or near a transistor, electromagnetic impact is all but certain. Not that this happens very often. Matter looks largely empty to neutrons so a thin die is a negligible barrier (shielding neutron flux from reactors requires thick walls of hydrogen-rich material such as concrete or water). But the event rate isn’t zero. Xilinx estimated the mean time between failures (MTBF) due to SEU for one of their large Virtex devices to be over 600 years. There is nothing here unique to FPGAs, so put 100 devices together in a car and, because the failure rates add, you could have a failure about every 6 years. Put a million of those cars on the road and you have a serious problem, especially given that average car lifetime these days is around 15 years. A surprising impact for something that started towards us from millions or billions of light-years away.
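The arithmetic behind those numbers can be sketched in a few lines of Python. This is a back-of-envelope model only: it assumes every device has the same 600-year MTBF, that upsets are independent and occur at a constant rate, and that failure rates simply add. The constant and function names are illustrative, not taken from any vendor tool.

```python
import math

DEVICE_MTBF_YEARS = 600.0   # per-device estimate quoted in the text
DEVICES_PER_CAR = 100       # device count assumed in the text
CAR_LIFETIME_YEARS = 15.0   # average car lifetime quoted in the text
FLEET_SIZE = 1_000_000      # cars on the road


def system_mtbf(device_mtbf_years: float, n_devices: int) -> float:
    """MTBF of n identical, independent devices: their failure rates add."""
    return device_mtbf_years / n_devices


def prob_at_least_one_failure(mtbf_years: float, duration_years: float) -> float:
    """Chance of one or more failures in the interval (constant-rate model)."""
    return 1.0 - math.exp(-duration_years / mtbf_years)


car_mtbf = system_mtbf(DEVICE_MTBF_YEARS, DEVICES_PER_CAR)
p_car = prob_at_least_one_failure(car_mtbf, CAR_LIFETIME_YEARS)
fleet_upsets_per_year = FLEET_SIZE / car_mtbf

print(f"Per-car MTBF: {car_mtbf:.1f} years")
print(f"P(at least one upset in {CAR_LIFETIME_YEARS:.0f} years): {p_car:.2f}")
print(f"Expected upsets per year across the fleet: {fleet_upsets_per_year:,.0f}")
```

Under these assumptions a single car has roughly a 92% chance of seeing at least one upset over a 15-year life, and a million-car fleet sees on the order of 170,000 upsets a year.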
How do you fix this? You could make devices bigger so that charge flux from a single event would be negligible. But that’s going in the wrong direction in modern device design. The alternative is redundancy with voting. Critical circuits have three copies; if one copy is disrupted, the probability of either of the other copies being hit at the same time is minuscule (multiply probabilities). So a vote on the outputs of the three circuits should have a very high probability of being correct, even in the presence of SEUs. Don Dingee has written recently and more extensively on this topic in SemiWiki. But you can’t use redundancy everywhere; the design would be huge. So you just use redundancy in the safety-critical sections. And there’s the rub, as Shakespeare might have said if he knew more about functional safety. When you start being selective, you can make mistakes. You need a backstop to catch those mistakes.
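A minimal Python sketch of the voting idea follows. It models a bitwise two-of-three majority voter and the "multiply probabilities" argument, assuming each copy is upset independently with the same probability p in a given time window. The value of p used in the usage note and all names are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit agrees with at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(value: int, bit: int) -> int:
    """Model an SEU by inverting a single bit of a stored value."""
    return value ^ (1 << bit)


def tmr_failure_probability(p: float) -> float:
    """Probability that two or three of the three copies are upset at once,
    assuming independent upsets with per-copy probability p."""
    return 3 * p**2 * (1 - p) + p**3


golden = 0b1011_0010          # value held by all three copies
upset = flip_bit(golden, 5)   # one copy takes a hit

# A single upset is outvoted by the two good copies.
assert majority_vote(golden, upset, golden) == golden

# Two copies upset in the same bit defeat the voter.
assert majority_vote(upset, upset, golden) == upset
```

With a hypothetical per-copy upset probability of one in a million, `tmr_failure_probability(1e-6)` comes out at about 3 in 10^12, which is the sense in which the combined probability is minuscule.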
The best way to do this, interestingly, is through fault simulation. Fault sim seemed to vanish from the design universe when DFT and ATPG took off, so it’s worth recapping how it works. You simulate the good machine (no faults), inject a fault, simulate the bad machine and compare to find differences in behavior. Except in the SEU/SET case, we’re no longer looking for manufacturing problems. We’re looking for “faults” which are bit flips or propagating pulses. And we’re no longer concerned about sorting good versus bad die. We’re looking at in-field problems and want to determine if the redundancy logic adequately screens these out in safety-critical areas. This is a perfect application for fault simulation. But just like you don’t want to insert redundant logic everywhere, you don’t want to fault every single node. The safety verification team (separate from the design team) builds a “fault dictionary” listing all the nodes they want to test. Then separation of design and verification (plus ISO 26262 process compliance/oversight) provides confidence that all safety-critical logic is indeed hardened to SEU/SET.
Cadence has a comprehensive functional safety simulation solution in Incisive. They work directly with all players in the value chain from auto-makers through tier-1 and on down to component manufacturers, and are actively involved in further developing the safety standards. They told me that chapter 11 of ISO 26262 is in development and is expected to quantify several requirements that today are only described qualitatively. You can learn more about the Cadence safety simulation solution HERE.