System Reliability Audits

by Paul McLellan on 07-25-2013 at 12:09 pm

How reliable is your cell-phone? Actually, you don’t really care. It will crash from time to time due to software bugs and you’ll throw it away after two or three years. If a few phones also crash due to stray neutrons from outer space or stray alpha particles from the solder balls used in the flip-chip bonding then nobody cares.

How about your heart pacemaker? Or the braking system in your car? Or the router at the head of a transpacific fiber-optic cable? OK, now you start to care.


iRocTech provides audit services at the system level for these sorts of situations. However, at the system level, the overall reliability depends, obviously, on the reliability of the various components. One big problem is that the component suppliers are not always co-operative. In some cases they simply don’t know the reliability of their components. But they also tend to want to provide only the best possible data so that it cannot be used against them. It is as if we went to TSMC and asked about cell timing, got given the typical corner, and then were told that they hadn’t a clue when we asked about a worst-case corner because they didn’t want anyone to know just how slow the process might get.

The problem is actually getting worse. For all the same reasons that we want to put 28nm and 20nm silicon into cell-phones (especially low dynamic and low leakage power, lots of gates, performance), engineers designing implantable medical electronics and aviation electronics want to do so too. But the leading-edge processes and foundries are driven by the mobile industry, which is probably the industry least concerned with reliability of all semiconductor end-markets (well, OK, birthday cards that play a tune when you open them, $5 calculators, but these are not really markets). This means that there is not as much focus on reliability, and on measuring it, as the markets outside of mobile require.

The big markets that iRoC works on for system reliability are:

  • networking: not your living room wireless router but the big ones that form internet and corporate backbones. They need an accurate MTBF number
  • automotive: an especially extreme temperature environment (it gets hot under the hood in the desert) and very long lifetime (cars need to work for 15-20 years)
  • avionics: at high altitude (never mind in space) there is 300-400 times the neutron flux that there is at sea level
  • medical: in particular implantable medical. These run at very low voltage, since you may have to open up someone’s chest when the battery runs out. They sometimes end up in hostile environments too, when you go for an MRI or a CAT scan or get on a plane
  • nuclear plants: historically these have been built with mostly electro-mechanical technology due to the neutrons and gamma rays that may be released in an emergency, but they are now retrofitting and need to be able to use electronics
  • military and space: there really aren’t any rad-hard foundries left so commercial components are used more and more, but reliability has to be high in an aggressive environment

What these industries would like to do is to push down their system reliability requirements to the component vendors, but compared to mobile they don’t have enough influence, at least in the short term. A second-best solution is to find out the reliability of the components and work up from there to a system reliability number.
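As a rough illustration of rolling component numbers up to a system number, here is a minimal sketch. It assumes constant, independent failure rates and a series system with no redundancy, so that the component rates simply add. The component names and FIT values (failures per billion device-hours) are made up for the example and are not measured data.

```python
# Hypothetical component failure rates in FIT (failures per 1e9 device-hours).
# Illustrative numbers only, not data from any vendor.
component_fit = {
    "processor": 400.0,
    "sram": 250.0,
    "power_supply": 100.0,
    "serdes": 50.0,
}

HOURS_PER_BILLION = 1e9


def system_fit(fits):
    """Series system: any component failure fails the system, so rates add."""
    return sum(fits.values())


def mtbf_hours(total_fit):
    """MTBF in hours for a constant failure rate expressed in FIT."""
    return HOURS_PER_BILLION / total_fit


total = system_fit(component_fit)
print(f"system FIT: {total:.0f}")
print(f"MTBF: {mtbf_hours(total):,.0f} hours")
```

A real audit would also have to account for redundancy, derating and the operating environment, none of which this sketch models.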

One end-market that is not on the list is cloud computing. At the level of big data centers, events that we consider rare on our own computer (a disk drive fails, the processor melts, the power-supply blows up) are everyday occurrences and so the infrastructure has to be built to accommodate this. For example, GFS (Google File System) never stores any file on fewer than three separate disks in different geographical locations (Google is actually prepared for a meteor hit on a datacenter that permanently destroys it without impacting service). I don’t want to imply Google is special. I’m sure Facebook and Amazon and Apple are all the same; it is just that I know a little more about Google since they have published more in the open literature (and I have done some consulting for them).

Since some measurable problems, especially latchup and single event functional interrupt (SEFI), are actually very rare, they are hard to measure. If only a short period of measurement is done then the numbers may look deceptively good. However, the reality is that the mean might be good but the standard deviation is enormous. A better reliability measure than the mean alone is the mean plus one standard deviation. To get that measure to look good, extensive measurement is required to get the standard deviation down to something manageable along with a better estimate of the mean. Single event upsets (SEU), which can be accelerated with a neutron beam (as I wrote about here), are much more common and so the standard deviation is much narrower.
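To see why a short measurement is not enough, here is a small sketch. It assumes the events follow Poisson statistics, so that the one-sigma uncertainty on an observed count of k events is roughly the square root of k. The event counts and device-hours are invented for illustration.

```python
import math


def rate_estimate(events, device_hours):
    """Point estimate of the event rate, in events per device-hour."""
    return events / device_hours


def rate_sigma(events, device_hours):
    """Approximate one-sigma uncertainty on the rate for Poisson counts.

    Uses sqrt(events) as the standard deviation of the count, which is a
    poor approximation when very few (or zero) events were observed.
    """
    return math.sqrt(events) / device_hours


def mean_plus_sigma(events, device_hours):
    """The 'mean plus one standard deviation' figure of merit."""
    return rate_estimate(events, device_hours) + rate_sigma(events, device_hours)


# Same underlying rate (1 event per million device-hours), two test lengths.
short_test = mean_plus_sigma(events=1, device_hours=1e6)
long_test = mean_plus_sigma(events=100, device_hours=1e8)

print(f"short test: {short_test:.2e} events per device-hour")
print(f"long test:  {long_test:.2e} events per device-hour")
```

Both tests give the same mean, but the short one has to be quoted at twice the mean once one standard deviation is added, while the test that is a hundred times longer only adds about 10%.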


Of course, once there is a measure, the question is what to do about it. It is a well-known proverb that a chain is only as strong as the weakest link. But a corollary is that there is no point in having especially strong links, in particular there is no point in strengthening links other than the weakest. Identifying the lowest reliability component and improving it is how overall system reliability can be improved.
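The weakest-link argument can be shown with a few lines of arithmetic. This sketch assumes a series system in which component failure rates add, and uses hypothetical FIT values. It compares the effect of halving the failure rate of the worst component with halving that of the best one.

```python
# Hypothetical failure rates in FIT (failures per 1e9 device-hours).
fit_by_component = {
    "processor": 400.0,
    "sram": 250.0,
    "power_supply": 100.0,
    "serdes": 50.0,
}


def total_after_halving(fits, name):
    """System FIT if the named component's failure rate is cut in half."""
    improved = dict(fits)
    improved[name] = fits[name] / 2
    return sum(improved.values())


weakest = max(fit_by_component, key=fit_by_component.get)
strongest = min(fit_by_component, key=fit_by_component.get)

baseline = sum(fit_by_component.values())
print(f"baseline: {baseline:.0f} FIT")
print(f"halve {weakest}: {total_after_halving(fit_by_component, weakest):.0f} FIT")
print(f"halve {strongest}: {total_after_halving(fit_by_component, strongest):.0f} FIT")
```

With these made-up numbers, the same relative improvement is worth 200 FIT when applied to the weakest component and only 25 FIT when applied to the strongest.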

iRoc Technologies website is here.

