Fab question: what are the symptoms that require unscheduled maintenance?

jms_embedded

When tools need unscheduled maintenance, what does that typically mean? Does one of the following tend to dominate?

1. Some mechanical failure (CLUNK!) that breaks a component
2. The tool goes offline on its own, reporting an error code, like a copier flashing "1F" to indicate something
3. Process monitoring shows some process parameter out of tolerance or heading out of tolerance
 
The third shouldn't be an immediate trigger for unscheduled maintenance, since the parameter may be affected by several modules. The most likely scenario is a sudden rise in particle counts or contamination in a tool.
 
The science of reliability can be somewhat helpful here. Things wear out like bankruptcy: gradually, then all at once. PMs (preventive maintenance) cover the "gradually" part; unscheduled, longer downs cover the "all at once".

The bathtub survival curve actually has three domains: infant mortality, where non-survival is due to manufacturing flaws; the intrinsic or stable (exponential) failure period, the meat of the survival curve, with a nearly flat bottom; and then end-of-life wearout, where survival falls rapidly.
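To make the three domains concrete, here is a minimal sketch of a bathtub-shaped hazard rate built by summing three Weibull hazards: a decreasing one for infant mortality, a constant one for the stable middle, and an increasing one for wearout. All shape and scale parameters are invented for illustration.

```python
# Bathtub hazard as a sum of three Weibull hazard rates:
# k < 1 decreasing (infant mortality), k = 1 constant (random
# intrinsic failures), k > 1 increasing (end-of-life wearout).
# All parameters below are made up for illustration.

def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k/lam) * (t/lam)**(k-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

for t in [0.1, 0.5, 1, 2, 5, 8, 10]:                     # arbitrary time units
    infant = weibull_hazard(t, shape=0.5, scale=1.0)     # k < 1: decreasing
    random_ = weibull_hazard(t, shape=1.0, scale=5.0)    # k = 1: flat bottom
    wearout = weibull_hazard(t, shape=4.0, scale=8.0)    # k > 1: rising
    print(f"t={t:5.1f}  total hazard={infant + random_ + wearout:.3f}")
```

The total falls at first, sits nearly flat, then climbs again: the bathtub.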

Fab tools experience many issues and require frequent intervention. Failure analysis of common failures helps determine when to do PMs or other interventions: basically, investigating the moving part to see what makes it wear out. Each fab has to do this to the degree they find it adds value. Some fabs do not do it at all; Intel does not, they spend, spend, spend on new parts. The fab department I'm part of does it, very successfully, and we find ways to make our parts much more reliable, which gives us a cost advantage. We also understand the process better, since science advances when you can take something apart and put it back together better than it was before.
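As a hedged illustration of turning failure analysis into a PM interval, here is one way a fab might fit a Weibull distribution to observed lifetimes of a wearing part and read off a B10 life (the time by which roughly 10% of parts fail) as a candidate PM interval. The failure times below are invented.

```python
# Hypothetical sketch: fit a Weibull to observed failure times of a
# wearing part, then use the 10th-percentile life (B10) as one
# candidate PM interval. The hours below are made up.
from scipy.stats import weibull_min

hours_to_failure = [812, 945, 1103, 1200, 1322, 1410, 1530, 1644, 1790, 1902]

shape, loc, scale = weibull_min.fit(hours_to_failure, floc=0)  # fix location at 0
b10 = weibull_min.ppf(0.10, shape, loc=loc, scale=scale)       # 10th percentile life

print(f"Weibull shape k = {shape:.2f} (k > 1 suggests wearout dominates)")
print(f"B10 life ~ {b10:.0f} hours: one candidate PM interval")
```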
 
Aha, that makes sense. What kinds of things wear out in that all-at-once manner?
 
Section 2.7 of Applied Reliability (Tobias and Trindade), "Bathtub Curve for Failure Rates": the three regions are the early failure period (infant mortality); the intrinsic or stable failure period, which has the interesting property of "no memory," i.e. failures are random; and finally the wearout failure period, in which failures accelerate again.
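The "no memory" property can be checked numerically: for exponential lifetimes, a part that has already survived s hours has the same chance of lasting t more hours as a fresh part has of lasting t hours. A small simulation with an arbitrary failure rate:

```python
# Quick check of the "no memory" property of the intrinsic failure
# period: for exponential lifetimes, P(T > s + t | T > s) = P(T > t).
# The rate below is arbitrary.
import random

random.seed(0)
rate = 0.01                                   # failures per hour (made up)
samples = [random.expovariate(rate) for _ in range(200_000)]

s, t = 50.0, 100.0
survived_s = [x for x in samples if x > s]
cond = sum(x > s + t for x in survived_s) / len(survived_s)
uncond = sum(x > t for x in samples) / len(samples)

print(f"P(T > s+t | T > s) = {cond:.3f}")     # both ~ exp(-rate*t) = 0.368
print(f"P(T > t)           = {uncond:.3f}")
```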

Human perception of reliability is often incorrect and misleading. The "all at once" failures are not really all at once; they are an acceleration due to wearout.

Some misperceptions about reliability that I hear: "Oh, that's a dog tool from the beginning." "The older tools are the best." "That ancient tool isn't good for anything." People are anthropomorphizing the tools, analogizing them to people. Tools are better than people, in a sense, because there is no scientific reason they won't last forever if maintained properly and parts remain available; but when people decide not to maintain a "good for nothing" or "dog" tool, it's a self-fulfilling prophecy.

So to answer your question: there's a knife's-edge decision about whether to maintain a tool properly, investing downtime and energy, or to let it go and fight other fights. The "all at once" failures of doomed tools are largely the result of this choice, mixed with some components entering the wearout failure region.
 
For 2 to happen, 1 or 3 would have to happen first. I don't want to get into specifics on all the causes for tools going down, since there is a litany of reasons, but the point most pertinent to you is that every tool (both in type and in exact specimen) is different.

Using a few generic chemical engineering examples:
-A catastrophic failure, like reactor temperature rising above safe levels (and hopefully triggering pressure relief valves) or a pipe cracking and releasing toxicants.
-The rheology or chemical composition of the products of some unit operation going out of spec (or, worse yet, finding issues with the final product at end of line). Issues found either in-line or end-of-line (EOL) can often be traced back to where they could have happened. As a super basic example, if you are getting more plastic precursor inside your gasoline mixture than you should, it is pretty easy to trace that back to an issue with your distillation column and investigate what is causing more of the plastic precursor to stay lower in the column than normal. As for how those wizards in defect metrology figure anything out from EOL data, your guess is as good as mine.
-As you stated in 3, a process or tool parameter (for example the yield, reaction rate, or selectivity of a chemical reactor) running above or below acceptable limits can cause unscheduled downtime. As you also stated, if the trend is bad, it is often best to bite the bullet and fix the issue before it becomes a problem that gives you unsellable product. Finally, one specific piece of equipment running far from where your other tools run (for example a fermenter that is within control limits but has a much higher reaction rate than your other units) would warrant being taken offline to troubleshoot the issue and make sure you have a uniform product (after all, who wants 1 in 10 beers to be consistently different from the rest?).
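To make the control-limit idea concrete, here is a minimal sketch of flagging a parameter that drifts toward its limits; the setpoint, sigma, and readings are all invented.

```python
# Minimal sketch of flagging a process parameter that drifts toward
# its control limits: stop on a 3-sigma violation, warn on a 2-sigma
# excursion. Setpoint, sigma, and readings are hypothetical.

target, sigma = 100.0, 2.0                    # made-up setpoint and std dev
ucl, lcl = target + 3 * sigma, target - 3 * sigma

readings = [99.8, 100.4, 101.1, 102.0, 103.2, 104.5, 105.9, 107.1]

for i, x in enumerate(readings):
    if not (lcl <= x <= ucl):
        print(f"run {i}: {x} outside [{lcl}, {ucl}] -> stop tool, unscheduled down")
    elif x > target + 2 * sigma:
        print(f"run {i}: {x} trending high -> schedule intervention before it fails")
```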

Hope this is helpful!

Is this the part where we debate about the nature of the dog tool of Theseus? :ROFLMAO:
 
Sometimes a PM (preventive maintenance) can itself cause an unscheduled down. Workmanship is really important. You can have all the checklists and monitors in the world, but unfortunately Murphy's Law is in effect in the fab.
 
Component wear-out at end of life (EOL) is probably the next most likely cause, after the particle count and contamination excursions mentioned above.
 
Many good comments here. Fabs exist in a tension between wafer outs (the manufacturing side) and tool uptime (engineering or maintenance), with process and yield acting as referees of a kind. Some unscheduled downs are accepted since they can lead to profitable outcomes (more wafers produced), keeping in mind that revenue per wafer is itself a variable. To use an analogy: I can be proactive and take a tool down today, costing me 100 wafers, or I can take a chance (i.e., run to fail) and deal with the unscheduled event at some point in the future when my 100 wafers are worth less, sometimes nearly nothing if the fab is underloaded. Not naming names, but some factories in Taiwan are running at 50% loading, so they most certainly put no value on uptime. Run to fail it is, in that case.
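That tradeoff can be put in back-of-envelope numbers. Every figure below is hypothetical, but the structure shows why an underloaded fab rationally runs to fail:

```python
# Back-of-envelope version of the tradeoff described above: take the
# tool down now at a known wafer cost, or run to fail and pay later.
# Every number here is hypothetical.

wafers_lost_now = 100
wafer_value_now = 500.0        # $/wafer today (made up)

p_fail = 0.6                   # assumed chance the deferred failure happens
wafers_lost_later = 250        # unscheduled downs run longer, so cost more wafers
wafer_value_later = 50.0       # $/wafer when the fab is underloaded (made up)

cost_proactive = wafers_lost_now * wafer_value_now
cost_run_to_fail = p_fail * wafers_lost_later * wafer_value_later

print(f"take down today:  ${cost_proactive:,.0f}")
print(f"run to fail (EV): ${cost_run_to_fail:,.0f}")
# 50,000 vs 7,500 here: with cheap future wafers, run-to-fail wins,
# which is exactly the underloaded-fab case described above.
```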

Good companies will investigate failures and use second-sourcing programs and R&D to avoid such issues in the future. Others do not (aka Intel). Availability of qual time is a MAJOR obstacle to improving tool uptime via this path, at least in the USA.

It's probably easier to define what does not fail than what does. Static parts that see no temperature or flow (chemical, gaseous, electrical) never fail. Anything else can and will fail: chips, showerheads, gas lines, filters, fittings, PCBs, robots, bearings, the grease within bearings, etc.

As mentioned by @milesgehm, preventive maintenance can cause major problems. It's being phased out, slowly in some fabs and rapidly in others, and replaced by advanced analytics. The airline business does the same: monitor and analyze, but do not take down what works just because "we always do it that way."
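As a hedged sketch of what "monitor and analyze" could look like in its simplest form, here is a rolling-mean drift check on a hypothetical tool sensor; the signal, window, and limit are all invented.

```python
# Minimal sketch of analytics-driven maintenance: watch a tool sensor
# and only call for a down when its rolling mean drifts past a
# threshold. Signal, window, and limit are all hypothetical.

from collections import deque

limit, window = 1.15, 5                       # made-up drift limit and window
signal = [1.00, 1.01, 0.99, 1.02, 1.03, 1.06, 1.10, 1.14, 1.18, 1.22, 1.26]

recent = deque(maxlen=window)
for hour, x in enumerate(signal):
    recent.append(x)
    mean = sum(recent) / len(recent)
    if len(recent) == window and mean > limit:
        print(f"hour {hour}: rolling mean {mean:.2f} > {limit} -> schedule down")
        break
```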
 