Equipment Troubleshooting
Your equipment can't talk back, but your people can and should.
- By Andrew Marquardt
- Feb 01, 2006
INDUSTRY has done a very good job of determining how to fix equipment to keep
it up and running. And yet, how often do we look at a failed piece of equipment
or equipment that has never worked quite right and:
- blame the manufacturer for the problems
- tell people it has always worked that way (problem? what problem?)
- assume someone else will fix it (operations or maintenance)
- ask Mr. Machinery to fix it (the person who has been around for 30 years)
As we all know, the equipment can't talk back. We may not take the time to
conduct a thorough troubleshooting and failure analysis effort to determine the
actual path to failure and identify human performance issues and root causes of
our equipment problems.
By knowing how to determine the failure mode, failure agent, and failure
classification, the professional troubleshooter can collect the information
needed for a better root cause failure analysis and then identify the human
performance root causes that often underlie equipment problems. When the real
reasons for the equipment problems are known, the troubleshooter can recommend
and implement corrective actions that prevent similar problems and increase
mean time between failures.
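To see whether corrective actions really are increasing mean time between failures, the figure has to be calculated from the machinery history. The sketch below is one simple way to do that, assuming only the failure dates are recorded: it averages the calendar days between consecutive failures and ignores repair downtime. The function name and the dates are hypothetical.

```python
from datetime import date


def mean_days_between_failures(failure_dates):
    """Average number of calendar days between consecutive failures.

    Simplification (an assumption of this sketch): repair downtime is
    ignored, so this is the gap between failure dates, not running hours.
    """
    ordered = sorted(failure_dates)
    if len(ordered) < 2:
        raise ValueError("need at least two failures to measure a gap")
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical failure dates for one piece of equipment.
history = [date(2005, 1, 1), date(2005, 1, 31), date(2005, 4, 1)]
print(mean_days_between_failures(history))  # gaps of 30 and 60 days -> 45.0
```

Comparing this number before and after a corrective action gives a rough indication of whether the action helped.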
Machinery Troubleshooting Objectives and Traps
Most people recognize that the primary objective of machinery troubleshooting
is to prevent repeat incidents with the equipment and machinery. Although this
may sound obvious, I suggest we might not be as effective at accomplishing it
as we would like. One reason may be that we don't always take the time to
document our troubleshooting efforts. As a result, when we have a failure and
review the machinery history for the equipment or machinery, the entry looks
something like the following:
3-25-01--Compressor failed
3-26-01--Compressor online
Unfortunately, this does not indicate what was done during the
troubleshooting process and whether actions that were taken were effective or
ineffective. A process that makes the documentation easier while conducting the
troubleshooting is a desired goal. Additionally, if a decision is made to
purchase a new piece of equipment or upgrade an existing piece, it is beneficial
to have previous problems documented so the bid specification for the new or
upgraded piece of equipment can be written better. This ensures the new or
upgraded equipment will be better suited for the desired service.
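One way to make that documentation easier is to give each history entry a fixed shape, so the actions taken and their outcomes are captured as the work is done. The sketch below is only an illustration: the class and field names are hypothetical, the date reads "3-25-01" as March 25, 2001, and the action text is a placeholder because the original entry recorded none.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class HistoryEntry:
    """One machinery-history record (field names are illustrative)."""
    when: date
    equipment: str
    symptom: str                                   # what was observed
    actions: list = field(default_factory=list)    # (description, effective) pairs

    def add_action(self, description: str, effective: bool) -> None:
        """Record an action taken and whether it was effective."""
        self.actions.append((description, effective))

    def summary(self) -> str:
        """Return the entry as text, one line per action taken."""
        lines = [f"{self.when.isoformat()} {self.equipment}: {self.symptom}"]
        for description, effective in self.actions:
            outcome = "effective" if effective else "ineffective"
            lines.append(f"  - {description} ({outcome})")
        return "\n".join(lines)


entry = HistoryEntry(date(2001, 3, 25), "Compressor", "failed")
# Placeholder text: the troubleshooter would describe the real action here.
entry.add_action("<action taken>", effective=False)
print(entry.summary())
```

An entry like this still fits on a few lines, but it tells the next troubleshooter what was tried and whether it worked.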
Too often, common troubleshooting traps keep us from fully realizing our
troubleshooting potential. The first of these is "everyone's job and no one's
job," where the Operations group decides it is too
difficult for them to fix and the Maintenance group decides it is too minor for
them to get involved. The result is often just replacing the item or addressing
the symptoms only. The second trap is the "Mr. Machinery," where a facility may
rely on a certain individual who has been at that facility for a long time. This
is not necessarily adverse, but it can be a problem if this individual has not
kept up with current technology. The third trap is "telephone troubleshooting,"
where the troubleshooter attempts to solve the problem by interviewing people
over the phone and not taking the opportunity to look at the failed equipment or
machinery in person. The fourth and final trap is "familiarity," which can arise
if a person becomes very familiar with a piece of equipment or machinery and
begins to assume the current problem is the same as the last time, whether or
not in fact it is.
Systematic Troubleshooting Process
To troubleshoot effectively and avoid the traps described above, you should
follow a logical process. Such a process should determine "what" happened and
"why" it happened, and then develop effective fixes for the "why."
What happened
The "what" happened can be portrayed effectively by creating a sequence of
events chart. This gives those reviewing the incident a graphical presentation
of what happened. A common technique is to put the incident or problem in a
circle, the actions or events in boxes, and amplifying information, often
called conditions, in ovals.
Let's take a look at an incident that occurs with a cooling pump motor. This
pump has been in operation for over four years. Four months ago, it was taken
out of service and refurbished. Six weeks ago, it was greased as per the
manufacturer's recommendations. On the day of the incident, the pump motor began
to smoke, eventually catching fire. After disassembly of the motor, we find the
inboard bearing is burned and melted. The outboard bearing appears to be in good
condition. We can put this into a sequence of events chart and then begin to
troubleshoot using a combination of brainstorming and a cause-and-effect
technique.
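The cooling pump motor incident can be laid out as a sequence of events chart in a few lines of code. The sketch below is a minimal text rendering, not a charting tool: the class and function names are hypothetical, and the bracket notation (square brackets for boxes, single parentheses for ovals, double parentheses for the circle) is an assumption made for plain-text output.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """A box on the chart: an action or event, with optional conditions (ovals)."""
    text: str
    conditions: list = field(default_factory=list)


def render_chart(events, incident):
    """Return the chart as text: [box], (oval), and ((incident)) last."""
    lines = []
    for event in events:
        lines.append(f"[{event.text}]")
        for condition in event.conditions:
            lines.append(f"    ({condition})")
    lines.append(f"(({incident}))")
    return "\n".join(lines)


# Events and conditions taken from the cooling pump motor description.
events = [
    Event("Pump placed in operation", ["over four years ago"]),
    Event("Pump taken out of service and refurbished", ["four months ago"]),
    Event("Pump greased", ["six weeks ago", "per manufacturer's recommendations"]),
    Event("Pump motor begins to smoke", ["day of the incident"]),
]
chart = render_chart(events, "Pump motor catches fire")
print(chart)
```

Findings from the disassembly, such as the burned inboard bearing and the sound outboard bearing, could be attached as further conditions once they are confirmed.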
Why it happened
Option 1: If we are able to gather the right people, we should be able to
develop a reasonable list of possibilities for the burned-up bearing. Some of
these may include:
- lubrication problems,
- loading problems,
- misalignment problems,
- clearance problems, and
- friction problems.
Below each of these general categories, we could postulate additional
detailed possibilities. Specifically, under the lubrication category we could
list insufficient lubrication, over-lubrication, incorrect lubrication, etc. We
could continue to delve deeper under each of the general categories as the
group we have gathered continues to brainstorm based on members' collective
knowledge and experience. This, it is hoped, would give us at least several
possible root causes to examine in the Root Cause Failure Analysis of the
burned-up inboard journal bearing.
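The brainstormed categories and their details form a simple tree, and writing it down keeps the group from losing any branch. The sketch below records only what is named in the text: the five general categories and the three lubrication possibilities. The remaining branches are left empty for the group to fill in, and the function name is hypothetical.

```python
# General categories from the brainstorm; only "lubrication" has details
# named in the text, so the other branches start out empty.
cause_tree = {
    "lubrication": [
        "insufficient lubrication",
        "over-lubrication",
        "incorrect lubrication",
    ],
    "loading": [],
    "misalignment": [],
    "clearance": [],
    "friction": [],
}


def hypotheses(tree):
    """Flatten the tree into (category, possibility) pairs to investigate."""
    pairs = []
    for category, details in tree.items():
        if details:
            pairs.extend((category, detail) for detail in details)
        else:
            # Nothing brainstormed yet: keep the category as an open item.
            pairs.append((category, "to be detailed"))
    return pairs


for category, possibility in hypotheses(cause_tree):
    print(f"{category}: {possibility}")
```

Each pair is a possibility the group can confirm or rule out against the physical evidence.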
Option 2: Another approach is to gather the right people but also have the
group use pre-developed or existing checklists that list the most common
symptoms of bearing problems along with the possible causes of each symptom. An
advantage of well-developed checklists is that they counter people's tendency
to forget one or two symptoms or some of their possible causes. Also, if the
right people can't be gathered, you may have to rely on people with less
experience and knowledge. A checklist helps to overcome these problems. By
using checklists, you can ensure every troubleshooter relies on the best
available information, namely the information used to create the checklists.
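A checklist of this kind is essentially a lookup from each symptom to its possible causes. The sketch below shows the idea under loud assumptions: the two symptoms and their causes are placeholders invented for illustration, not a vetted bearing checklist, and the function name is hypothetical.

```python
# Hypothetical checklist: symptom -> possible causes. These entries are
# placeholders for illustration only, not a vetted bearing checklist.
BEARING_CHECKLIST = {
    "overheated bearing": ["insufficient lubrication", "over-lubrication", "misalignment"],
    "excessive vibration": ["misalignment", "clearance problems", "loading problems"],
}


def possible_causes(observed_symptoms, checklist):
    """Map each candidate cause to the observed symptoms that point to it.

    Symptoms missing from the checklist are returned separately so they
    are not silently dropped.
    """
    causes = {}
    unknown = []
    for symptom in observed_symptoms:
        if symptom not in checklist:
            unknown.append(symptom)
            continue
        for cause in checklist[symptom]:
            causes.setdefault(cause, []).append(symptom)
    return causes, unknown


causes, unknown = possible_causes(
    ["overheated bearing", "excessive vibration", "unusual noise"],
    BEARING_CHECKLIST,
)
for cause, symptoms in causes.items():
    print(f"{cause}: suggested by {', '.join(symptoms)}")
print("not on checklist:", unknown)
```

A cause suggested by more than one observed symptom is a natural place to start, and a symptom that is not on the checklist is a signal that the checklist itself needs updating.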
Path to failure
Additionally, if checklists aren't available, you can still use a process to
determine the "path to failure" and better understand the manner and conditions
that created the failure or the poorly performing equipment or machinery. The
first step is to determine the Failure Mode: the appearance, manner, or form in
which a machinery component or unit failure manifests itself. The general
categories of Failure Modes are Deformation, Fracture, Surface/Material
Changes, and Displacement. Below each general category we can list more
specific forms.
After the Failure Mode, we proceed to the Failure Agents. The Failure Agent
is the catalyst that allowed the Failure Mode to occur. The Failure Agents are
Force, Reactive Environment, Time, and Temperature. One of these will be the
primary agent, which often creates the secondary and tertiary Failure Agents
that are exhibited. It is similar to asking "why" as in Option 1 above, only
now we have a more structured and systematic process that anyone can use and
document.
After the Failure Agent, we proceed to determine the Failure Classification.
This is where we determine whether the issue is strictly an Equipment
Difficulty or whether a Human Performance Difficulty may be associated with it.
Too often, troubleshooters stop at this point and miss a valuable opportunity
to take their troubleshooting to another level. This is where we involve the
"people talking back" part of the investigation.
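The three steps of the path to failure can be captured in one small record, which also makes the result easy to document. The sketch below uses the general categories named above as fixed choices. The class names, field names, and the example values chosen for the burned inboard bearing are illustrative assumptions, not findings from an actual analysis.

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    DEFORMATION = "Deformation"
    FRACTURE = "Fracture"
    SURFACE_MATERIAL_CHANGES = "Surface/Material Changes"
    DISPLACEMENT = "Displacement"


class FailureAgent(Enum):
    FORCE = "Force"
    REACTIVE_ENVIRONMENT = "Reactive Environment"
    TIME = "Time"
    TEMPERATURE = "Temperature"


class FailureClassification(Enum):
    EQUIPMENT_DIFFICULTY = "Equipment Difficulty"
    HUMAN_PERFORMANCE_DIFFICULTY = "Human Performance Difficulty"


@dataclass
class PathToFailure:
    """Failure Mode, Failure Agents, and Failure Classification for one component."""
    component: str
    mode: FailureMode
    primary_agent: FailureAgent
    secondary_agents: tuple
    classification: FailureClassification

    def needs_interviews(self) -> bool:
        """True when human performance is involved and people must be interviewed."""
        return self.classification is FailureClassification.HUMAN_PERFORMANCE_DIFFICULTY


# Example values are assumptions chosen for illustration only.
path = PathToFailure(
    component="inboard bearing",
    mode=FailureMode.SURFACE_MATERIAL_CHANGES,
    primary_agent=FailureAgent.TEMPERATURE,
    secondary_agents=(FailureAgent.TIME,),
    classification=FailureClassification.HUMAN_PERFORMANCE_DIFFICULTY,
)
print(path.mode.value, "|", path.primary_agent.value, "|", path.classification.value)
print("interviews needed:", path.needs_interviews())
```

Restricting each step to a fixed set of choices is what lets anyone follow the process and record it the same way.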
Human Performance Difficulty
If human performance was involved in the equipment issue, the troubleshooter
needs to step out of the equipment analysis role and set up interviews with
people who have interacted with the equipment or machinery in question.
To begin finding the human performance root causes, we can use the Root Cause
Tree, which lets us get at the human performance "why" part of the
investigation. The process starts by answering the 15 questions on the front of
the tree; the answers help the troubleshooter determine which basic cause
categories apply to the equipment issue being investigated and which do not.
Once the 15 questions have been answered yes or no, the troubleshooter turns to
the back of the tree and analyzes the basic cause categories identified by the
questions. Under each basic cause category are root causes the troubleshooter
should evaluate to determine whether they apply to the equipment issue. For
example, a person performing maintenance may have followed a procedure that was
not specific enough, had to interpret its intent, and interpreted it
incorrectly, causing the equipment to fail or not work properly.
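The screening step, in which yes or no answers select the basic cause categories to analyze, can be sketched as a lookup. The mapping below is entirely hypothetical: the actual 15 questions and the categories they point to are not reproduced here, so numbered questions and placeholder category names stand in for them.

```python
# Hypothetical mapping from question number to basic cause categories.
# The real questions and categories come from the Root Cause Tree itself.
QUESTION_TO_CATEGORIES = {
    q: [f"category-{(q - 1) % 5 + 1}"] for q in range(1, 16)
}


def applicable_categories(answers, question_to_categories):
    """Return the categories selected by "yes" answers, without duplicates.

    `answers` maps each question number to True (yes) or False (no);
    every question must be answered before the categories are analyzed.
    """
    if set(answers) != set(question_to_categories):
        raise ValueError("every question must be answered yes or no")
    selected = []
    for question in sorted(answers):
        if answers[question]:
            for category in question_to_categories[question]:
                if category not in selected:
                    selected.append(category)
    return selected


answers = {q: False for q in range(1, 16)}
answers[3] = answers[4] = answers[8] = True   # hypothetical interview results
print(applicable_categories(answers, QUESTION_TO_CATEGORIES))
```

Only the categories returned here would be carried to the back of the tree, where their root causes are evaluated one by one.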
To adequately answer the 15 questions and then identify root causes from the
Root Cause Tree, the troubleshooter needs to conduct interviews with the
appropriate people. The 15 questions and the root causes on the tree are
designed to minimize, if not eliminate, the "blame" syndrome found at many
companies. This does not mean we shouldn't hold people accountable for their
actions, only that we need to look at our systems first to ensure they are
providing people with the tools they need to be successful in their jobs.
Once we are confident our systems are in order, we are better justified in
applying whatever discipline the situation warrants.
This column appeared in the February 2006 issue of Occupational Health
& Safety.