Equipment Troubleshooting -- Occupational Health & Safety

Equipment Troubleshooting

Your equipment can't talk back, but your people can and should.

By Andrew Marquardt
Feb 01, 2006

INDUSTRY has done a very good job of determining how to fix equipment and keep it up and running. And yet, how often do we look at a failed piece of equipment, or equipment that has never worked quite right, and:

blame the manufacturer for the problems
tell people it has always worked that way (problem? what problem?)
assume someone else will fix it (operations or maintenance)
ask Mr. Machinery to fix it (the person who has been around for 30 years)

As we all know, the equipment can't talk back. And we may not take the time to conduct a thorough troubleshooting and failure analysis effort that determines the actual path to failure and identifies the human performance issues and root causes behind our equipment problems.

By knowing how to determine the failure mode, failure agent, and failure classification, the professional troubleshooter can do a better job of collecting the information needed for root cause failure analysis and, in turn, of identifying the human performance root causes that often underlie equipment problems. When the real reasons for the equipment problems are known, the professional troubleshooter can recommend and implement corrective actions that prevent similar problems from recurring and increase the mean time between failures.

Machinery Troubleshooting Objectives and Traps
Most people recognize that the primary objective of machinery troubleshooting is to prevent repeat incidents with the equipment and machinery. Although this may sound obvious, I suggest we might not be as effective at accomplishing it as we would like. One reason may be that we don't always take the time to document our troubleshooting efforts. As a result, when we have a failure and review the history for the equipment or machinery, the entry looks something like the following:

3-25-01--Compressor failed

3-26-01--Compressor online

Unfortunately, this does not indicate what was done during the troubleshooting process or whether the actions taken were effective. A process that makes documentation easier while the troubleshooting is being conducted is a desirable goal. Additionally, if a decision is made to purchase a new piece of equipment or upgrade an existing piece, it is beneficial to have previous problems documented so the bid specification can be written better. This helps ensure the new or upgraded equipment will be better suited for the desired service.
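One way to make that documentation easier is to give every history entry a fixed shape that asks for findings and actions, not just dates. The following is a minimal sketch under stated assumptions: the class and field names are hypothetical, "3-25-01" is assumed to mean March 25, 2001, and the filled-in details of the second entry are invented purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Action:
    description: str  # what was tried during troubleshooting
    effective: bool   # whether it addressed the problem


@dataclass
class HistoryEntry:
    equipment: str
    failed_on: date
    online_on: date
    findings: str = ""                           # what the troubleshooter observed
    actions: list = field(default_factory=list)  # Action records, in order

    def downtime_days(self) -> int:
        return (self.online_on - self.failed_on).days

    def is_useful(self) -> bool:
        """An entry helps the next troubleshooter only if it records
        what was found and what was tried."""
        return bool(self.findings) and bool(self.actions)


# The entry from the text: dates only (date format is an assumption).
bare = HistoryEntry("Compressor", date(2001, 3, 25), date(2001, 3, 26))

# The same outage with invented, purely illustrative details filled in.
documented = HistoryEntry(
    "Compressor", date(2001, 3, 25), date(2001, 3, 26),
    findings="(hypothetical) high discharge temperature before trip",
    actions=[
        Action("(hypothetical) reset and restarted", effective=False),
        Action("(hypothetical) replaced fouled cooler element", effective=True),
    ],
)

print(bare.is_useful(), documented.is_useful())
```

The bare entry reports one day of downtime and nothing else, so `is_useful()` is false for it; the documented entry also records which action worked and which did not.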

Too often, common troubleshooting traps prevent us from fully realizing our troubleshooting potential. The first of these is "everyone's job and no one's job," where the Operations group decides the problem is too difficult for it to fix and the Maintenance group decides it is too minor to get involved with. The result is often just replacing the item or addressing only the symptoms. The second trap is "Mr. Machinery," where a facility relies on a certain individual who has been there for a long time. This is not necessarily bad, but it can be a problem if this individual has not kept up with current technology. The third trap is "telephone troubleshooting," where the troubleshooter attempts to solve the problem by interviewing people over the phone without taking the opportunity to look at the failed equipment or machinery in person. The fourth and final trap is "familiarity," which can arise when a person becomes very familiar with a piece of equipment or machinery and begins to assume the current problem is the same as the last one, whether or not it is.

Systematic Troubleshooting Process
To conduct troubleshooting effectively and avoid the traps described above, you should follow a logical process. Such a process should determine "what" happened and "why" it happened, and then develop effective fixes that address the "why."

What happened
The "what" happened can be portrayed effectively by creating a sequence of events chart, which gives those reviewing the incident a graphical presentation of what happened. A common technique is to put the incident or problem in a circle, the actions or events in boxes, and amplifying information, often called conditions, in ovals.

Let's take a look at an incident involving a cooling pump motor. This pump has been in operation for over four years. Four months ago, it was taken out of service and refurbished. Six weeks ago, it was greased per the manufacturer's recommendations. On the day of the incident, the pump motor began to smoke, eventually catching fire. After disassembling the motor, we find the inboard bearing is burned and melted. The outboard bearing appears to be in good condition. We can put this into a sequence of events chart and then begin to troubleshoot using a combination of brainstorming and a cause-and-effect technique.
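The pump motor incident can be captured as a small data structure before anyone draws the chart. This sketch is only one possible layout: the `Node` class and `render` function are hypothetical names, and the text rendering (square brackets for boxes, double parentheses for the circle, single parentheses for ovals) is an assumed convention standing in for the graphical chart.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    kind: str = "event"                             # "event" (box) or "incident" (circle)
    conditions: list = field(default_factory=list)  # amplifying information (ovals)


def render(nodes):
    """Return a plain-text version of the chart, in time order."""
    lines = []
    for node in nodes:
        if node.kind == "incident":
            lines.append(f"(( {node.text} ))")
        else:
            lines.append(f"[ {node.text} ]")
        for cond in node.conditions:
            lines.append(f"    ( {cond} )")
    return "\n".join(lines)


# Facts taken from the pump motor example; the wording of each node is ours.
chart = [
    Node("Pump placed in operation",
         conditions=["in service for over four years"]),
    Node("Pump taken out of service and refurbished",
         conditions=["four months before the incident"]),
    Node("Pump greased",
         conditions=["six weeks before the incident",
                     "per manufacturer's recommendations"]),
    Node("Pump motor smokes and catches fire", kind="incident"),
    Node("Motor disassembled",
         conditions=["inboard bearing burned and melted",
                     "outboard bearing appears to be in good condition"]),
]

print(render(chart))
```

Keeping the conditions attached to the event they qualify makes it easier to see later which facts each "why" question has to explain.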

Why it happened
Option 1: If we are able to gather the right people, we should be able to develop a reasonable list of possibilities for the burned-up bearing. Some of these may include:

lubrication problems,
loading problems,
misalignment problems,
clearance problems, and
friction problems.

Below each of these general categories, we could postulate more detailed possibilities. Under the lubrication category, for example, we could list insufficient lubrication, over-lubrication, incorrect lubrication, etc. We could continue to delve deeper under each general category as the group brainstorms based on its members' collective knowledge and experience. This, it is hoped, would give the Root Cause Failure Analysis at least several different possibilities to examine for the failed, burned-up inboard journal bearing.
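The brainstormed categories and their sub-possibilities form a simple tree, and writing it down as one makes it obvious which branches the group has not yet explored. The sketch below uses only the categories and lubrication details named above; the function names are hypothetical.

```python
# General categories from the brainstorm; only lubrication has been
# broken down so far, so the other branches are still empty.
cause_tree = {
    "lubrication": [
        "insufficient lubrication",
        "over-lubrication",
        "incorrect lubrication",
    ],
    "loading": [],
    "misalignment": [],
    "clearance": [],
    "friction": [],
}


def list_possibilities(tree):
    """Flatten the tree into (category, detail) pairs.

    A category with no details yet yields (category, None) so it is
    not silently dropped from the list.
    """
    pairs = []
    for category, details in tree.items():
        if details:
            pairs.extend((category, detail) for detail in details)
        else:
            pairs.append((category, None))
    return pairs


def unexplored(tree):
    """Categories the group still needs to brainstorm under."""
    return [category for category, details in tree.items() if not details]


for category, detail in list_possibilities(cause_tree):
    print(category, "->", detail or "(needs brainstorming)")
```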

Option 2: Another approach is to gather the right people and also have the group use pre-developed or existing checklists that list the most common symptoms of bearing problems along with the possible causes of each symptom. An advantage of well-developed checklists is that they counter people's tendency to forget one or two symptoms or possible causes. Also, if the right people can't be gathered, you may have to rely on people with less experience and knowledge; a checklist helps to overcome this problem as well. By using checklists, you can ensure every troubleshooter is relying on the best available information that went into creating them.
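A checklist of this kind is essentially a lookup from symptom to possible causes. The sketch below shows the lookup only; the symptom and cause entries are illustrative placeholders and not a vetted bearing checklist, and `CHECKLIST` and `candidate_causes` are hypothetical names.

```python
from collections import Counter

# Placeholder checklist (symptom -> possible causes). A real checklist
# would come from the manufacturer or the facility's own experience.
CHECKLIST = {
    "bearing overheating": [
        "insufficient lubrication", "over-lubrication", "misalignment",
    ],
    "excessive vibration": ["misalignment", "excessive clearance"],
    "unusual noise": ["insufficient lubrication", "excessive clearance"],
}


def candidate_causes(observed_symptoms):
    """Rank possible causes by how many observed symptoms list them.

    Ties are broken alphabetically so the result is stable.
    """
    counts = Counter()
    for symptom in observed_symptoms:
        for cause in CHECKLIST.get(symptom, []):
            counts[cause] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def unlisted(observed_symptoms):
    """Symptoms the checklist does not cover and someone must follow up on."""
    return [s for s in observed_symptoms if s not in CHECKLIST]


print(candidate_causes(["bearing overheating", "unusual noise"]))
```

The ranking is only a prompt for where to look first; a cause that matches fewer symptoms is still a possibility to rule out.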

Path to failure
Additionally, if checklists aren't available, you can still use a process to determine the "path to failure" and better understand the manner and conditions that produced the failure or the poorly performing equipment or machinery. The first step is to determine the Failure Mode: the appearance, manner, or form in which a machinery component or unit failure manifests itself. The general categories of Failure Modes are Deformation, Fracture, Surface/Material Changes, and Displacement. Below each of the general categories we can list more specific forms.

After the Failure Mode, we proceed to the Failure Agents. The Failure Agent is the catalyst that allowed the Failure Mode to occur. The Failure Agents are Force, Reactive Environment, Time, and Temperature. One of these will be the primary agent, which often gives rise to the secondary and tertiary Failure Agents that are also exhibited. This is similar to asking "why" as in Option 1 above, only now we have a more structured and systematic process that anyone can use and document.

After the Failure Agent, we proceed to determine the Failure Classification. This is where we determine whether the issue is strictly an Equipment Difficulty or whether there may be a Human Performance Difficulty associated with it. Too often, troubleshooters stop at this point and miss a valuable opportunity to take their troubleshooting to another level. This is where the "people talking back" part of the investigation comes in.
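The three steps of the path to failure (mode, agent, classification) lend themselves to a fixed record that any troubleshooter can fill in and file. In the sketch below the category names come from the text, while the class names are hypothetical and the values assigned to the pump motor are an assumed illustration, not a conclusion drawn in this column.

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    DEFORMATION = "Deformation"
    FRACTURE = "Fracture"
    SURFACE_MATERIAL_CHANGES = "Surface/Material Changes"
    DISPLACEMENT = "Displacement"


class FailureAgent(Enum):
    FORCE = "Force"
    REACTIVE_ENVIRONMENT = "Reactive Environment"
    TIME = "Time"
    TEMPERATURE = "Temperature"


class FailureClassification(Enum):
    EQUIPMENT_DIFFICULTY = "Equipment Difficulty"
    HUMAN_PERFORMANCE_DIFFICULTY = "Human Performance Difficulty"


@dataclass
class PathToFailure:
    mode: FailureMode
    primary_agent: FailureAgent
    classification: FailureClassification
    other_agents: tuple = ()  # secondary and tertiary agents, in order

    def needs_interviews(self) -> bool:
        """Interviews follow only when human performance is involved."""
        return (self.classification
                is FailureClassification.HUMAN_PERFORMANCE_DIFFICULTY)


# Assumed example values for the burned inboard bearing.
pump_bearing = PathToFailure(
    mode=FailureMode.SURFACE_MATERIAL_CHANGES,
    primary_agent=FailureAgent.TEMPERATURE,
    classification=FailureClassification.HUMAN_PERFORMANCE_DIFFICULTY,
    other_agents=(FailureAgent.FORCE,),
)

print(pump_bearing.needs_interviews())
```

Using enumerations rather than free text keeps every record within the four modes and four agents, which makes past failures comparable.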

Human Performance Difficulty
Once human performance is determined to be involved in the equipment issue, the troubleshooter needs to step out of the equipment analysis role and set up interviews with people who have interacted with the equipment or machinery in question.

To begin finding the human performance root causes, we can use the Root Cause Tree, which lets us get at the human performance "why" part of the investigation. The process starts by answering the 15 questions on the front of the tree; these help the troubleshooter determine which of the basic cause categories apply to the equipment issue being investigated and which do not. Once the 15 questions have been answered yes or no, the troubleshooter turns to the back of the tree and analyzes the basic cause categories the questions identified. Under each basic cause category are root causes the troubleshooter should evaluate to determine whether they apply to the equipment issue. For example, a person may have been performing maintenance with a procedure that was not specific enough, so the maintenance person had to interpret the procedure's intent, interpreted it incorrectly, and caused the equipment to fail or not work properly.
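The two-sided use of the tree is a two-stage filter: yes/no answers select basic cause categories, and each selected category supplies root causes to evaluate. The sketch below shows only that mechanism. The actual 15 questions and category names are not reproduced in this column, so the question IDs, the "Procedures" and "Category B" labels, and the assumption that a "yes" answer selects a category are all placeholders.

```python
# Placeholder front of the tree: question ID -> categories it points to.
QUESTIONS = {
    "Q1": ["Procedures"],
    "Q2": ["Category B"],
    "Q3": ["Procedures", "Category B"],
}

# Placeholder back of the tree: category -> root causes to evaluate.
# "procedure not specific enough" is the example given in the text.
ROOT_CAUSES = {
    "Procedures": ["procedure not specific enough"],
    "Category B": ["placeholder root cause"],
}


def applicable_categories(answers):
    """Return categories selected by 'yes' answers, without duplicates,
    in the order they are first selected."""
    selected = []
    for question, is_yes in answers.items():
        if is_yes:
            for category in QUESTIONS[question]:
                if category not in selected:
                    selected.append(category)
    return selected


def root_causes_to_evaluate(answers):
    """Map each applicable category to the root causes listed under it."""
    return {cat: ROOT_CAUSES[cat] for cat in applicable_categories(answers)}


answers = {"Q1": True, "Q2": False, "Q3": False}
print(root_causes_to_evaluate(answers))
```

Categories that no answer selects never reach the second stage, which is how the questions keep the troubleshooter from evaluating root causes that do not apply.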

To adequately answer the 15 questions and then identify root causes from the Root Cause Tree, the troubleshooter needs to conduct interviews with the appropriate people. The 15 questions and the root causes on the tree are designed to minimize, if not eliminate, the "blame" syndrome found at many companies. This does not mean we shouldn't hold people accountable for their actions, only that we need to look at our systems first to ensure they are providing people with the tools they need to be successful in their jobs.

Once we are confident our systems are in order, we are better justified in applying whatever discipline the situation warrants.

This column appeared in the February 2006 issue of Occupational Health & Safety.

