Equipment Troubleshooting

Your equipment can't talk back, but your people can and should.

Industry has done a very good job of determining how to fix equipment and keep it up and running. And yet, how often do we look at a failed piece of equipment, or equipment that has never worked quite right, and:

  • blame the manufacturer for the problems
  • tell people it has always worked that way (problem? what problem?)
  • assume someone else will fix it (operations or maintenance)
  • ask Mr. Machinery to fix it (the person who has been around for 30 years)

As we all know, the equipment can't talk back. We may not take the time to conduct a thorough troubleshooting and failure analysis effort that determines the actual path to failure and identifies the human performance issues and root causes behind our equipment problems.

By knowing how to determine the failure mode, failure agent, and failure classification, the professional troubleshooter can do a better job of collecting the information needed for root cause failure analysis. That analysis, in turn, helps identify the real human performance root causes that often underlie equipment problems. When the real reasons for the equipment problems are known, the professional troubleshooter can recommend and implement corrective actions that prevent similar problems and increase mean time between failures.
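Mean time between failures is simple arithmetic: total operating time divided by the number of failures in that period. The sketch below only illustrates the calculation; the function name and the sample figures are assumptions, not data from any real machine.

```python
def mean_time_between_failures(operating_hours: float, failure_count: int) -> float:
    """Return MTBF in hours: total operating time divided by number of failures."""
    if failure_count <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    return operating_hours / failure_count


# Hypothetical figures: the same machine over equal operating periods,
# before and after a corrective action that removed a recurring failure.
before = mean_time_between_failures(operating_hours=8000, failure_count=4)
after = mean_time_between_failures(operating_hours=8000, failure_count=1)
print(before, after)  # 2000.0 8000.0
```

Fewer failures over the same operating hours shows up directly as a higher figure, which makes it a convenient way to check whether a corrective action actually worked.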

Machinery Troubleshooting Objectives and Traps
Most people recognize that the primary objective of machinery troubleshooting is to prevent repeat incidents. Although this may sound obvious, I suggest we might not be as effective at accomplishing it as we would like. One reason may be that we don't always take the time to document our troubleshooting efforts. As a result, when we have a failure and review the machinery history for the equipment, the entry looks something like the following:

3-25-01--Compressor failed

3-26-01--Compressor online

Unfortunately, this does not indicate what was done during the troubleshooting process or whether the actions taken were effective. A process that makes documentation easier while the troubleshooting is under way is therefore a worthwhile goal. Additionally, if a decision is made to purchase a new piece of equipment or upgrade an existing piece, documented previous problems allow the bid specification to be written better. This helps ensure the new or upgraded equipment will be better suited for the desired service.
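One way to make that documentation easier is to give each history entry fixed fields for what was done and whether it worked. The following is a minimal sketch, not a prescribed format: the class name, field names, and the actions in the second entry are all invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class HistoryEntry:
    """One machinery-history record (all field names are hypothetical)."""
    when: date
    equipment: str
    symptom: str
    actions_taken: List[str] = field(default_factory=list)
    effective: Optional[bool] = None  # None means the fix has not been verified

    def summary(self) -> str:
        outcome = {True: "effective", False: "ineffective", None: "not yet verified"}[self.effective]
        actions = "; ".join(self.actions_taken) or "no actions recorded"
        return f"{self.when.isoformat()} | {self.equipment} | {self.symptom} | {actions} | {outcome}"


# The bare entry from the history above: the gaps are now visible.
bare = HistoryEntry(date(2001, 3, 25), "Compressor", "failed")
print(bare.summary())

# A fuller entry; the actions and outcome are invented for illustration.
fuller = HistoryEntry(
    when=date(2001, 3, 25),
    equipment="Compressor",
    symptom="failed",
    actions_taken=["inspected unit", "replaced failed component"],
    effective=True,
)
print(fuller.summary())
```

Because missing information prints as "no actions recorded" and "not yet verified," an incomplete entry is obvious to whoever reviews the history later.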

Too often, some common troubleshooting traps keep us from fully realizing our troubleshooting potential:

  • "Everyone's job and no one's job." The Operations group decides the problem is too difficult for them to fix, and the Maintenance group decides it is too minor for them to get involved. The result is often just replacing the item or addressing only the symptoms.
  • "Mr. Machinery." A facility relies on a certain individual who has been at that facility for a long time. This is not necessarily a bad thing, but it can be a problem if this individual has not kept up with current technology.
  • "Telephone troubleshooting." The troubleshooter attempts to solve the problem by interviewing people over the phone and does not take the opportunity to look at the failed equipment or machinery in person.
  • "Familiarity." A person becomes very familiar with a piece of equipment or machinery and begins to assume the current problem is the same as the last one, whether or not it is.

Systematic Troubleshooting Process
To troubleshoot effectively and avoid the traps described above, you should follow a logical process. A logical process should determine "what" happened and "why" it happened, and then develop effective fixes for the "why."

What happened
The "what" happened can be portrayed effectively with a sequence of events chart, which gives those reviewing the incident a graphical presentation of what happened. A common technique is to put the incident or problem in a circle, the actions or events in boxes, and amplifying information, often called conditions, in ovals.

Let's take a look at an incident involving a cooling pump motor. The pump had been in operation for over four years. Four months before the incident, it was taken out of service and refurbished. Six weeks before the incident, it was greased as per the manufacturer's recommendations. On the day of the incident, the pump motor began to smoke, eventually catching fire. After disassembling the motor, we found the inboard bearing burned and melted. The outboard bearing appeared to be in good condition. We can put this into a sequence of events chart and then begin to troubleshoot using a combination of brainstorming and a cause-and-effect technique.
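The pump incident can be laid out as a simple text version of a sequence of events chart. This is only a sketch: the class and function names are assumptions, brackets stand in for boxes, parentheses for the circle, and "~" lines for the ovals. Treating the fire as the incident is also a choice made here for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """A box on the chart, with optional conditions (the ovals)."""
    description: str
    conditions: List[str] = field(default_factory=list)
    is_incident: bool = False  # True marks the circle on the chart


def render(chart: List[Event]) -> str:
    """Render events in order, one per line, with conditions indented below."""
    lines = []
    for event in chart:
        shape = "( {} )" if event.is_incident else "[ {} ]"
        lines.append(shape.format(event.description))
        for condition in event.conditions:
            lines.append("    ~ " + condition)
    return "\n".join(lines)


chart = [
    Event("Pump refurbished", ["Four months before the incident",
                               "In operation for over four years"]),
    Event("Pump greased", ["Six weeks before the incident",
                           "Per manufacturer's recommendations"]),
    Event("Pump motor begins to smoke"),
    Event("Pump motor catches fire", is_incident=True),
    Event("Motor disassembled", ["Inboard bearing burned and melted",
                                 "Outboard bearing appears to be in good condition"]),
]
print(render(chart))
```

Keeping conditions attached to the event they describe makes it easier to see which facts are still missing before the "why" questions begin.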

Why it happened
Option 1: If we are able to gather the right people, we should be able to develop a reasonable list of possibilities for the burned-up bearing. Some of these may include:

  • lubrication problems,
  • loading problems,
  • misalignment problems,
  • clearance problems, and
  • friction problems.

Below each of these general categories, we could postulate more detailed possibilities. Under the lubrication category, for example, we could list insufficient lubrication, over lubrication, incorrect lubrication, etc. We could continue to delve deeper under each of the general categories as the group continues to brainstorm based on its members' collective knowledge and experience. Ideally, this gets us to several different possible causes to pursue in the root cause failure analysis of the burned-up inboard journal bearing.

Option 2: Another approach is to gather the right people and also give the group pre-developed or existing checklists that list the most common symptoms of bearing problems along with the possible causes of each symptom. An advantage of well-developed checklists is that they counter people's tendency to forget one or two symptoms or possible causes. Also, if the right people can't be gathered, you may have to rely on people with less experience and knowledge, and a checklist helps to overcome that problem. By using checklists, you can ensure every troubleshooter is relying on the best information that was available when the checklists were created.
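A checklist of this kind is essentially a lookup from symptom to possible causes. The sketch below shows the idea under stated assumptions: the single entry simply reuses the brainstormed categories from Option 1 and is not a vetted bearing checklist, and the names are hypothetical.

```python
from typing import Dict, List

# Illustrative only: this entry reuses the brainstormed categories from the
# text and is not a vetted bearing checklist.
BEARING_CHECKLIST: Dict[str, List[str]] = {
    "bearing burned or melted": [
        "insufficient lubrication",
        "over lubrication",
        "incorrect lubrication",
        "loading problem",
        "misalignment",
        "clearance problem",
        "friction problem",
    ],
}


def possible_causes(symptoms: List[str],
                    checklist: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return the listed causes for each observed symptom.

    Symptoms that are not on the checklist map to an empty list, which flags
    them for the group to brainstorm instead.
    """
    return {s: list(checklist.get(s.lower(), [])) for s in symptoms}


observed = ["Bearing burned or melted", "Motor smoking"]
for symptom, causes in possible_causes(observed, BEARING_CHECKLIST).items():
    print(symptom, "->", causes or "not on checklist")
```

Every troubleshooter who runs the same symptoms through the same checklist starts from the same list of candidates, regardless of experience.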

Path to failure
If checklists aren't available, you can still use a process to determine the "path to failure" and better understand the manner and conditions that produced the failed or poorly performing equipment or machinery. The first step is to determine the Failure Mode. This is the appearance, manner, or form in which a machinery component or unit failure manifests itself. The general categories of Failure Modes are Deformation, Fracture, Surface/Material Changes, and Displacement. Below each of the general categories we can list more specific forms.

After the Failure Mode, we proceed to the Failure Agents. The Failure Agent is the catalyst that allowed the Failure Mode to occur. The Failure Agents are Force, Reactive Environment, Time, and Temperature. One of these will be the primary agent, which often gives rise to secondary and tertiary Failure Agents. It is similar to asking "why" as in Option 1 above, only now we have a more structured and systematic process that anyone can use and document.

After the Failure Agent, we proceed to determine the Failure Classification. This is where we determine whether the issue is strictly an Equipment Difficulty or whether there may be a Human Performance Difficulty associated with it. Too often, troubleshooters stop at this point and miss a valuable opportunity to take their troubleshooting to another level. This is where we involve the "people talking back" part of the investigation.
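The three steps of the path to failure lend themselves to a record that anyone can fill in and file. The sketch below is one possible layout, with the categories taken from the text; the class and method names are assumptions, and the values chosen for the bearing are placeholders for illustration, not findings.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class FailureMode(Enum):
    DEFORMATION = "Deformation"
    FRACTURE = "Fracture"
    SURFACE_MATERIAL_CHANGE = "Surface/Material Changes"
    DISPLACEMENT = "Displacement"


class FailureAgent(Enum):
    FORCE = "Force"
    REACTIVE_ENVIRONMENT = "Reactive Environment"
    TIME = "Time"
    TEMPERATURE = "Temperature"


class FailureClassification(Enum):
    EQUIPMENT_DIFFICULTY = "Equipment Difficulty"
    HUMAN_PERFORMANCE_DIFFICULTY = "Human Performance Difficulty"


@dataclass
class PathToFailure:
    component: str
    mode: FailureMode
    primary_agent: FailureAgent
    classification: FailureClassification
    secondary_agents: List[FailureAgent] = field(default_factory=list)

    def needs_interviews(self) -> bool:
        """Interviews follow only when human performance is involved."""
        return self.classification is FailureClassification.HUMAN_PERFORMANCE_DIFFICULTY


# Placeholder values for illustration, not a conclusion about the incident.
bearing = PathToFailure(
    component="Inboard bearing",
    mode=FailureMode.SURFACE_MATERIAL_CHANGE,
    primary_agent=FailureAgent.TEMPERATURE,
    classification=FailureClassification.HUMAN_PERFORMANCE_DIFFICULTY,
    secondary_agents=[FailureAgent.FORCE],
)
print(bearing.mode.value, "|", bearing.primary_agent.value, "|", bearing.needs_interviews())
```

Restricting each step to a fixed set of choices keeps the record consistent from one troubleshooter to the next and makes the decision to start interviews explicit.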

Human Performance Difficulty
If the troubleshooter determines that human performance was involved in the equipment issue, he or she will need to step out of the equipment analysis role and set up interviews with people who have interacted with the equipment or machinery in question.

To begin finding the human performance root causes, we can use the Root Cause Tree, which lets us get at the human performance "why" part of the investigation. The process starts by answering the 15 questions on the front of the tree. These help the troubleshooter determine which of the basic cause categories apply to the equipment issue being investigated and which do not. Once the 15 questions have been answered yes or no, the troubleshooter turns to the back of the tree and analyzes the basic cause categories the questions identified. Under each basic cause category are root causes the troubleshooter should evaluate to determine whether they apply to the equipment issue. An example might be a person performing maintenance with a procedure that was not specific enough. The maintenance person had to interpret the intent of the procedure and interpreted it incorrectly, causing the equipment to fail or not work properly.

To adequately answer the 15 questions and then identify root causes from the Root Cause Tree, the troubleshooter needs to conduct interviews with the appropriate people. The 15 questions and the root causes on the tree are designed to minimize, if not eliminate, the "blame" syndrome at many companies. This does not mean we shouldn't hold people accountable for their actions, only that we need to look at our systems first to ensure they are providing people with the tools they need to be successful in their jobs.

Once we are confident our systems are in order, we are better justified in applying whatever discipline the situation warrants.

This column appeared in the February 2006 issue of Occupational Health & Safety.

