Predictive maintenance: from prediction to decision making

May 4, 2022

At ENGIE Digital, we decided to develop a platform that performs predictive maintenance to help Operation & Maintenance (O&M) technicians optimize their interventions: to limit unnecessary maintenance and anticipate operations before the failure or loss of efficiency of the assets.

For this purpose, we built a platform that manages thousands of machine learning models: training, predicting, labeling, retraining, and simulating. In this article we explain how we make the predictions from these models usable for the O&M technician, who is the end user.

Use case: predictive maintenance for industrial assets

The machine learning models deployed for predictive maintenance on industrial assets in our platform have two functions: detecting anomalies and forecasting drifts. To achieve these objectives, the models learn the historical behavior of the data during “healthy” periods, when assets are in good shape. From these models, and on the basis of newly collected data, we infer whether there is a significant gap from the expected behavior. The underlying assumption is that maintenance is required to return the asset to a standard operating mode.

Graph of an anomaly detection for a boiler — *Anomaly detection validated on a time series for a boiler*
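To make the idea of learning a “healthy” period and measuring the gap concrete, here is a minimal sketch. It assumes a deliberately simple baseline (mean and standard deviation of one sensor) and a hypothetical threshold `k`; the function names and sample values are illustrative, and the platform's actual models are not described in this article.

```python
from statistics import mean, stdev

def fit_healthy_baseline(healthy_values):
    """Summarize the healthy period by its mean and sample standard deviation."""
    return mean(healthy_values), stdev(healthy_values)

def flag_anomalies(new_values, baseline, k=3.0):
    """Return one boolean per new point: True when the point lies
    more than k standard deviations away from the healthy mean."""
    mu, sigma = baseline
    return [abs(value - mu) > k * sigma for value in new_values]

# Illustrative boiler temperatures recorded while the asset was healthy.
healthy = [70.1, 69.8, 70.3, 70.0, 69.9, 70.2]
baseline = fit_healthy_baseline(healthy)

# Newly collected data: only the last point is far from the healthy behavior.
flags = flag_anomalies([70.1, 70.4, 73.5], baseline)
```

A real model would typically account for operating conditions and seasonality; the point here is only the "learn healthy, then compare" pattern.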

Once this Machine Learning process is executed, the value of predictive maintenance is still not realized until the technicians are able to take action and enhance their maintenance processes.

We have developed the green boxes in the schema below to upgrade our solution from a Machine Learning platform to a Decision-Making one.

Alerting rules

First, technicians must be alerted when they need to take action on an asset they maintain. This alert should be comprehensible and meaningful.

Machine learning jobs output a list of data points predicted as anomalous, which may have little meaning for the technician. We therefore transform these outputs into business alerts that inform the technician about the maintenance to be planned. This makes it mandatory to link each anomaly detection to a technical defect type. It is also useful to associate the defect type with a level of criticality. For example, abnormal clogging detected on a filter may be less urgent than a problem with the compressor of a chiller.

Anomaly detection is very sensitive to data quality, which can occasionally be degraded by measurement errors or external noise. For example, a vibration measurement on a motor can be disturbed by work being performed next to the asset. Because the objective of predictive maintenance is to detect progressive drifts on the assets, a genuine anomaly detected by the machine learning models is expected to last longer than measurement noise. We must therefore verify that an anomaly persists before triggering an alert.
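The persistence check can be sketched as follows. This is a minimal illustration, assuming the model output is a list of per-point boolean flags sampled at a regular interval; the function name and the `min_consecutive` parameter are hypothetical.

```python
def persistent_anomaly(flags, min_consecutive):
    """Return True if `flags` contains a run of at least
    `min_consecutive` consecutive anomalous points."""
    run = 0
    for flagged in flags:
        # Extend the current run on an anomalous point, reset it otherwise.
        run = run + 1 if flagged else 0
        if run >= min_consecutive:
            return True
    return False

# Isolated spikes, such as noise from nearby work, never form a long run.
noise = [False, True, False, False, True, False]
# A progressive drift keeps the points anomalous over time.
drift = [False, True, True, True, True]

noise_alert = persistent_anomaly(noise, min_consecutive=3)
drift_alert = persistent_anomaly(drift, min_consecutive=3)
```

With a run length of three points, the noisy series raises no alert while the drifting series does.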

Sometimes it is also relevant to combine the anomalies detected on several sensors in order to identify more precisely the defect type the alert should report. For example, detecting an anomaly on the speed of a fan but not on its vibration can help diagnose a pressure loss in the air-handling (aeraulic) network rather than a defect on the motor.

For these reasons, ENGIE Digital chose to develop a “business rule engine” on top of the machine learning outputs. It consists of a form in which the business expert can configure the combination of rules, the duration of the anomaly, the defect type, and the urgency level associated with an alert.

Screenshot of the user interface, showing a list of active business rules for a temperature control valve — *User interface: list of active business rules for a Temperature Control Valve*
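A business rule of this kind could be represented roughly as below. This is a sketch under stated assumptions, not the engine's actual data model: the `BusinessRule` fields, the sensor names, the 24-hour duration, and the criticality labels are all hypothetical and chosen to mirror the fan example.

```python
from dataclasses import dataclass

@dataclass
class BusinessRule:
    defect_type: str   # technical defect reported to the technician
    criticality: str   # urgency level attached to the alert
    min_hours: float   # how long an anomaly must last to count
    expected: dict     # sensor -> True (must be anomalous) / False (must not be)

def evaluate(rules, anomaly_hours):
    """Return (defect_type, criticality) for every rule whose sensor
    conditions all hold. `anomaly_hours` maps each sensor to the duration
    of its ongoing anomaly (0.0 when the sensor is healthy)."""
    alerts = []
    for rule in rules:
        matches = all(
            (anomaly_hours.get(sensor, 0.0) >= rule.min_hours) == must_be_anomalous
            for sensor, must_be_anomalous in rule.expected.items()
        )
        if matches:
            alerts.append((rule.defect_type, rule.criticality))
    return alerts

rules = [
    BusinessRule("pressure loss in the air network", "medium", 24.0,
                 {"fan_speed": True, "vibration": False}),
    BusinessRule("motor defect", "high", 24.0, {"vibration": True}),
]

# The fan speed has been anomalous for 30 hours; vibration is healthy.
alerts = evaluate(rules, {"fan_speed": 30.0, "vibration": 0.0})
```

Keeping the rule as plain data is what lets a business expert edit it through a form rather than through code.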

Alerting mode

Several personas are involved in maintenance, and each should be alerted at a different frequency and through a different channel. To be fully usable, the alerts provided by the predictive maintenance solution must be delivered in several modes:

In-app alerts. The manager or the maintenance expert, for example a site manager, wants to closely follow the health of the assets they maintain in order to schedule the maintenance workload in advance and to anticipate the parts they have to source. A user interface that monitors the health of all the assets, their history, and the graphical insights needed to investigate the root cause and severity of the alerts therefore helps them make the right decisions on maintenance planning.
Email. Technicians on one side and business managers on the other have limited time to check the applications dedicated to their site maintenance management every day. Nevertheless, they should be made aware of an upcoming failure as soon as possible. A solution is to push the alerts that are considered critical to them by email.
Interface with the maintenance software ecosystem. Predictive maintenance is often reserved for critical assets. The rest of the maintenance is commonly monitored via other applications such as hypervision systems or work order ticketing software. Alerts coming from predictive maintenance solutions should also be pushed to these systems so that they are taken into account and prioritized against other planned maintenance tasks.

For these three use cases, ENGIE Digital developed its own front-end application with the maximum information for the site/expert manager. Every morning the application triggers an email to authorized users with a list of open alerts depending on their criticality level, and offers an interface to push the alerts to SAP.

Screenshot of the user interface showing the details of an alert — *User Interface: details of an alert*
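As an illustration of the morning email, the sketch below selects the open alerts at or above a given criticality and orders them from most to least critical. The alert fields, the three criticality levels, and the asset names are assumptions made for the example; sending the email and pushing to SAP are left out.

```python
# Hypothetical criticality scale: a lower rank means more urgent.
CRITICALITY_ORDER = {"high": 0, "medium": 1, "low": 2}

def build_digest(alerts, min_level="medium"):
    """Return one text line per open alert at or above `min_level`,
    sorted from most to least critical."""
    cutoff = CRITICALITY_ORDER[min_level]
    selected = [
        alert for alert in alerts
        if alert["status"] == "open"
        and CRITICALITY_ORDER[alert["criticality"]] <= cutoff
    ]
    selected.sort(key=lambda alert: CRITICALITY_ORDER[alert["criticality"]])
    return [
        f'[{alert["criticality"].upper()}] {alert["asset"]}: {alert["defect_type"]}'
        for alert in selected
    ]

alerts = [
    {"asset": "Boiler 2", "defect_type": "temperature drift",
     "criticality": "medium", "status": "open"},
    {"asset": "Filter 7", "defect_type": "abnormal clogging",
     "criticality": "low", "status": "open"},
    {"asset": "Chiller 1", "defect_type": "compressor fault",
     "criticality": "high", "status": "open"},
    {"asset": "Fan 3", "defect_type": "pressure loss",
     "criticality": "medium", "status": "closed"},
]

digest = build_digest(alerts)
```

The same filtered list could feed the other channels, with a different `min_level` per audience.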

User feedback

Predictive maintenance algorithms can trigger false alarms for several reasons:

The training dataset is too limited for the algorithms to learn all the valid operating patterns of the asset
The alerting rule is too sensitive to occasional outliers
The asset is running in restricted mode during a period of time
The asset’s configuration has changed (settings, renewed parts, etc.)

Depending on the situation, it is possible to adapt the machine learning algorithms and/or the business rules to reduce the rate of false alarms (retrain on a new dataset, update the models’ specifications, change the business rules, filter the data…).

To make the most reliable decision, the predictive maintenance system must collect feedback from the user on the alerts that have been triggered.

As an example, in the user interface developed by ENGIE Digital, we clearly invite users to qualify the alerts and, in case of a false alarm, the possible root cause: all the information relating to the alert is displayed on one page to help the user understand its origin. Their feedback triggers different actions, such as muting the alert for a period of time, muting the alert with thresholds on data measurements, or running a diagnosis of the model to update the predictions (find new features, a new training set).

Screenshot of the user interface showing a user feedback — *User Interface : user feedback*
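The two muting actions triggered by feedback could be modeled along these lines. This is only a sketch: the `MuteRule` structure is hypothetical, and it assumes that "muting with thresholds" means suppressing alerts while the measurement stays below a user-provided value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class MuteRule:
    until: Optional[datetime] = None   # mute every alert raised before this time
    min_value: Optional[float] = None  # mute alerts whose measurement is below this value

def is_muted(mute, alert_time, measurement):
    """Return True when the user's feedback says this alert should be suppressed."""
    if mute.until is not None and alert_time < mute.until:
        return True
    if mute.min_value is not None and measurement < mute.min_value:
        return True
    return False

# Feedback 1: mute the alert for one week after it was qualified as a false alarm.
feedback_time = datetime(2022, 5, 4)
time_mute = MuteRule(until=feedback_time + timedelta(days=7))

# Feedback 2: ignore anomalies while the measurement stays below 5.0.
threshold_mute = MuteRule(min_value=5.0)

muted_during_week = is_muted(time_mute, datetime(2022, 5, 6), 10.0)
muted_after_week = is_muted(time_mute, datetime(2022, 5, 12), 10.0)
muted_low_value = is_muted(threshold_mute, datetime(2022, 5, 6), 3.2)
```

A time-limited mute expires on its own, which avoids silently hiding a true alert long after the false alarm was reported.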

Getting users to volunteer their feedback is a real challenge, yet it is the best way for the system to recommend an accurate solution. Letting the solution drift with too many false alarms will erode the user’s trust, until a true alert ends up being overlooked by the maintainer.

Conclusion

To be fully actionable, the predictive maintenance system must be trusted by the user. This means being transparent about the data that is fed to the algorithms and proactive in updating the predictions with user feedback. Good advice is to work with designers and a pool of users to understand their user journey, to test prototypes and collect feedback, and to adapt the alerting process and the user interface as necessary.