The stakes of data cleaning for anomaly detection and predictive maintenance

ENGIE Digital
5 min read · Jun 27, 2022

By Paul Poncet

To improve the efficiency of ENGIE’s renewable energy assets (such as wind turbines), one of our key levers is to create and deploy predictive maintenance algorithms. This is part of our daily work at ENGIE Digital, where we keep deploying machine learning models on thousands of industrial assets, as explained in a previous article.

In this context, data cleaning is a mandatory step. Starting from our functional and technical understanding of potential data quality issues, we need to identify data artefacts and remove them whenever necessary, before training a machine learning model that might otherwise be biased, or before delivering new predictions.

In this article, we explain what we mean by “data artefacts” or “data surprises”, how we proceed in practice, and what challenges are still to be addressed.

Three kinds of data surprises in time series

Since many of our data science algorithms are made to detect real anomalies that occur in our wind turbines or solar farms, we need to distinguish between real anomalies and data artefacts: data or predictions that are anomalous for reasons independent of the asset’s health or performance.

What is a “real anomaly”? A real anomaly is an unexpected event occurring in an industrial asset that might cause an unavailability of the asset and/or a loss of energy production. Examples of such events (among many others) are:

- mechanical issues (accelerated wear and tear of a bearing in a wind turbine component for instance);

- sensor issues (some sensors might be essential for the safe operation of the asset);

- abnormal control loops (a wind turbine has its own internal regulation loops, e.g. for facing the main wind direction, restarting, or stopping; these loops need to be correctly parametrized to work as intended).

At a conceptual level, we make the distinction between three types of data surprises:

  • “outlier values”;
  • “uncommon values”;
  • “unsupported values”.

Outlier values correspond to physically impossible or very unlikely data records. This is the case e.g. for a wind speed of 1,000 m/s, or for a sequence of successive identical values (so-called frozen values) due to a loss of communication in the data acquisition chain (see Figure 1 and the sketch that follows it).

[Image: diagram of a wind turbine’s active power against wind speed]
Figure 1: This “S” shape is well known in the wind power industry. It displays active power (in kW) against wind speed (in m/s) measured on a wind turbine at a 10-minute sampling interval. Here, the vertical line (points in red) reveals the presence of frozen values in the wind speed measurement, due to a loss of communication in the data acquisition chain.
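To make this concrete, here is a minimal pandas sketch of how frozen values could be flagged; the helper name, column name, and run-length threshold are illustrative assumptions, not our package’s actual API.

```python
import pandas as pd

def flag_frozen(series: pd.Series, min_run: int = 6) -> pd.Series:
    """Flag runs of identical consecutive values of length >= min_run.

    With 10-minute data, min_run=6 flags signals frozen for an hour or
    more (the threshold is an illustrative choice).
    """
    # Label each run of constant values, then compute each run's length.
    run_id = series.ne(series.shift()).cumsum()
    run_length = series.groupby(run_id).transform("size")
    return run_length >= min_run

# Illustrative usage on a 10-minute wind speed signal
df = pd.DataFrame(
    {"wind_speed": [5.1, 5.3, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.2]},
    index=pd.date_range("2022-06-01", periods=9, freq="10min"),
)
df["frozen"] = flag_frozen(df["wind_speed"])  # flags the six repeated 7.0s
```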

Uncommon values are plausible data points related to an unusual situation (for instance, a curtailment on a wind turbine; see Figure 2). They may or may not be the sign of a real anomaly having occurred in the asset. But, in any case, at training time we do not need to know the cause of these uncommon values: we just remove them so that our anomaly detection model captures the normal behavior of the asset.

[Image: diagram of a wind turbine’s active power against wind speed]
Figure 2: Data points in red correspond to a situation where the wind turbine is curtailed. Curtailment may be part of normal operation (and may occur to limit noise emissions or comply with bat and bird protection legislation for instance) or may be anomalous. At training time we don’t need to know if these points are normal or anomalous: we merely remove them.

Unsupported values correspond to data points that, with respect to a specific topic or business question, must be left aside. Think of data points corresponding to a regular wind turbine stop: an asset being stopped is part of normal operation (see Figure 3). Hence, these points are neither “outliers” nor “uncommon”, but they are better filtered out to detect real anomalies.

How to define “unsupported values”? A data scientist fitting a regression model of the form y = f(x) + 𝜀 may think of unsupported values as lying outside the domain of the regression function f; this includes cases where y values are missing.

[Image: diagram of a wind turbine’s active power against wind speed]
Figure 3: Data points in red correspond to situations where the wind turbine is stopped, due to a lack of wind (wind speed typically below 3.5 m/s), a maintenance action, or an unexpected event. Usually, we consider these points as “unsupported” and filter them out to avoid including them in the training of our models.
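A minimal filter for such unsupported values might look like the following sketch; the 3.5 m/s cut-in speed comes from the caption above, while the column names and the zero-power condition are assumptions for the sake of the example.

```python
import pandas as pd

CUT_IN_WIND_SPEED = 3.5  # m/s, typical cut-in speed mentioned above

def drop_unsupported(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only data points where the turbine is actually running.

    Stopped-turbine points are valid measurements, but they lie outside
    the domain of the behavior we want to model.
    """
    running = (df["wind_speed"] >= CUT_IN_WIND_SPEED) & (df["active_power"] > 0)
    return df[running]
```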

Dealing with data surprises

Outlier values are usually cleaned up by combining business rules (e.g. thresholds beyond which a data point is very unlikely) with simple statistical algorithms, completed with more advanced algorithms when necessary.
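As a hedged illustration of such a combination on a wind speed signal, the sketch below pairs a plausibility threshold (a business rule; the 40 m/s bound is an arbitrary illustrative value) with classic 1.5 × IQR statistical fences:

```python
import pandas as pd

def flag_outliers(ws: pd.Series) -> pd.Series:
    """Combine a business rule with a simple statistical rule."""
    # Business rule: negative wind speeds, or 10-minute averages above
    # ~40 m/s, are physically implausible (illustrative thresholds).
    implausible = (ws < 0) | (ws > 40)
    # Statistical rule: classic 1.5 * IQR fences around the quartiles.
    q1, q3 = ws.quantile([0.25, 0.75])
    iqr = q3 - q1
    fences = (ws < q1 - 1.5 * iqr) | (ws > q3 + 1.5 * iqr)
    return implausible | fences
```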

To remove uncommon values, we may use either deterministic approaches (for instance we may know from our databases that there was a curtailment on a wind turbine during a given period) or statistical algorithms. In the latter case, simple algorithms are often not sufficient, and custom approaches are required.
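As an illustration of the deterministic approach, the sketch below masks timestamps falling inside known curtailment windows; the table of windows and all names are hypothetical, standing in for what we would read from our databases.

```python
import pandas as pd

# Hypothetical table of known curtailment windows for one wind turbine.
curtailments = pd.DataFrame({
    "start": pd.to_datetime(["2022-06-01 22:00", "2022-06-03 21:30"]),
    "end": pd.to_datetime(["2022-06-02 05:00", "2022-06-04 06:00"]),
})

def curtailed_mask(index: pd.DatetimeIndex, windows: pd.DataFrame) -> pd.Series:
    """Return True for timestamps inside any known curtailment window."""
    mask = pd.Series(False, index=index)
    for _, w in windows.iterrows():
        mask |= (index >= w["start"]) & (index <= w["end"])
    return mask

# Drop the uncommon values before training:
# df = df[~curtailed_mask(df.index, curtailments)]
```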

We choose not to fill data gaps created by the cleaning of uncommon values. Indeed, the occurrence of uncommon values may tell us that the industrial asset as a whole is in an “uncommon” state, in which case filling the gaps would not be robust at all and would disturb the training of an anomaly detection model.

To manage these tasks, we have developed a simple yet useful dedicated software package that powers all of our downstream predictive maintenance algorithms. This software includes methods for detecting:

  • frozen values (notably related to a loss of communication with the asset / farm);
  • duplicate values;
  • simple outliers;
  • records (highest or lowest values ever observed);
  • outlier values, with statistical algorithms (e.g. Local Outlier Factor, Influenced Outlierness, Isolation Forest, and variants of these algorithms; see the sketch after this list);
  • abnormal values, relying on the comparison of assets meant to behave similarly (think e.g. of wind turbines of the same wind farm).
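Our package is internal, but the statistical detectors listed above have well-known open-source counterparts. Purely for illustration, here is a scikit-learn sketch on synthetic data; it is not our actual implementation.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Synthetic (wind speed, active power) pairs mimicking the rising part
# of the "S" shape, with some measurement noise.
rng = np.random.default_rng(0)
ws = rng.uniform(4, 12, size=500)
X = np.column_stack([ws, 10 * ws**3 + rng.normal(0, 200, size=500)])

# Isolation Forest: scores points by how easily they are isolated.
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
iso_outlier = iso.predict(X) == -1

# Local Outlier Factor: compares a point's local density to its neighbors'.
lof = LocalOutlierFactor(n_neighbors=35, contamination=0.01)
lof_outlier = lof.fit_predict(X) == -1
```

For the comparison of similar assets, a simple variant is to compare each turbine’s signal at every timestamp to the median of its wind farm and flag large deviations.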

Conclusion

Data cleaning is a mandatory part of any machine learning pipeline. In practice, we observe that:

  • it acts as a safeguard against undesired behavior of machine learning models at scoring time, and makes our predictions more robust and reliable;
  • it contributes to reducing false positive rates when detecting anomalies on our assets; false positives are to be avoided, since we do not want wind turbine operators or maintainers to trigger a potentially expensive inspection or maintenance action for no reason;
  • it helps identify data quality issues that, once confirmed, may be treated upstream in the data acquisition and storage workflow.

These data cleaning steps can be applied either on-the-fly or at an early stage within databases, as part of a global data quality management process.

Yet, it remains difficult to assess the economic value of data cleaning algorithms. Also, it might be a problem for an end-user to get no predictions or alerts at a given moment because of data artefacts that have been cleaned up. These are some of the challenges that we have identified as next steps.

Acknowledgments

I thank my colleagues Céline Mallet and Régis Lavisse for their valuable suggestions to improve the content of this article.

