An efficient data science approach to creating asset groups

Left: Turbine clusters (image by author). Right: Photo by Philip May, CC BY-SA 3.0, via Wikimedia


Operating wind turbines generate streams of data while producing clean, renewable electricity for our daily use. The data is a time series of environmental, mechanical, and production variables obtained through the Supervisory Control and Data Acquisition (SCADA) system.

Wind energy analyses typically require preprocessing of the SCADA data, including identifying which turbines may be considered “neighbors”. What counts as a neighbor depends on the variable of interest, such as turbine location, wind speed, wind direction, or power output.

For instance, geographically, two or more turbines may be considered neighbors if they are closer to each other in latitude and longitude than to the remaining turbines. More technically, two turbines may also be grouped as neighbors based on wind speed if they experience similar wind speeds within the period under investigation.

Applications of Wind Turbine Clustering

Grouping the turbines in a wind farm is a useful data preprocessing step that needs to be performed relatively frequently. For non-geographic variables, the relationship between the turbines may change over time. Some useful applications of wind turbine grouping include:

  • Handling missing and spurious data: Identifying a group of turbines that is representative of a given turbine provides an efficient way to backfill missing or spurious data with the average of neighbors’ sensor data. This is especially useful for variables like wind speed because anemometers tend to have relatively low reliability.
  • Side-by-side analysis: Here, control turbines are selected to represent a test turbine, usually based on produced power. After a turbine upgrade, the performance improvement of the test turbine is measured by comparing its production with that of its neighbor(s). The clustering-based approach still needs to be explored further for this use case and compared with existing methods.
  • Group power forecasting: To reduce the computational cost, power forecast models may be built for groups of turbines in the wind farm rather than for individual turbines. In addition, this approach is expected to give more accurate results than building a single model for the whole farm.
  • Group yaw control: An improvement in wind farm energy production may be achieved by implementing an optimal yaw control strategy for a group of turbines rather than for each turbine individually. Here, turbines are neighbors in the sense of their position relative to the wind direction.

Although clustering techniques have been used in different areas of wind energy analysis such as wind farm power forecasting and yaw control, this article proposes an extension of this approach to other applications such as side-by-side analysis and handling of missing or spurious SCADA data.

Clustering-based SCADA Data Analysis

One method of turbine grouping involves calculating the sum of squared differences (SSD) between the measurements from a given turbine and those from every other turbine on the farm. The turbines with the smallest SSD are selected as neighbors. This method can be computationally expensive, especially if scripted by non-programming experts.
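To make the SSD idea concrete, here is a minimal sketch on synthetic wind speed data. The table layout (one column per turbine, one row per timestamp), the turbine names, and the helper `ssd_neighbors` are assumptions made for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic wind speeds: rows are timestamps, columns are turbines (assumed layout)
rng = np.random.default_rng(0)
base = rng.uniform(3, 12, size=200)
wspd = pd.DataFrame({
    "T1": base + rng.normal(0, 0.1, 200),
    "T2": base + rng.normal(0, 0.2, 200),        # tracks T1 closely
    "T3": base + 3.0 + rng.normal(0, 0.2, 200),  # same shape, constant offset
    "T4": rng.uniform(3, 12, size=200),          # unrelated to T1
})


def ssd_neighbors(data, target, n_neighbors=2):
    """Return the turbines with the smallest SSD relative to `target` (hypothetical helper)."""
    diffs = data.drop(columns=target).sub(data[target], axis=0)
    ssd = (diffs ** 2).sum(axis=0)
    return ssd.nsmallest(n_neighbors)


neighbors = ssd_neighbors(wspd, "T1", n_neighbors=2)
print(neighbors)
```

Repeating this for every turbine requires comparing every pair of turbines, which is one reason the cost can grow quickly on a large farm.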

Another method of grouping turbines in a wind farm employs the correlation coefficient between different turbines based on the variable of interest. This method is simple and not computationally expensive but may not be useful for applications such as handling missing values.
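The correlation approach can be sketched in a couple of lines with pandas. The data below is synthetic and the turbine names are hypothetical. The example also hints at one possible reason the method may be less suited to filling missing values: correlation ignores offset and scale, so a highly correlated turbine can still report very different absolute values.

```python
import numpy as np
import pandas as pd

# Synthetic wind speeds: rows are timestamps, columns are turbines (assumed layout)
rng = np.random.default_rng(1)
base = rng.uniform(3, 12, size=300)
wspd = pd.DataFrame({
    "T1": base + rng.normal(0, 0.2, 300),
    "T2": base + rng.normal(0, 0.3, 300),
    "T3": 2.0 * base + 5.0 + rng.normal(0, 0.3, 300),  # correlated, but scaled and shifted
    "T4": rng.uniform(3, 12, size=300),                # unrelated
})

# Pearson correlation between every pair of turbines
corr = wspd.corr()

# Rank the other turbines by their correlation with T1
ranked = corr["T1"].drop("T1").sort_values(ascending=False)
print(ranked)

# T3 correlates strongly with T1 yet its mean wind speed is very different
mean_gap = wspd["T3"].mean() - wspd["T1"].mean()
print(f"Mean gap between T3 and T1: {mean_gap:.2f} m/s")
```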

The clustering-based approach employs existing state-of-the-art data science techniques and tools that are publicly available. The most popular of such methods is K-Means clustering. This method ensures that turbines in a group not only have minimal Within-Cluster Sum of Squares but also maximal Between-Cluster Sum of Squares with turbines in other groups.
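The link between the two quantities can be checked numerically. For a fixed dataset, the total sum of squares equals the Within-Cluster Sum of Squares (WCSS) plus the Between-Cluster Sum of Squares (BCSS), so lowering one raises the other. The sketch below uses synthetic two-feature data for 30 hypothetical turbines and computes both terms from the fitted labels.

```python
import numpy as np
from sklearn.cluster import KMeans

# 30 hypothetical turbines described by 2 features, drawn around 3 centers
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 8.0]])
X = np.vstack([c + rng.normal(0, 0.5, size=(10, 2)) for c in centers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_

overall_mean = X.mean(axis=0)
wcss = 0.0
bcss = 0.0
for k in np.unique(labels):
    members = X[labels == k]
    centroid = members.mean(axis=0)
    # Spread of the members around their own centroid
    wcss += ((members - centroid) ** 2).sum()
    # Spread of the centroid around the overall mean, weighted by cluster size
    bcss += len(members) * ((centroid - overall_mean) ** 2).sum()

tss = ((X - overall_mean) ** 2).sum()
print(f"WCSS={wcss:.2f}, BCSS={bcss:.2f}, total={tss:.2f}")
```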

While K-Means clustering is similar to the SSD approach, it also optimizes for between-cluster dissimilarity which is expected to improve its robustness. In addition, the method applies to broader wind turbine analyses than the correlation coefficient approach and can elegantly handle multiple variables. Hence, this approach is more efficient for SCADA data analysis.

Additionally, clustering techniques are well-researched in data science and can be easily implemented in a few lines of code. Other clustering methods that may be explored include hierarchical, spectral, and density-based clustering methods.

Case Study:

Identifying turbine groups and handling missing wind speed data using cluster statistics.

In this example, we show an efficient yet simple data science approach to creating turbine groups in a wind farm using the K-Means clustering technique from the scikit-learn (sklearn) library. We also compare two methods of predicting missing wind speed data using the obtained clusters.

First, let’s import the relevant libraries:

# Import relevant libraries
import os
os.environ['OMP_NUM_THREADS'] = "1"
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from scada_data_analysis.modules.power_curve_preprocessing import PowerCurveFiltering

The Data

The dataset is publicly available on Kaggle and is used with the necessary citation. It covers 134 operational turbines over 245 days at a 10-minute resolution. Location data is also provided for each unit. The data was loaded as follows:

# Load data
df_scada = pd.read_csv('')
df_loc = pd.read_csv('sdwpf_baidukddcup2022_turb_location.csv')

Data Exploration

Let’s ensure the data was properly read by inspecting the head of the datasets.

Image by author
Image by author

The raw SCADA data consists of 4.7 million rows and 13 columns.

Data Cleaning

Let’s extract the relevant columns, namely the turbine unique identification (TurbID), day (Day), timestamp (Tmstamp), wind speed (Wspd), wind direction (Wdir), nacelle direction (Ndir), and power output (Patv).

# Extract desired features (copy so that new columns can be added without a SettingWithCopyWarning)
dff_scada = df_scada[['TurbID', 'Day', 'Tmstamp', 'Wspd', 'Wdir', 'Ndir', 'Patv']].copy()

Now, let’s inspect the missing values and remove the affected rows. Before removing the missing values, the data quality is visualized below:

Image by author

Only 1.05% of the total rows were removed due to missing values and the new data quality is displayed below:

Image by author
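The missing-value check and removal described above can be sketched as follows. The small DataFrame is a made-up stand-in for the extracted SCADA columns, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the extracted SCADA columns (values are made up)
dff_demo = pd.DataFrame({
    "TurbID": [1, 1, 2, 2, 3, 3],
    "Wspd": [5.1, np.nan, 6.3, 7.0, 4.8, 5.5],
    "Patv": [300.0, 410.0, np.nan, 820.0, 250.0, 330.0],
})

# Count missing values per column before cleaning
missing_per_column = dff_demo.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
rows_before = len(dff_demo)
dff_clean = dff_demo.dropna().reset_index(drop=True)

# Share of rows removed, in percent
pct_removed = 100 * (rows_before - len(dff_clean)) / rows_before
print(f"{pct_removed:.2f}% of rows removed")
```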

Next, we create a unique Day-Time string for creating a time series of the variables as needed.

# Build the Day-Time string column-wise instead of row by row
dff_scada['date_time'] = dff_scada['Day'].astype(str) + 'T' + dff_scada['Tmstamp'].astype(str)

Data Filtering

Raw SCADA data can be quite messy and requires filtering to extract the typical operational behavior of each turbine. We will employ the open-source scada-data-analysis library which has a power curve filtering tool for this step. The GitHub repository for the library can be found here.

Image by author

The code used for data filtering and the resulting cleaned SCADA data are shown below. First, we remove spurious data based on domain knowledge.

# Set parameters
cut_in_speed = 2.85
# Initial filtering of obviously spurious data
ab_ind = dff_scada[(dff_scada['Wspd'] < cut_in_speed) & (dff_scada['Patv'] > 100)].index
norm_ind = list(set(dff_scada.index).difference(set(ab_ind)))
assert len(dff_scada) == len(norm_ind) + len(ab_ind)
scada_data = dff_scada.loc[norm_ind, :]

Then, we use the power curve filter to remove abnormal operating data.

# Instantiate the Power Curve Filter
pc_filter = PowerCurveFiltering(
    turbine_label='TurbID', windspeed_label='Wspd', power_label='Patv',
    data=scada_data, cut_in_speed=cut_in_speed,
    bin_interval=0.5, z_coeff=1.5, filter_cycle=15, return_fig=False,
)
# Run the data filtering module
cleaned_scada_df, _ = pc_filter.process()

Image by author

The cleaned data has 1.9 million rows and is more representative of the expected relationship between wind speed and power output for operational wind turbines.

Now, we create test data for evaluating the performance of the clustering approach when predicting missing values. The test data is randomly sampled from the cleaned data, at a prevalence (1.05%) similar to that of the missing values in the original dataset.

Image by author
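One way to hold out such a test set is to sample a fixed fraction of the cleaned rows and keep their wind speeds as ground truth. The sketch below uses synthetic data. The names `cleaned_demo`, `test_df`, and `train_df` are hypothetical, and dropping the sampled rows from the training data is an assumption about how the hold-out was done.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned SCADA data (values are made up)
rng = np.random.default_rng(0)
cleaned_demo = pd.DataFrame({
    "TurbID": rng.integers(1, 6, size=2000),
    "date_time": rng.integers(0, 400, size=2000),
    "Wspd": rng.uniform(3, 12, size=2000),
})

# Hold out about 1.05% of the rows as "missing" values with known ground truth
test_fraction = 0.0105
test_df = cleaned_demo.sample(frac=test_fraction, random_state=42)

# The remaining rows are what the cluster statistics are computed from
train_df = cleaned_demo.drop(index=test_df.index)

print(len(test_df), len(train_df))
```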

Data Transformation

In this step, we transform the filtered data for each of the desired variables so that it is ready for clustering analysis. The transformed data for wind speed is shown below:

Image by author
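A plausible form of this transformation is a pivot from the long SCADA format to a wide table with one row per turbine and one column per timestamp, so that each turbine becomes one sample for clustering. The layout and the gap-filling rule below (mean of the other turbines at the same timestamp) are assumptions for this sketch and may differ from the exact transformation used for the figure.

```python
import pandas as pd

# Long-format synthetic data: one row per turbine and timestamp (values are made up)
long_df = pd.DataFrame({
    "TurbID": [1, 1, 1, 2, 2, 3, 3, 3],
    "date_time": ["1T00:00", "1T00:10", "1T00:20",
                  "1T00:00", "1T00:20",
                  "1T00:00", "1T00:10", "1T00:20"],
    "Wspd": [5.0, 5.5, 6.0, 5.2, 6.1, 7.0, 7.4, 7.9],
})

# Wide format: one row per turbine, one column per timestamp
wide = long_df.pivot_table(index="TurbID", columns="date_time",
                           values="Wspd", aggfunc="mean")

# K-Means cannot handle NaN, so gaps left by filtering must be filled first.
# Here each gap takes the mean of the other turbines at that timestamp.
wide_filled = wide.fillna(wide.mean(axis=0))
print(wide_filled)
```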

Cluster Modeling

We would like to cluster the turbines based on wind speed, wind direction, nacelle direction, and power output. The idea is to identify which group of turbines can be considered neighbors and used for the statistical representation of a given turbine.

We use the KMeans algorithm in this example. Selection of the optimal number of clusters is critical for setting up the model and the popular Elbow method is employed for this purpose. The elbow plots for all cases are shown below:

Image by author
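The Elbow method amounts to fitting K-Means for a range of cluster counts and recording the inertia (the Within-Cluster Sum of Squares) each time. The sketch below does this on synthetic data with three well-separated groups. The data and the range of k values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: 60 hypothetical turbines, 2 features, 3 underlying groups
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 6.0]])
X = np.vstack([c + rng.normal(0, 0.6, size=(20, 2)) for c in centers])

# Fit K-Means for k = 1..8 and record the inertia for each k
k_values = list(range(1, 9))
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# The "elbow" is where the drop in inertia flattens out
drops = -np.diff(inertias)
for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against `k_values` gives the elbow plot. On this synthetic data the curve flattens after k = 3.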

The optimal number of clusters chosen based on individual and combined features is 3, although using 4 or 5 clusters also gave reasonable results.

We used the StandardScaler tool in the Sklearn preprocessing module to scale the input data for the all-features case, since the features have different orders of magnitude.
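A minimal sketch of scaling followed by clustering is shown below. The per-turbine feature table is synthetic: each row is a hypothetical turbine summarized by one value per variable, which is a simplification of the full time series used in the article.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic per-turbine features with very different magnitudes (values are made up)
rng = np.random.default_rng(0)
n = 30
group = np.repeat([0, 1, 2], 10)  # underlying groups, used only to generate the data
features = pd.DataFrame({
    "Wspd": 5.0 + 1.5 * group + rng.normal(0, 0.1, n),
    "Wdir": 10.0 * group + rng.normal(0, 1.0, n),
    "Patv": 400.0 + 300.0 * group + rng.normal(0, 20.0, n),
}, index=pd.RangeIndex(1, n + 1, name="TurbID"))

# Scale each feature to zero mean and unit variance so that no single
# variable dominates the distance computation
X_scaled = StandardScaler().fit_transform(features)

# Cluster the turbines into 3 groups
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
features["cluster"] = kmeans.fit_predict(X_scaled)

print(features["cluster"].value_counts())
```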


We created a clustering model based on each variable and all variables combined and identified the turbine groups for these cases. The results are shown below:

Image by author

In the results above, the wind speed and wind direction clusters exhibit a similar pattern: turbines in group 1 are on the edges of the park, which makes sense because both edges experience a similarly reduced level of obstruction, especially if the predominant wind direction is along the X axis. In addition, groups 2 and 3 are found in the middle of the park, across the columns (along the X axis).

In the power output cluster result, group 1 turbines are found on the right side of the X axis with a clearly defined boundary. Group 2 turbines are in the middle of the park and along the X axis while group 3 turbines are the largest group and occupy mostly the edges of the park.

The nacelle direction is different from the rest of the variables because it depends strongly on the yaw logic applied to the turbine and may vary across the site. Hence, the clusters do not have a clear boundary. However, this analysis may be useful to troubleshoot underperformance related to yaw misalignment when combined with production data.

The clustering result using all features combined is similar to the wind speed cluster and is shown in the article header picture.

Cluster-based approach for missing value imputation

Here, we will explore two cluster-based approaches for handling missing values in the wind speed data: naive clustering (NC) and column-sensitive clustering (CSC). Both methods were intuitively named by the author.

Naive clustering

In this approach, the missing value for a given turbine is replaced by the mean (or median) value of the cluster to which the turbine belongs at the desired timestamp. We will use the mean cluster value for this analysis.
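Naive clustering can be sketched with a pandas group-by: compute the mean wind speed per cluster and timestamp, then use it to fill the gaps. The data, the cluster assignment `cluster_of`, and the column names below are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic long-format data with one missing wind speed (turbine 2 at 1T00:10)
train = pd.DataFrame({
    "TurbID":    [1, 2, 3, 4, 1, 2, 3, 4],
    "date_time": ["1T00:00"] * 4 + ["1T00:10"] * 4,
    "Wspd":      [5.0, 5.4, 8.0, 8.2, 6.0, np.nan, 9.0, 9.4],
})

# Hypothetical cluster assignment from a previously fitted model
cluster_of = pd.Series({1: 0, 2: 0, 3: 1, 4: 1}, name="cluster")
train["cluster"] = train["TurbID"].map(cluster_of)

# Mean wind speed of each cluster at each timestamp (NaN values are skipped)
cluster_mean = train.groupby(["cluster", "date_time"])["Wspd"].transform("mean")

# Replace missing values with the cluster mean at the same timestamp
train["Wspd_filled"] = train["Wspd"].fillna(cluster_mean)
print(train)
```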

Column-sensitive clustering

This method extends the naive clustering approach by taking the mean value of only those turbines that are in the same cluster and the same column of the park layout. This accounts for the geographical location of the turbines and may be more accurate, especially for variables like wind speed.
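The column-sensitive variant only adds the layout column to the group-by keys. In the sketch below, the `col` label (which column of the park a turbine sits in) is a hypothetical field that would have to be derived from the location data. The example also shows a case the method cannot fill, with the naive cluster mean used as a fallback.

```python
import numpy as np
import pandas as pd

# Synthetic data for one timestamp: all turbines share a cluster but sit in
# different layout columns (the 'col' label is a hypothetical field)
train = pd.DataFrame({
    "TurbID":    [1, 2, 3, 4, 5],
    "cluster":   [0, 0, 0, 0, 0],
    "col":       [0, 0, 1, 1, 2],
    "date_time": ["1T00:00"] * 5,
    "Wspd":      [5.0, np.nan, 7.0, 7.4, np.nan],
})

# Column-sensitive mean: same cluster, same layout column, same timestamp
csc_mean = train.groupby(["cluster", "col", "date_time"])["Wspd"].transform("mean")
# Naive mean: same cluster and timestamp only
nc_mean = train.groupby(["cluster", "date_time"])["Wspd"].transform("mean")

# CSC fills turbine 2 but not turbine 5, which has no neighbor in its column
train["Wspd_csc"] = train["Wspd"].fillna(csc_mean)

# Complementary use: fall back to the naive cluster mean where CSC has no data
train["Wspd_combined"] = train["Wspd_csc"].fillna(nc_mean)
print(train)
```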

Cluster Model Evaluation

The test data consists of 19,940 rows and contains the ground truth of the missing wind speed data to be predicted using the cluster approaches.

The naive clustering approach can fill 99.7% of the missing values based on the data available in the training clusters, whereas the column-sensitive method can only fill 93.7% because less training data is available when the clusters are further binned into columns.

Both methods are evaluated using the mean absolute error (MAE) metric based on the missing values they can predict. In addition, the mean absolute percentage error (MAPE) metric is used to evaluate the non-zero predictions. For both metrics, the smaller the better in terms of model performance.
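The evaluation can be sketched with the Sklearn metrics imported earlier. The numbers below are made up. Restricting MAPE to rows with a non-zero ground truth is an assumption made here to avoid dividing by zero, and note that the Sklearn function returns a fraction rather than a percentage.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Made-up ground truth and predictions; NaN marks a value the method could not fill
y_true = np.array([5.0, 6.0, 0.0, 8.0, 10.0])
y_pred = np.array([5.5, np.nan, 0.4, 7.0, 10.0])

# Share of missing values the method was able to fill
filled = ~np.isnan(y_pred)
fill_rate = filled.mean()

# MAE over the values that were actually predicted
mae = mean_absolute_error(y_true[filled], y_pred[filled])

# MAPE over predicted values with non-zero ground truth (assumption, see above)
nonzero = filled & (y_true != 0)
mape = mean_absolute_percentage_error(y_true[nonzero], y_pred[nonzero])

print(f"fill rate={fill_rate:.0%}, MAE={mae:.3f}, MAPE={mape:.1%}")
```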

The CSC approach gave a 2% and 8% improvement over the NC approach based on the MAE and MAPE, respectively. However, it fills fewer missing values. Hence, complementary use of both approaches will offer greater benefits.

To give a clear visual comparison of the results, we plot 100 randomly sampled points from the test data as shown below:

Image by author

Next, we visualize the missing value imputation errors for both approaches. The imputation error is the difference between the ground truth wind speed value and the predicted value.

Image by author

Both approaches have well-behaved imputation errors that are symmetric about the zero-error position.


In this article, we performed clustering-based SCADA data preprocessing using different individual and combined variables. In addition, we predicted missing wind speed data using naive and column-sensitive clustering approaches. Finally, we inferred from the analysis that both methods can be complementary for greater benefits.

I hope you enjoyed reading this article, until next time. Cheers!



Zhou, J., Lu, X., Xiao, Y., Su, J., Lyu, J., Ma, Y., & Dou, D. (2022). SDWPF: A Dataset for Spatial Dynamic Wind Power Forecasting Challenge at KDD Cup 2022. arXiv.

Wind energy analytics toolbox: Iterative power curve filter

Clustering-based data preprocessing for operational wind turbines was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.