
A closer look at the data and data cleaning




The scatter plot of our dataset showed three distinct sections of the curve. In the first, at low wind speeds, the power values are clustered around zero (this section holds many of the observations with zero power). In the second, power increases roughly linearly with wind speed. In the third, at higher wind speeds, power reaches a peak and levels off; the zero-power outliers at the highest speeds also sit here.

We would expect power to be generated only when wind speed is above the cut-in speed of 3 to 4 metres per second, yet the scatter plot shows non-zero power values at wind speeds below this. Judging from the scatter plot and the data values, the cut-out speed is 24.4 metres per second. The speed at which a turbine cuts back in after a cut-out is 20 metres per second, which might explain some of the variation in power values around this speed range in our dataset. There are no null values in the dataset, but there are 49 data points where the power value is zero, representing about 10% of the observations. There is a single data point where both speed and power are zero.

  • Ten data points where speed is above 24.4 metres per second, each with a corresponding power value of zero
  • No data points where speed is above 24.4 metres per second and power is not equal to zero
  • Eleven data points where speed is between 4 and 8 metres per second and power is zero
  • Four data points where speed is greater than 8 and less than 24.4 metres per second and power is zero
  • 24 data points where speed is less than 4 metres per second and power is zero
  • The cut-in speed is typically 3 to 4 metres per second. There are 17 observations where speed is less than 3 metres per second and power is zero.

Non-zero power values where speed is less than 3 metres per second:

There are 39 observations where speed is below 3 metres per second, the lower end of the typical cut-in range of 3 to 4 metres per second, and the corresponding power value is greater than zero. I think these will be the tricky ones to predict. An article referenced above [4] mentioned that although there is sometimes not enough wind to turn a turbine, wind energy is not lost, as it can be stored in energy storage systems for later use whenever wind levels are low.

Rated speed

At the rated wind speed, the turbine is able to generate electricity at its maximum, or rated, capacity. The rated speed is usually in the range of 25 to 35 mph (equivalent to about 11.18 to 15.65 metres per second). This is based on the research above, but we don't know the details of the actual wind turbines being used here. The turbines in this dataset do not reach their maximum capacity until wind speed is greater than 15 metres per second, so their rated speed appears to sit at the upper end of this range. The maximum power value in the dataset is 113.556 kW, but there are very few observations where power is greater than 110 kW. The power curve for this dataset shows the power values levelling off at around 90 to 100 kW.
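
The mph to metres per second conversion behind the rated speed range can be checked with a few lines of Python. This is only a small sketch; the helper name `mph_to_ms` is my own and not part of the original notebook.

```python
# Exact conversion: 1 mile = 1609.344 m and 1 hour = 3600 s
MPH_TO_MS = 1609.344 / 3600  # 0.44704 metres per second per mph


def mph_to_ms(mph):
    """Convert a speed in miles per hour to metres per second."""
    return mph * MPH_TO_MS


# Print the conversion for a few speeds in the quoted rated range
for mph in (25, 30, 35):
    print(f"{mph} mph = {mph_to_ms(mph):.2f} m/s")
```

This prints 11.18, 13.41 and 15.65 metres per second for 25, 30 and 35 mph respectively.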

Unique values in the dataset:

# show the unique values
df.speed.unique()
len(df.groupby('speed')) # 490
# a count of the unique speed values
df.speed.value_counts() # 490
# a count of the unique power values
df.power.value_counts() # 451
df.groupby('speed')['power'].agg(['count', 'size', 'nunique'])
len(df.groupby('power')) # 451

df.groupby('power')['speed'].agg(['count', 'size', 'nunique'])
len(df[['speed', 'power']].drop_duplicates()) # 490
# the rows where both values are zero
df[(df['speed']==0) & (df['power']==0)].count() # only 1
df[df.power==0].count() # 49
speed    49
power    49
dtype: int64
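
The uniqueness checks above can also be written with `nunique`, `drop_duplicates` and `duplicated`. The sketch below runs on a tiny made-up dataframe (`demo` is a hypothetical stand-in, not the real data), so the numbers it prints are not the 490 and 451 quoted above.

```python
import pandas as pd

# Made-up rows for illustration; not the real wind turbine data
demo = pd.DataFrame({
    "speed": [1.0, 1.0, 2.0, 3.0],
    "power": [0.0, 0.0, 0.5, 0.5],
})

unique_per_column = demo.nunique()            # distinct values in each column
unique_pairs = len(demo.drop_duplicates())    # distinct (speed, power) rows
repeated_rows = int(demo.duplicated().sum())  # rows repeating an earlier row

print(unique_per_column)
print(unique_pairs, repeated_rows)
```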

Number of zero power values in the dataset:

# boolean mask of the zero-power observations
zero_power = df.power == 0.0
print("The number of observations with a zero power value is as follows:")
print(f"For speed values below 3 metres per second: {((df.speed < 3) & zero_power).sum()}")
print(f"For speed values between 3 and 4 metres per second: {((df.speed > 3) & (df.speed < 4) & zero_power).sum()}")
print(f"For speed values between 4 and 7 metres per second: {((df.speed > 4) & (df.speed < 7) & zero_power).sum()}")
print(f"For speed values between 7 and 24.4 metres per second: {((df.speed > 7) & (df.speed < 24.4) & zero_power).sum()}")
print(f"For speed values above 24.4 metres per second: {((df.speed > 24.4) & zero_power).sum()}")
The number of observations with a zero power value is as follows:
For speed values below 3 metres per second: 17
For speed values between 3 and 4 metres per second: 7
For speed values between 4 and 7 metres per second: 9
For speed values between 7 and 24.4 metres per second: 6
For speed values above 24.4 metres per second: 10
df[(df.speed < 3) & (df.power>0.0)].count() # 39
df[(df.speed < 3) & (df.power==0.0)].count() # 17
df[(df.speed < 4) & (df.power>0.0)].count() # 56
df[(df.speed < 4) & (df.power==0.0)].count() # 24
speed    24
power    24
dtype: int64

Number of power output values in each range above 80 kW:

print(f"Number of data points where power is between 80 and 90 kW: {((df.power > 80) & (df.power < 90)).sum()}")
print(f"Number of data points where power is between 90 and 100 kW: {((df.power > 90) & (df.power < 100)).sum()}")
print(f"Number of data points where power is between 100 and 110 kW: {((df.power > 100) & (df.power < 110)).sum()}")
print(f"Number of data points where power is greater than 110 kW: {(df.power > 110).sum()}")
Number of data points where power is between 80 and 90 kW: 31
Number of data points where power is between 90 and 100 kW: 95
Number of data points where power is between 100 and 110 kW: 55
Number of data points where power is greater than 110 kW: 2

Cut-out speeds for safety reasons:

At the cut-out wind speed, the turbine shuts down to avoid damage. There are 10 observations that fall into the cut-out speed range. There are no datapoints at all where speed is above the cut-out speed value and power value is not zero.

# there are no points in the dataset where speed is greater than 24.4 and power is not zero
print(f"Number of observations where speed is above 24.4 metres per second: {(df.speed > 24.4).sum()}")
print(f"Number of observations where power is zero and speed is above 24.4 metres per second: {((df.speed > 24.4) & (df.power == 0.0)).sum()}")
Number of observations where speed is above 24.4 metres per second: 10
Number of observations where power is zero and speed is above 24.4 metres per second: 10
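
The same check can be expressed as a single boolean instead of comparing two counts. The following is a sketch on made-up rows; `demo` and `CUT_OUT` are names I have introduced for illustration.

```python
import pandas as pd

CUT_OUT = 24.4  # m/s, the cut-out value read from the scatter plot

# Made-up rows for illustration; not the real wind turbine data
demo = pd.DataFrame({
    "speed": [10.0, 24.0, 24.5, 25.0],
    "power": [55.0, 98.0, 0.0, 0.0],
})

# Select everything above the cut-out speed and test that power is zero throughout
above_cut_out = demo[demo["speed"] > CUT_OUT]
all_off = bool((above_cut_out["power"] == 0).all())

print(len(above_cut_out), all_off)
```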

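Zero power values at speeds between 7 and 8 metres per second:

# zero-power observations with speed between 7 and 8 metres per second
df[(df.speed > 7) & (df.speed < 8) & (df.power == 0.0)].count()
speed    2
power    2
dtype: int64
# all observations with speed between 11 and 13.4 metres per second (roughly 25 to 30 mph)
df[(df.speed > 11) & (df.speed < 13.4)].count()
speed    49
power    49
dtype: int64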

Data cleaning:

I debated whether to leave in the observations with zero power values, as they do convey some information about the dataset. However, after initially training the neural network on the complete dataset, the cost did not fall as much as desired, and the resulting plots suggested that keeping the zero power values at the very high wind speeds was throwing the fit off. The research above showed that there is a cut-out speed of between 24 and 25 metres per second for safety reasons: at the cut-out wind speed, the turbine shuts down to avoid damage. This is enough to justify excluding these observations, as we can predict that the power output will be zero whenever the wind speed exceeds the cut-out value. There are only ten observations in the dataset that fall into this range. We can only predict power values when the turbines are turned on, so the model should arguably only predict for speeds at which the turbine is on.
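
One way to combine a fixed cut-out rule with a trained model is a thin wrapper around the model's prediction. The sketch below is an assumption about how that might look, not the approach used in the notebook: `predict_power` is a hypothetical helper and `toy_model` is a stand-in straight line, not a real fit.

```python
CUT_OUT = 24.4  # m/s, cut-out value quoted in the text


def predict_power(speed, model, cut_out=CUT_OUT):
    """Return 0 above the cut-out speed, otherwise defer to the model.

    `model` is a hypothetical callable mapping a wind speed to a power
    value; it stands in for whatever regressor or network is trained.
    """
    if speed > cut_out:
        return 0.0
    return float(model(speed))


def toy_model(speed):
    """Stand-in model for illustration only: a straight line."""
    return 5.0 * speed


print(predict_power(10.0, toy_model))  # below cut-out: uses the model
print(predict_power(25.0, toy_model))  # above cut-out: rule returns zero
```

With this split, the model never has to learn the sudden drop to zero at the cut-out speed.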

While there is only one zero value for the speed variable, there are 49 zero values for the power variable. Most of these occur at low speeds, alongside non-zero power values, but a few are associated with medium and high speeds. Most of the data points in the dataset are unique. The one data point with a zero speed value has a zero power value, as expected.

Summary of where the zero power values occur:

  • 17 where speed is less than 3 metres per second
  • 7 where speed is between 3 and 4 metres per second, the typical cut-in range
  • 9 where speed is between 4 and 7 metres per second
  • 6 where speed is between 7 and 24.4 metres per second
  • 10 where speed is above 24.4 metres per second, the cut-out value

Wind turbines generate electricity at wind speeds of 4 to 25 metres per second. For now I will drop all observations where speed is greater than the cut-out value of 24.4 metres per second. I will also drop the zero-power observations between the cut-in speed (taken here as 4 metres per second) and the cut-out speed. I will leave in all the observations where speed is less than the cut-in speed, including the zero values. In total I am therefore dropping 25 observations from the dataset, all of which have zero power values.

  • Ten observations where the wind speed is greater than the cut-out value; the corresponding power is zero as the turbines are off.
  • Fifteen observations where wind speed is greater than the cut-in speed and less than the cut-out speed and the power is zero.

I am assuming that these represent points where the turbines have been turned off for maintenance or other reasons. I may revisit this. I will make a copy of the dataframe for this purpose.

Excluding the zero values results in the models over-predicting the power values at the higher wind speeds. Leaving in the zero values does pull the curves back down, but only after the max speed has been exceeded.

#https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
# https://thispointer.com/python-pandas-how-to-drop-rows-in-dataframe-by-conditions-on-column-values/
#df.drop(df.loc[(df.speed>24.4)].index, inplace=True)
# make a copy of the dataframe
dfx = df.copy()
# how many rows where wind speed above the cut-in speed and power is zero
dfx.loc[(dfx.speed > 4)&(dfx.power == 0)].count()
speed    25
power    25
dtype: int64
# drop rows from the new dataframe 
dfx.drop(dfx.loc[(dfx.speed > 4) & (dfx.power == 0)].index, inplace=True)
# summary statistics
dfx.describe()
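
The drop step above can also be written as a boolean-mask filter that returns a new dataframe rather than modifying one in place. This is a sketch on made-up rows: the function name `drop_zero_power_above_cut_in` and the `demo` data are my own, and the cut-in value of 4 metres per second follows the threshold used in the code above.

```python
import pandas as pd

CUT_IN = 4.0  # m/s, upper end of the 3 to 4 m/s cut-in range


def drop_zero_power_above_cut_in(frame, cut_in=CUT_IN):
    """Return a copy without rows where speed > cut_in and power == 0."""
    turbine_off = (frame["speed"] > cut_in) & (frame["power"] == 0)
    return frame.loc[~turbine_off].copy()


# Made-up rows for illustration; not the real wind turbine data
demo = pd.DataFrame({
    "speed": [2.0, 5.0, 12.0, 25.0],
    "power": [0.0, 0.0, 60.0, 0.0],
})

cleaned = drop_zero_power_above_cut_in(demo)
print(len(demo) - len(cleaned))  # number of rows removed
```

Because the original dataframe is left untouched, it is easy to rerun the models with and without the zero values for comparison.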