TIME SERIES ANALYSIS WITH POWER METERS, PART 1
To better follow energy consumption, the government wants energy suppliers to install smart meters in every home in England, Wales and Scotland. There are more than 26 million homes for the energy suppliers to reach, with the goal of every home having a smart meter by 2020.
In this dataset you will find a fraction of the data from the London Datastore, containing the energy consumption readings for a sample of 50 London households that took part in the UK Power Networks Low Carbon London project between November 2011 and February 2014. The smart meter data covers electricity consumption only.
The purpose of this notebook is to make you familiar with some building blocks of Time Series analysis. It follows a pedagogical flow.
I used Datacamp.com, AnalyticsVidhya.com and Kaggle.com as resources. The data itself is available on Kaggle.
from google.colab import files
uploaded = files.upload()
The dataset is split into many blocks of CSV files; at the moment we are only looking at block number 100, where the figures are given as daily aggregates.
print(uploaded['block_100.csv'][:200].decode('utf-8') + '...')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
import io
df = pd.read_csv(io.StringIO(uploaded['block_100.csv'].decode('utf-8')))
df.head()
LCLids are the ID numbers of the power meters. Each one has one record per day in this CSV file. But how many power meters do we have here?
len(df.groupby(['LCLid']))
How many days?
len(df['day'].unique())
The complete dataset can be quite complex, so it is better to start small. One way to do that is to pick one power meter at random and look into it.
import random
random.seed(10)
my_powermeter = random.choice(df['LCLid'])
#pm stands for power meter
pm_data = df[df['LCLid'] == my_powermeter].copy()  #copy so that later changes don't touch a view of df
pm_data.info()
Convert the day column to datetime format
pm_data.day = pd.to_datetime(pm_data.day)
#and set index inplace!
pm_data.set_index('day', inplace=True)
pm_data.head() #see that now date is the index
pm_data.info()
Now we will look at the distribution of the median energy consumption and spot whether there are any anomalies
pm_data['energy_median'].describe()
There is a huge difference between the median and the maximum value, so there is potentially something fishy there. A simple boxplot can help us see that...
plt.figure(figsize=(5,5))
sns.boxplot(y='energy_median', data=pm_data)
We will replace the observations greater than 0.25 with 0.25 at this point. They need more attention, though: maybe these are not measurement errors but special events which should not be excluded. However, we will do this manipulation for exploration purposes. You can try median imputation for yourself.
pm_data['energy_median'] = pm_data['energy_median'].clip(upper=0.25)  #cap values above 0.25
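If you prefer the median-imputation route mentioned above, here is a minimal sketch, assuming the same 0.25 cut-off; you would run it instead of the capping line above, not after it:
#sketch of median imputation: replace values above the 0.25 cut-off with the overall median
median_value = pm_data['energy_median'].median()
pm_data['energy_median'] = pm_data['energy_median'].where(pm_data['energy_median'] <= 0.25, median_value)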
Now let's try a bit of exploration without those outliers.
pm_data['energy_median'].plot(figsize=(18,7), linewidth=5, fontsize=20)
plt.xlabel('Date', fontsize=20);
This increase in winter confirms our intuition, right? We can also try a rolling average, also known as a moving average...
Median = pm_data[['energy_median']]
Median.rolling(10).mean().plot(figsize=(18,7), linewidth=5, fontsize=20)
plt.xlabel('Date', fontsize=20);
Detrending our data is easy by taking first differences, i.e. the difference between successive data points (y_t - y_{t-1}).
pm_data['energy_median'].diff().plot(figsize=(18,7), linewidth=3, fontsize=20)
plt.xlabel('Date', fontsize=20);
Notice that the peaks in winter are still visible, with increases of up to 8-9%.
Differencing is helpful in converting your time series into a stationary time series. In stationary data, the statistical properties do not change over time. This is good because many time series forecasting methods are based on the assumption that the time series is approximately stationary. It can turn out that our data is already stationary and we don't need this transformation, though...
There is also the concept of autocorrelation. A time series is periodic if it repeats itself at equally spaced intervals, like this one (every 12 months).
Another way of thinking about this is that the time series is correlated with itself shifted by 12 months. That means that, if you took the time series and moved it 12 months backwards or forwards, it would map onto itself in some way.
The correlation of a time series with such a shifted (lagged) version of itself is what autocorrelation captures.
plt.figure(figsize=(18,6))
pd.plotting.autocorrelation_plot(pm_data['energy_median'])
The grey lines are the lines of statistical significance. We can say there is some autocorrelation up to about 50 lags, but we cannot call it strong, since it declines from about 0.5 to 0.25. Whenever the blue line falls inside the grey lines, the autocorrelation is not significant.
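To make the lag idea concrete, here is a minimal sketch that computes the autocorrelation at a few hand-picked lags (the lag values of 1, 7 and 30 days are arbitrary illustrative choices):
#correlation of the series with a copy of itself shifted by `lag` days
for lag in [1, 7, 30]:
    print('lag %d: %.3f' % (lag, pm_data['energy_median'].autocorr(lag=lag)))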
Let's first see if the trends are similar across all these power meters. From this line onwards we will look at the daily total energy consumption, not the median.
g = sns.FacetGrid(df, col="LCLid", col_wrap=7, height=2)
g = g.map(plt.plot, "energy_sum", marker=".")
In general we can say that there are big and small consumers. We also observe curious spikes from time to time.
In order to work with different power meters in a simpler way, we can create a small class. The first method in the class asks the user for a power meter ID and builds a time series DataFrame. The other methods in the class replicate the earlier plots using that DataFrame.
#pm is a power meter helper class
class pm():
    #assumes all the libraries imported above are available

    @staticmethod
    def make_pm_df(df):
        #ask the user for a power meter id and return its data as a date-indexed DataFrame
        powermeter_id = input()
        data = df[df['LCLid'] == powermeter_id].copy()
        data.day = pd.to_datetime(data.day)
        data.set_index('day', inplace=True)
        return data

    @staticmethod
    def anomaly_checker(data, variable):
        #boxplot to spot outliers in the chosen variable
        plt.figure(figsize=(5,5))
        sns.boxplot(y=data[variable], data=data)

    @staticmethod
    def time_series_plotter(data, variable):
        #plot the raw time series
        data[variable].plot(figsize=(18,7), linewidth=5, fontsize=20)
        plt.xlabel('Date', fontsize=20);

    @staticmethod
    def moving_average_plotter(data, variable):
        #plot a 10-period rolling mean of the variable
        var = data[[variable]]
        var.rolling(10).mean().plot(figsize=(18,7), linewidth=5, fontsize=20)
        plt.xlabel('Date', fontsize=20);
"STATIONARITY IS A BITCH!"
Now sit back and relax before we continue. We will try to answer a few fundamental questions about time series. Time series are very different from panel data: first of all, a time series has constant time intervals, and we examine it to spot long-term trends and to forecast the future.
The dates are the index here, and we cannot shuffle them.
A time series is said to be stationary if its statistical properties, such as mean and variance, remain constant over time. But why is this important? Most TS models work on the assumption that the TS is stationary. Intuitively, if a TS has shown a particular behaviour over time, there is a very high probability that it will behave the same way in the future. Also, the theories related to stationary series are more mature and easier to implement than those for non-stationary series.
We will consider it stationary when it has a constant mean, a constant variance, and an autocovariance that does not depend on time.
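Before any formal test, a quick visual check is to plot rolling statistics and see whether they drift over time. This is just a minimal sketch on the earlier pm_data series, with an assumed 30-day window:
#visual stationarity check: the rolling mean and rolling standard deviation should stay roughly flat
rolling_mean = pm_data['energy_median'].rolling(30).mean()
rolling_std = pm_data['energy_median'].rolling(30).std()
plt.figure(figsize=(18,6))
plt.plot(pm_data['energy_median'], label='original')
plt.plot(rolling_mean, label='rolling mean (30 days)')
plt.plot(rolling_std, label='rolling std (30 days)')
plt.legend();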
Let's now pick a power meter and see whether it is stationary or not. The simplest way to do it is to apply a Dickey-Fuller test. Here the null hypothesis is that the data is non-stationary, and the alternative hypothesis is that it is stationary.
The good news would be that our time series is stationary, which means we should be able to reject the null hypothesis. That can only happen if our p-value is less than, say, 5%...
#get your data for a power meter of your choice. We use the class we have created before
data = pm.make_pm_df(df)
data.head()
pm.time_series_plotter(data, 'energy_sum')
pm.anomaly_checker(data, 'energy_sum')
There is an anomaly just before April, measuring zero consumption. We will try to model continuous consumption and assume that no one is moving out.
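A quick way to locate those readings (a minimal sketch, assuming zero really marks the anomalous days):
#show the days where the recorded daily consumption is exactly zero
data[data['energy_sum'] == 0]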
So let's filter them out and keep the data from 2012-07-02 until 2014-02-15.
data = data['2012-07-02':'2014-02-15']
#And apply the test
from statsmodels.tsa.stattools import adfuller
series = data['energy_sum']
X = series.values
result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
Here we can reject the null hypothesis, so our data is stationary.
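To turn that conclusion into an automated check, a small sketch using the conventional 5% significance level could look like this:
#reject the null hypothesis (non-stationarity) when the p-value is below the chosen significance level
alpha = 0.05
if result[1] < alpha:
    print('p-value %.4f < %.2f: reject H0, the series looks stationary' % (result[1], alpha))
else:
    print('p-value %.4f >= %.2f: cannot reject H0, the series may be non-stationary' % (result[1], alpha))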
In fact many time series are not stationary. If that is the case, we try to bring the series as close as possible to stationarity by working on trend and seasonality. However, that is a subject for another tutorial.
Since our data already looks stationary with a confidence greater than 99%, we won't deal with those steps and will proceed ahead.
moving_avg = data[['energy_sum']].rolling(45).mean()
plt.figure(figsize=(18,6))
plt.plot(data['energy_sum'])
plt.plot(moving_avg, color='red');
Forecasting a Time Series
There are AR (autoregressive) models and MA (moving average) models. We will use the ARIMA function, which handles and combines them. But what is ARIMA?
The ARIMA forecast for a stationary time series is nothing but a linear equation (like a linear regression). The predictors depend on the parameters (p, d, q) of the ARIMA model: p is the number of autoregressive (AR) terms, q is the number of moving average (MA) terms, and d is the number of differences taken.
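For reference, a sketch of the standard ARIMA(p, d, q) equation for the (d-times differenced) series $y_t$, with error terms $e_t$:

$$y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + e_t + \theta_1 e_{t-1} + \dots + \theta_q e_{t-q}$$

where the $\phi_i$ are the AR coefficients and the $\theta_j$ are the MA coefficients.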
An important concern here is how to determine the values of p and q. We use two plots to determine these numbers. Let's discuss them first.
The ACF and PACF plots of the TS can be drawn as follows (we skip differencing, since the series is already stationary):
from statsmodels.tsa.arima.model import ARIMA  #current statsmodels location (the old statsmodels.tsa.arima_model was removed)
#ACF and PACF plots:
from statsmodels.tsa.stattools import acf, pacf
lag_acf = acf(data['energy_sum'], nlags=10)
lag_pacf = pacf(data['energy_sum'], nlags=10, method='ols')
plt.figure(figsize=(8,3))
#Plot ACF:
plt.subplot(121)
plt.plot(lag_acf)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(data['energy_sum'])),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(data['energy_sum'])),linestyle='--',color='gray')
plt.title('Autocorrelation Function')
#Plot PACF:
plt.subplot(122)
plt.plot(lag_pacf)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(data['energy_sum'])),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(data['energy_sum'])),linestyle='--',color='gray')
plt.title('Partial Autocorrelation Function')
plt.tight_layout()
In this plot, the two dotted lines on either side of 0 are the confidence intervals. These can be used to determine the p and q values as follows:
p – The lag value where the PACF chart crosses the upper confidence interval for the first time. If you look closely, in this case p=4.
q – The lag value where the ACF chart first crosses the upper confidence interval. Here the ACF does not cross it, so q=0.
In ARIMA, p controls the AR part and q the MA part. With p=4 and q=0 this is essentially an AR model, although we keep a single MA term in the fit below.
THIS IS HOW THE FITTED VALUES LOOK, IN RED:
model = ARIMA(data['energy_sum'], order=(4, 0, 1))
results_AR = model.fit()
plt.figure(figsize=(18,6))
plt.plot(data['energy_sum'])
plt.plot(results_AR.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((results_AR.fittedvalues-data['energy_sum'])**2))
WE CAN ALSO VISUALIZE HOW IT IS FORECASTING:
data['forecast'] = results_AR.predict(start = 500, end= 600, dynamic= True)
data[['energy_sum', 'forecast']].plot(figsize=(18, 6))
plt.show()
WE CAN SEE THE DIFFERENCES BETWEEN PREDICTED AND THE ACTUAL VALUES:
residuals_AR = results_AR.fittedvalues - data['energy_sum']  #predicted minus actual
print(residuals_AR.tail())
WE CAN ADD SEASONAL PARAMETERS IF WE DON'T WANT THE FIT TO BE THAT SMOOTH:
import statsmodels.api as sm
# Applying a seasonal ARIMA (SARIMAX) model to forecast the data
mod = sm.tsa.SARIMAX(data['energy_sum'], trend='n', order=(4,0,1), seasonal_order=(1,1,1,12))
results = mod.fit()
THEN THE FORECAST WILL LOOK LIKE THIS:
data['forecast'] = results.predict(start = 500, end= 600, dynamic= True)
data[['energy_sum', 'forecast']].plot(figsize=(18, 6))
plt.show()
In the next tutorials we will try different models for different power meters, forecast into the future, and bring together ideas from other resources on the web. Please let me know about any problems you spot and I will fix them.
Thank you for reading