Wednesday, January 30, 2019

Python Data: Stationary Data Tests for Time Series Forecasting

I wasn’t planning on making a ‘part 2’ to the Forecasting Time Series Data using Autoregression post from last week, but I really wanted to show how to use more advanced tests to check for stationary data. Additionally, I wanted to use a new dataset that I ran across on Kaggle for energy consumption at an hourly level (find the dataset here).  For this example, I’m going to be using the DEOK_hourly dataset (i’ve added it to my git repo here).  You can follow along with the jupyter notebook here.

In this post, I’m going to follow the same approach that I took in the previous one – using autoregression to forecast time series data after checking to ensure the data is stationary.

Checking for Stationary data

So, what do we need to do to check for stationary data?  We can do the following:

  • Plot the data – this is the first step and often will provide a great deal of information about your data. Regardless of the data you’re using or the steps you take afterwards, this should always be the first step in your process.
  • Statistics Summaries and Tests  – There are a plethora of statistical tests that you can / should run but a quick summary of your data is probably the best thing to do.  Additionally, you can run tests like the Dickey-Fuller test to help understand your data and its stationarity.

Let’s plot our data first and take a look at a couple different plots. First, let’s get our imports taken care of.

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
 
plt.rcParams['figure.figsize']=(20,10)
plt.style.use('ggplot')

Next, let’s load our data and plot the time series.

data = pd.read_csv('DEOK_hourly.csv')
data['Datetime']=pd.to_datetime(data['Datetime'])
data.set_index('Datetime', inplace=True)

DEOK Time Series plot

Looking at the data, it looks pretty stationary. There’s no real trend in the time series but there seems to be something that might be seasonality, so we’ll dig deeper into the data.  Let’s plot a histogram to see what the underlying distribution looks like.

data['DEOK_MW'].hist()

DEOK Histogram

Looks Gaussian with a bit of a long tail skew toward the right. From this histogram, I’m pretty confident that we have a stationary dataset otherwise we’d see something much less ‘bell-shaped’ due to trending and/or seasonality (e.g., we’d see more data plotted to the left or right).

Now, let’s look at some statistical tests. A simple one that you can use is to look at the mean and variance of multiple sections of the data and compare them. If they are similar, your data is most likely stationary.

There are many different ways to split the data for this check, but one way I like to do this is to follow the approach highlighted here.

one, two, three = np.split(
        data['DEOK_MW'].sample(
        frac=1), [int(.25*len(data['DEOK_MW'])),
        int(.75*len(data['DEOK_MW']))])

The above code creates three new series. I randomly selected 25% for series one and 75% for the two and three – but you could create them of equal length if you wanted. I like making them different sizes just for a bit of extra randomness to the test.

Next, we’ll look at the means and variances of each series to see what they look like. Remember, if the data is stationary, the means/variances should be similar.

mean1, mean2, mean3 = one.mean(), two.mean(), three.mean()
var1, var2, var3 = one.var(), two.var(), three.var()

print mean1, mean2, mean3
print var1, var2, var3

The output of this is:

3093.27497575 3107.45445099 3112.20124697
353154.655416 363558.421407 358899.692558

Not great formatting, but you can quickly see that the means and variances are similar, pointing to stationary data.

Now that you know how to find stationarity using some plots and some basic stats, you should know that the above tests can be fooled sometimes, especially since they make assumptions about your data. So…don’t rely on these only…they’re a quick way to see what you have without having to pull out the big guns and run things like the Dickey-Fuller test.

Dickey-Fuller Test for Stationarity

Officially, this is called the ‘augmented Dickey-Fuller test’, but most folks just say ‘Dickey-Fuller’ when talking about it.  This is a test that tests the null hypothesis that a unit root is present in time series data.    To make things a bit more clear, this test is checking for stationarity or non-stationary data.  The test is trying to reject the null hypothesis that a unit root exists and the data is non-stationary. If the null hypothesis is rejected, then the alternate can be considered valid (e.g., the data is stationary).  You can read more about the test here if interested.

When you run the test, you’ll get an ADF value and a p-value. The ADF number should be a negative number and the p-value should be beneath a certain threshold value (e.g., 1% or 5%, etc) for a confidence level. For this example, we’ll use 5% (or 95% confidence level), so if the p-value is greater than 0.05 then we say we fail to reject the null hypothesis because the data has a unit root and is non-stationary.  If the p-value is less than or equal to 0.05, we can say we reject the null hypothesis because the data does not have a unit root and is stationary.

Let’s run the Augmented Dickey-Fuller test and see what we see.  The statsmodels library has a function called adfuller to make it easy for us to run this test.

from statsmodels.tsa.stattools import adfuller

adf_test = adfuller(data['DEOK_MW'])

print "ADF = " + str(adf_test[0])
print "p-value = " +str(adf_test[1])

In this code, we import the adfuller library from the statsmodels library and then run our data through the test.  The full output of the test is:

(-14.913267801069782,
 1.4477674072055658e-27,
 57,
 57681,
 {'1%': -3.4304633751328555,
  '10%': -2.5667966716717614,
  '5%': -2.8615901096273602},
 669611.23911962728)

The ADF value is the first value in the result and the p-value is the 2nd.  The ‘1%’, ‘10%’ and ‘5%’ values are the critical values for 99%, 90% and 95% confidence levels.

Let’s look specifically at our ADF and p-values.

print "ADF = " + str(adf_test[0])
print "p-value = " +str(adf_test[1])

We get these results:

ADF = -14.9132678011
p-value = 1.44776740721e-27

Our p-value is definitely less than 0.5 and is even less than 0.01 so we can say with pretty good confidence that we can reject the null (unit root, non-stationary data) and can assume our data is stationary. Additionally, our ADF is much less than our 1% confidence value of -3.43, so we have another confirmation that we can reject the null.

Now that we know its stationary, we need to see if its correlated (remember there’s an assumption of dependance / correlation for autoregression). Let’s look at a lagplot.

pd.tools.plotting.lag_plot(data['DEOK_MW'])

DEOK Lag Plot

No question…that data is correlated somehow.

Now…we can actually DO something with the data! Let’s run a forecast on it now using autoregression.

Forecasting Time Series Data using Autoregression

We know our data is stationary and correlated (or at least we *believe* it is based on our tests). Let’s run our autoregression forecast and see what we see.

For this, we’ll use a different approach than we did before sine we have much more data. We’ll use the same training/testing data creation that we used in the previous post and create a 12 period testing dataset and prediction dataset (i.e., we are going to predict the ‘next’ 12 readings).

#create train/test datasets
X = data['DEOK_MW'].dropna()

train_data = X[1:len(X)-12]
test_data = X[len(X)-12:]

Now, we’ll run our the AR() model.

from statsmodels.tsa.ar_model import AR
from sklearn.metrics import r2_score

#train the autoregression model
model = AR(train_data)
model_fitted = model.fit()

print('The lag value chose is: %s' % model_fitted.k_ar)

The lag value chosen for this model is 59.  Now, let’s make some predictions and check the accuracy.

# make predictions 
predictions = model_fitted.predict(
    start=len(train_data), 
    end=len(train_data) + len(test_data)-1, 
    dynamic=False)


# create a comparison dataframe
compare_df = pd.concat(
    [data['DEOK_MW'].reset_index().tail(12),
    predictions], axis=1).rename(
    columns={'DEOK_MW': 'actual', 0:'predicted'})
compare_df=compare_df[['actual', 'predicted']].dropna()

In the above, we are making predictions and then creating a dataframe to compare the ‘predicted’ values versus the ‘actual’ values. Plotting these values together gives us the following.

DEOK Actual vs Predicted

Not a bad forecast with the cycle being pretty good but magnitude being a bit off. Let’s take a look at r-squared.

r2 = r2_score(compare_df.actual, compare_df.predicted)

Our r-squared is 0.76, which is pretty good for a first pass at this data and forecasting, especially given the fact that our lag is auto-selected for us.

Hopefully this helps shed some light on how to use statistical tests and plots to check for stationarity when running forecasts with time series data.


Contact me / Hire me

If you’re working for an organization and need help with forecasting, data science, machine learning/AI or other data needs, contact me and see how I can help. Also, feel free to read more about my background on my Hire Me page. I also offer data science mentoring services for beginners wanting to break into data science….if this is of interested, contact me.


To learn more about Time Series Forecasting, I highly recommend the following books:

 

The post Stationary Data Tests for Time Series Forecasting appeared first on Python Data.



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...