Time Series Analysis Using ARIMA From Statsmodels

ARIMA and exponential moving averages are two methods for forecasting from time series data. In this notebook, I will talk about ARIMA, which is an acronym for AutoRegressive Integrated Moving Average.

Autoregressive Integrated Moving Average (ARIMA)

The general process for ARIMA models is the following:

  • Visualize the Time Series Data
  • Make the time series data stationary
  • Plot the Correlation and AutoCorrelation Charts
  • Construct the ARIMA Model or Seasonal ARIMA based on the data
  • Use the model to make predictions

Let's go through these steps!

Monthly Champagne Sales Data

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

For this example, I use the monthly champagne sales data, which is available on Kaggle: https://ift.tt/3sDiIi5

In [2]:
df=pd.read_csv('perrin-freres-monthly-champagne-.csv')
In [3]:
df.head()
Out[3]:
Month Perrin Freres monthly champagne sales millions '64-'72
0 1964-01 2815.0
1 1964-02 2672.0
2 1964-03 2755.0
3 1964-04 2721.0
4 1964-05 2946.0
In [4]:
df.tail()
Out[4]:
Month Perrin Freres monthly champagne sales millions '64-'72
102 1972-07 4298.0
103 1972-08 1413.0
104 1972-09 5877.0
105 NaN NaN
106 Perrin Freres monthly champagne sales millions... NaN

Data Cleaning

In [5]:
## Cleaning up the data
df.columns=["Month","Sales"]
df.head()
Out[5]:
Month Sales
0 1964-01 2815.0
1 1964-02 2672.0
2 1964-03 2755.0
3 1964-04 2721.0
4 1964-05 2946.0

Our objective is to forecast the champagne sales.

In [6]:
## Drop the last two junk rows (rows 105 and 106), starting with row 106
df.drop(106,axis=0,inplace=True)

axis=0 means the label is looked up along the rows (the index); axis=1 would look it up along the columns, as illustrated in the sketch below.
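As a quick, hypothetical illustration of the axis argument (the toy frame demo is not part of the champagne data):

In [ ]:
import pandas as pd

# Toy frame to show how drop() uses axis
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(demo.drop(1, axis=0))    # drops the row with index label 1
print(demo.drop("b", axis=1))  # drops the column named "b"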

In [7]:
df.tail()
Out[7]:
Month Sales
101 1972-06 5312.0
102 1972-07 4298.0
103 1972-08 1413.0
104 1972-09 5877.0
105 NaN NaN
In [8]:
df.drop(105,axis=0,inplace=True)
In [9]:
df.tail()
Out[9]:
Month Sales
100 1972-05 4618.0
101 1972-06 5312.0
102 1972-07 4298.0
103 1972-08 1413.0
104 1972-09 5877.0
In [10]:
# Convert Month into Datetime
df['Month']=pd.to_datetime(df['Month'])
In [11]:
df.head()
Out[11]:
Month Sales
0 1964-01-01 2815.0
1 1964-02-01 2672.0
2 1964-03-01 2755.0
3 1964-04-01 2721.0
4 1964-05-01 2946.0
In [13]:
df.set_index('Month',inplace=True)
In [14]:
df.head()
Out[14]:
Sales
Month
1964-01-01 2815.0
1964-02-01 2672.0
1964-03-01 2755.0
1964-04-01 2721.0
1964-05-01 2946.0
In [15]:
df.describe()
Out[15]:
Sales
count 105.000000
mean 4761.152381
std 2553.502601
min 1413.000000
25% 3113.000000
50% 4217.000000
75% 5221.000000
max 13916.000000

Visualize the Time Series Data

In [16]:
df.plot()
Out[16]:
<AxesSubplot:xlabel='Month'>

Testing for Stationarity of Data Using Statsmodels adfuller

Stationary data has statistical properties, such as the mean and variance, that do not change over time; in particular, it shows no trend.
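Before the formal test, a quick visual check (a sketch, not part of the original notebook) is to plot rolling statistics: for stationary data the rolling mean and standard deviation stay roughly flat.

In [ ]:
# Sketch: rolling mean/std as a visual stationarity check (12-month window assumed)
rolling = df['Sales'].rolling(window=12)
ax = df['Sales'].plot(figsize=(12, 6), label='Sales')
rolling.mean().plot(ax=ax, label='12-month rolling mean')
rolling.std().plot(ax=ax, label='12-month rolling std')
ax.legend()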

In [17]:
### Testing For Stationarity
from statsmodels.tsa.stattools import adfuller
In [18]:
test_result=adfuller(df['Sales'])
In [26]:
#Ho: the series is non-stationary (has a unit root)
#H1: the series is stationary

def adfuller_test(sales):
    result=adfuller(sales)
    labels = ['ADF Test Statistic','p-value','#Lags Used','Number of Observations Used']
    for value,label in zip(result,labels):
        print(label+' : '+str(value) )
    if result[1] <= 0.05:
        print("The p-value is less than 0.05, so we reject the null hypothesis (Ho): the data has no unit root and is stationary.")
    else:
        print("Weak evidence against the null hypothesis: the time series has a unit root, indicating it is non-stationary.")
In [27]:
adfuller_test(df['Sales'])
ADF Test Statistic : -1.8335930563276217
p-value : 0.3639157716602457
#Lags Used : 11
Number of Observations Used : 93
Weak evidence against the null hypothesis: the time series has a unit root, indicating it is non-stationary.

Differencing

Differencing subtracts an earlier observation from each value (for example, the first difference at time t is sales[t] - sales[t-1]); this removes trend from the data and helps make it stationary.

In [28]:
df['Sales First Difference'] = df['Sales'] - df['Sales'].shift(1)
In [29]:
df['Sales'].shift(1)
Out[29]:
Month
1964-01-01       NaN
1964-02-01    2815.0
1964-03-01    2672.0
1964-04-01    2755.0
1964-05-01    2721.0
               ...  
1972-05-01    4788.0
1972-06-01    4618.0
1972-07-01    5312.0
1972-08-01    4298.0
1972-09-01    1413.0
Name: Sales, Length: 105, dtype: float64

We have monthly data, so let us try a shift value of 12 to remove the yearly seasonality.

In [30]:
df['Seasonal First Difference']=df['Sales']-df['Sales'].shift(12)
In [31]:
df.head(14)
Out[31]:
Sales Sales First Difference Seasonal First Difference
Month
1964-01-01 2815.0 NaN NaN
1964-02-01 2672.0 -143.0 NaN
1964-03-01 2755.0 83.0 NaN
1964-04-01 2721.0 -34.0 NaN
1964-05-01 2946.0 225.0 NaN
1964-06-01 3036.0 90.0 NaN
1964-07-01 2282.0 -754.0 NaN
1964-08-01 2212.0 -70.0 NaN
1964-09-01 2922.0 710.0 NaN
1964-10-01 4301.0 1379.0 NaN
1964-11-01 5764.0 1463.0 NaN
1964-12-01 7312.0 1548.0 NaN
1965-01-01 2541.0 -4771.0 -274.0
1965-02-01 2475.0 -66.0 -197.0

Let us check if the data now is stationary.

In [32]:
## Run the Dickey-Fuller test again
adfuller_test(df['Seasonal First Difference'].dropna())
ADF Test Statistic : -7.626619157213163
p-value : 2.060579696813685e-11
#Lags Used : 0
Number of Observations Used : 92
The p-value is less than 0.05, so we reject the null hypothesis (Ho): the data has no unit root and is stationary.
In [33]:
df['Seasonal First Difference'].plot()
Out[33]:
<AxesSubplot:xlabel='Month'>

Autoregressive Model

In [32]:
from statsmodels.tsa.arima_model import ARIMA
import statsmodels.api as sm
  1. Partial Autocorrelation Function (PACF) - takes into account the impact of direct variables only
  2. Autocorrelation Function (ACF) - takes into account the impact of all variables (direct + indirect)

Let us plot the lags on the horizontal axis and the correlations on the vertical axis using the plot_acf and plot_pacf functions.

In [37]:
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
In [38]:
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(df['Seasonal First Difference'].iloc[13:],lags=40,ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(df['Seasonal First Difference'].iloc[13:],lags=40,ax=ax2)

In the graphs above, each spike (lag) that rises above the shaded confidence band is considered statistically significant. As a rule of thumb, the lag at which the PACF cuts off suggests the AR order p, and the lag at which the ACF cuts off suggests the MA order q.
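If you prefer numbers to plots, the same correlations can be computed directly with statsmodels' acf and pacf helpers (a sketch; the lag count of 5 is arbitrary):

In [ ]:
# Numeric counterparts of the ACF/PACF plots above
from statsmodels.tsa.stattools import acf, pacf
series = df['Seasonal First Difference'].iloc[13:]
print(acf(series, nlags=5))   # autocorrelations for lags 0..5
print(pacf(series, nlags=5))  # partial autocorrelations for lags 0..5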

In [ ]:
# For non-seasonal data the order tuple is (p, d, q):
# p=1 (AR order), d=1 (integration/differencing order), q=0 or 1 (MA order)
from statsmodels.tsa.arima_model import ARIMA
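One caveat: statsmodels.tsa.arima_model.ARIMA is deprecated and was removed in statsmodels 0.13, so the import above fails on newer versions. A guarded import is sketched below; note the replacement class has a slightly different fit API, so the summary layout further down may not match exactly.

In [ ]:
try:
    from statsmodels.tsa.arima_model import ARIMA  # statsmodels < 0.13
except ImportError:
    from statsmodels.tsa.arima.model import ARIMA  # statsmodels >= 0.13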
In [52]:
model=ARIMA(df['Sales'],order=(1,1,1))
model_fit=model.fit()
In [53]:
model_fit.summary()
Out[53]:
ARIMA Model Results
Dep. Variable: D.Sales No. Observations: 104
Model: ARIMA(1, 1, 1) Log Likelihood -951.126
Method: css-mle S.D. of innovations 2227.262
Date: Mon, 19 Apr 2021 AIC 1910.251
Time: 23:29:19 BIC 1920.829
Sample: 02-01-1964 HQIC 1914.536
- 09-01-1972
coef std err z P>|z| [0.025 0.975]
const 22.7835 12.405 1.837 0.066 -1.530 47.097
ar.L1.D.Sales 0.4343 0.089 4.866 0.000 0.259 0.609
ma.L1.D.Sales -1.0000 0.026 -38.503 0.000 -1.051 -0.949
Roots
Real Imaginary Modulus Frequency
AR.1 2.3023 +0.0000j 2.3023 0.0000
MA.1 1.0000 +0.0000j 1.0000 0.0000

We can also draw line and density plots of the residuals.

In [59]:
from matplotlib import pyplot
residuals = pd.DataFrame(model_fit.resid)
residuals.plot()
pyplot.show()
# density plot of residuals
residuals.plot(kind='kde')
pyplot.show()
# summary stats of residuals
print(residuals.describe())
                 0
count   104.000000
mean     87.809661
std    2257.896169
min   -6548.758563
25%    -821.138569
50%     -87.526059
75%    1221.542864
max    6177.251803

As we see above, the mean of the residuals is not exactly zero, which means there is some bias in the predictions.

In [54]:
df['forecast']=model_fit.predict(start=90,end=103,dynamic=True)
df[['Sales','forecast']].plot(figsize=(12,8))
Out[54]:
<AxesSubplot:xlabel='Month'>

If you observe the above, we are not getting good results using ARIMA because our data has seasonal behaviour, so let us try seasonal ARIMA.

In [42]:
import statsmodels.api as sm
In [55]:
model=sm.tsa.statespace.SARIMAX(df['Sales'],order=(1, 1, 1),seasonal_order=(1,1,1,12))
results=model.fit()

Note the seasonal_order tuple above, which takes the format (seasonal AR order, seasonal differencing order, seasonal MA order, seasonal periodicity); it is spelled out in the annotation below.
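Spelled out for the model above (the names are annotations, not statsmodels parameters):

In [ ]:
# seasonal_order = (P, D, Q, s) in the SARIMAX call above:
#   P = 1   seasonal AR order
#   D = 1   seasonal differencing order
#   Q = 1   seasonal MA order
#   s = 12  seasonal periodicity (12 months)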

In [56]:
results.summary()
Out[56]:
SARIMAX Results
Dep. Variable: Sales No. Observations: 105
Model: SARIMAX(1, 1, 1)x(1, 1, 1, 12) Log Likelihood -738.402
Date: Mon, 19 Apr 2021 AIC 1486.804
Time: 23:29:33 BIC 1499.413
Sample: 01-01-1964 HQIC 1491.893
- 09-01-1972
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 0.2790 0.081 3.433 0.001 0.120 0.438
ma.L1 -0.9494 0.043 -22.334 0.000 -1.033 -0.866
ar.S.L12 -0.4544 0.303 -1.499 0.134 -1.049 0.140
ma.S.L12 0.2450 0.311 0.788 0.431 -0.365 0.855
sigma2 5.055e+05 6.12e+04 8.265 0.000 3.86e+05 6.25e+05
Ljung-Box (L1) (Q): 0.26 Jarque-Bera (JB): 8.70
Prob(Q): 0.61 Prob(JB): 0.01
Heteroskedasticity (H): 1.18 Skew: -0.21
Prob(H) (two-sided): 0.64 Kurtosis: 4.45


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Let us again plot the line and density charts of the residuals.

In [60]:
from matplotlib import pyplot
residuals = pd.DataFrame(results.resid)
residuals.plot()
pyplot.show()
# density plot of residuals
residuals.plot(kind='kde')
pyplot.show()
# summary stats of residuals
print(residuals.describe())
                 0
count   105.000000
mean    -69.284285
std     996.587108
min   -6006.398653
25%    -475.852083
50%     -83.470336
75%     306.809583
max    2815.000000
In [57]:
df['forecast']=results.predict(start=90,end=103,dynamic=True)
df[['Sales','forecast']].plot(figsize=(12,8))
Out[57]:
<AxesSubplot:xlabel='Month'>

Conclusion: if you compare the ARIMA and SARIMA results, SARIMA gives much better results than ARIMA; its residual standard deviation (about 997) is less than half that of the ARIMA model (about 2258).

Forecasting the Next 5 Years Using SARIMA

In [45]:
5*12
Out[45]:
60
In [46]:
from pandas.tseries.offsets import DateOffset
future_dates = [df.index[-1] + DateOffset(months=x) for x in range(0, 60)]
In [47]:
future_dates_df=pd.DataFrame(index=future_dates[1:],columns=df.columns)
In [48]:
future_dates_df.tail()
Out[48]:
Sales Sales First Difference Seasonal First Difference forecast
1977-04-01 NaN NaN NaN NaN
1977-05-01 NaN NaN NaN NaN
1977-06-01 NaN NaN NaN NaN
1977-07-01 NaN NaN NaN NaN
1977-08-01 NaN NaN NaN NaN
In [49]:
future_df=pd.concat([df,future_dates_df])
In [50]:
future_df['forecast'] = results.predict(start=104, end=156, dynamic=True)
future_df[['Sales', 'forecast']].plot(figsize=(12, 8))
Out[50]:
<AxesSubplot:>
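
As a possible extension (not in the original notebook), the SARIMAX results object also exposes get_forecast, which returns point forecasts together with confidence intervals:

In [ ]:
# Sketch: forecast the next 60 months with 95% confidence intervals
pred = results.get_forecast(steps=60)
forecast_ci = pred.conf_int()
ax = df['Sales'].plot(figsize=(12, 8), label='Sales')
pred.predicted_mean.plot(ax=ax, label='forecast')
ax.fill_between(forecast_ci.index,
                forecast_ci.iloc[:, 0],
                forecast_ci.iloc[:, 1], alpha=0.2)
ax.legend()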

