
Stack Abuse: Deep Learning Models in Keras - Exploratory Data Analysis (EDA)

Introduction

Deep learning is currently one of the most interesting and promising areas of artificial intelligence (AI) and machine learning. With great advances in technology and algorithms in recent years, it has opened the door to a new era of AI applications.

In many of these applications, deep learning algorithms perform on par with human experts and sometimes surpass them.

Python has become the go-to language for machine learning, and many of the most popular and powerful deep learning libraries and frameworks, such as TensorFlow, Keras, and PyTorch, provide their primary interfaces in Python.

In this article, we'll be performing Exploratory Data Analysis (EDA) on a dataset before Data Preprocessing and finally, building a Deep Learning Model in Keras and evaluating it.

Why Keras?

Keras is a deep learning API built on top of TensorFlow. TensorFlow is an end-to-end machine learning platform that allows developers to create and deploy machine learning models. TensorFlow was developed by Google and released under an open-source license in 2015.

Keras provides a high-level API for TensorFlow. It makes it really easy to build different types of machine learning models while taking advantage of TensorFlow's infrastructure and scalability.

It allows you to define, compile, train, and evaluate deep learning models using simple and concise syntax as we will see later in this series.
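Even though we'll build the full model later in the series, here is a minimal, hedged sketch of what that workflow looks like. The layer sizes and the random data below are placeholders for illustration only, not this tutorial's model:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Define a tiny regression network (placeholder architecture)
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(10,)),  # 10 input features (placeholder)
    layers.Dense(1)                                          # single regression output
])

# Compile with an optimizer and a loss function
model.compile(optimizer='adam', loss='mse')

# Train and evaluate on dummy data, just to show the API
X = np.random.rand(100, 10)
y = np.random.rand(100)
model.fit(X, y, epochs=5, verbose=0)
print(model.evaluate(X, y, verbose=0))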

Keras is very powerful; it is consistently among the most widely used deep learning tools by top finishers in Kaggle competitions.

House Price Prediction with Deep Learning

We will build a regression deep learning model to predict a house price based on the house characteristics such as the age of the house, the number of floors in the house, the size of the house, and many other features.

In the first article of the series, we'll be importing the packages and data and doing some Exploratory Data Analysis (EDA) to get familiar with the dataset we're working with.

Importing the Required Packages

In this preliminary step, we import the packages needed in the next steps:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

We're importing tensorflow, which includes Keras and some other useful tools. For code brevity, we're importing keras and layers separately, so that instead of tf.keras.layers.Dense we can simply write layers.Dense.

We're also importing pandas and numpy, which are widely used for storing, handling, and manipulating data.

And for visualizing and exploring the data, we import pyplot (as plt) from Matplotlib, along with Seaborn. Matplotlib is a fundamental visualization library, while Seaborn makes many common plots much simpler to produce.

Loading the Data

For this tutorial, we will work with a dataset that reports sales of residential units between 2006 and 2010 in a city called Ames which is located in Iowa, United States.

For each sale, the dataset describes many characteristics of the residential unit and lists the sale price of that unit. This sale price will be the target variable that we want to predict using the different characteristics of the unit.

The dataset actually contains a lot of characteristics data on each unit including the unit area, the year in which the unit was built, the size of the garage, the number of kitchens, the number of bathrooms, the number of bedrooms, the roof style, the type of the electrical system, the class of the building, and many others.

You can read more about the dataset on this page on Kaggle.

To download the exact dataset file that we will be using in this tutorial, visit its Kaggle page and click on the download button. This will download a CSV file containing the data.

We'll rename this file to AmesHousing.csv and load it inside our program using Pandas' read_csv() function:

df = pd.read_csv('AmesHousing.csv')
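A quick way to confirm what we just loaded is to print the DataFrame's dimensions:

print(df.shape)  # (2930, 82) — 2,930 rows and 82 columns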

The loaded dataset contains 2,930 rows (entries) and 82 columns (characteristics). Here's a truncated view of only a few rows and columns:

   Order        PID  MS SubClass MS Zoning  Lot Frontage  Lot Area Street
0      1  526301100           20        RL           141     31770   Pave
1      2  526350040           20        RH            80     11622   Pave
2      3  526351010           20        RL            81     14267   Pave

As we said earlier, each row describes a residential unit sale by specifying many characteristics of the unit and its sale price. And, again, to get more information about the meaning of each variable in this dataset, please visit this page on Kaggle.

Before we proceed, we will remove some features (columns) from the dataset because they don't provide any useful information to the model. These features are Order and PID:

df.drop(['Order', 'PID'], axis=1, inplace=True)
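After the drop, the DataFrame should be left with 80 columns:

print(df.shape)  # (2930, 80)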

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) helps us understand the data better and spot patterns in it. The most important variable to explore in the data is the target variable: SalePrice.

A machine learning model is only as good as its training data; if you want to understand your model, you first need to understand the data it learns from. That's why the first step in building any model should be thorough data exploration.

Since the end-goal is predicting house values, we'll focus on the SalePrice variable and the variables that have high correlation with it.

Sale Price Distribution

First, let's take a look at the distribution of SalePrice. Histograms are a great and simple way to take a look at distributions of variables. Let's use Matplotlib to plot a histogram that displays the distribution of the SalePrice:

fig, ax = plt.subplots(figsize=(14,8))
sns.distplot(df['SalePrice'], kde=False, ax=ax)

The image below shows the resulting histogram after applying some formatting to enhance the appearance:

sale price histogram
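The exact formatting used for the screenshot isn't shown in the snippet above; one hedged example of what it could look like is adding a title and axis labels on the same ax object:

ax.set_title('Distribution of SalePrice')   # plot title
ax.set_xlabel('Sale Price ($)')             # label the x-axis
ax.set_ylabel('Number of Houses')           # label the y-axis
plt.show()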

We can also look at the SalePrice distribution using different types of plots. For example, let's make a swarm plot of SalePrice:

fig, ax = plt.subplots(figsize=(14,8))
sns.swarmplot(df['SalePrice'], color='#2f4b7c', alpha=0.8, ax=ax)

This would result in:

sale price swarm plot

By looking at the histogram and swarm plot above, we can see that for most units, the sale price ranges from $100,000 to $200,000. If we generate a description of the SalePrice variable using Pandas' describe() function:

print(df['SalePrice'].describe().apply(lambda x: '{:,.1f}'.format(x)))

We'll receive:

count      2,930.0
mean     180,796.1
std       79,886.7
min       12,789.0
25%      129,500.0
50%      160,000.0
75%      213,500.0
max      755,000.0
Name: SalePrice, dtype: object

From here, we know that:

  • The average sale price is $180,796
  • The minimum sale price is $12,789
  • The maximum sale price is $755,000

Correlation with Sale Price

Now, let's see how predictor variables in our data correlate with the target SalePrice. We will calculate these correlation values using Pearson's method and then visualize the correlations using a heatmap:

fig, ax = plt.subplots(figsize=(10,14))
saleprice_corr = df.corr()[['SalePrice']].sort_values(
  by='SalePrice', ascending=False)
sns.heatmap(saleprice_corr, annot=True, ax=ax) 

And here is the heatmap that shows how predictor variables are correlated with SalePrice.

Lighter colors in the map indicate stronger positive correlations, while darker colors indicate weaker positive, or even negative, correlations:

correlation heatmap

Obviously, the SalePrice variable is perfectly correlated with itself (a correlation of 1). However, some other variables are highly correlated with SalePrice, and we can draw some conclusions from them.

For example, we can see that SalePrice is highly correlated with the Overall Qual variable which describes the overall quality of material and finish of the house. We can also see a high correlation with Gr Liv Area which specifies the above-ground living area of the unit.
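If you'd rather read exact numbers than heatmap colors, you can also print the top of the sorted correlation table we computed above, for example the ten strongest values:

# SalePrice itself appears first with a value of 1.0
print(saleprice_corr.head(10))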

Examining the Different Correlation Degrees

Now that we have some variables that are highly correlated with SalePrice in mind, let's examine the correlations more deeply.

Some variables are highly correlated with the SalePrice, and some aren't. By checking these out, we can draw conclusions on what's prioritized when people are buying properties.

High Correlation

First, let's look at two variables that have high positive correlation with SalePrice - namely Overall Qual which has a correlation value of 0.8 and Gr Liv Area which has a correlation value of 0.71.

Overall Qual represents the overall quality of material and finish of the house. Let's explore their relationship further by plotting a scatter plot, using Matplotlib:

fig, ax = plt.subplots(figsize=(14,8))
ax.scatter(x=df['Overall Qual'], y=df['SalePrice'], color="#388e3c", 
           edgecolors="#000000", linewidths=0.1, alpha=0.7);
plt.show()

Here is the resulting scatter plot:

high correlation scatter plot

We can clearly see that as the overall quality increases, the house sale price tends to increase as well. The increase isn't quite linear, but if we drew a trendline, it would be relatively close to linear.
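To see how close to linear the trend is, one option (not shown in the original article) is to overlay a simple least-squares line fitted with NumPy:

fig, ax = plt.subplots(figsize=(14,8))
ax.scatter(x=df['Overall Qual'], y=df['SalePrice'], color="#388e3c",
           edgecolors="#000000", linewidths=0.1, alpha=0.7)
# Fit a straight line (degree-1 polynomial) to the points and draw it
slope, intercept = np.polyfit(df['Overall Qual'], df['SalePrice'], deg=1)
xs = np.linspace(df['Overall Qual'].min(), df['Overall Qual'].max(), 100)
ax.plot(xs, slope * xs + intercept, color="#000000", linewidth=2)
plt.show()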

Now, let's see how Gr Liv Area and SalePrice relate to each other with another scatter plot:

fig, ax = plt.subplots(figsize=(14,8))
ax.scatter(x=df['Gr Liv Area'], y=df['SalePrice'], color="#388e3c", 
           edgecolors="#000000", linewidths=0.1, alpha=0.7);
plt.show()

Here is the resulting scatter plot:

high correlation scatter plot

Again, we can clearly see the high positive correlation between Gr Liv Area and SalePrice in this scatter plot. They tend to increase with each other, with a few outliers.

Moderate Correlation

Next, let's look at variables that have a moderate positive correlation with SalePrice. We will look at Lot Frontage which has a correlation value of 0.36 and Full Bath which has a correlation value of 0.55.

Lot Frontage represents the linear feet of street connected to the property (the frontage of the lot). And Full Bath represents the number of full bathrooms above ground.

Similar to what we have done with Overall Qual and Gr Liv Area, we will plot two scatter plots to visualize the relationships between these variables and the SalePrice.

Let's start with Lot Frontage:

fig, ax = plt.subplots(figsize=(14,8))
ax.scatter(x=df['Lot Frontage'], y=df['SalePrice'], color="orange", 
           edgecolors="#000000", linewidths=0.5, alpha=0.5);
plt.show()

moderate correlation scatter plot

Here, you can see a much weaker correlation. Even with larger lots in front of the properties, the price doesn't go up by much. There is a positive correlation between the two, but it doesn't seem to be as important to buyers as some other variables.

Then, let's show the scatter plot for Full Bath:

fig, ax = plt.subplots(figsize=(14,8))
ax.scatter(x=df['Full Bath'], y=df['SalePrice'], color="orange", 
           edgecolors="#000000", linewidths=0.5, alpha=0.5);
plt.show()

moderate correlation scatter plot

Here, you can also see a positive correlation, which isn't that weak, but also isn't too strong. A good portion of houses with two full bathrooms have the exact same price as the houses with only one bathroom. The number of bathrooms does influence the price, but not too much.
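One way to put a number on this (not done in the article) is to compare the median sale price for each bathroom count:

# Median sale price for each number of full bathrooms above ground
print(df.groupby('Full Bath')['SalePrice'].median())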

Low Correlation

Finally, let's look at variables that have little to no correlation with SalePrice and compare them with what we've seen so far. We will look at Yr Sold, which has a correlation value of -0.031, and Bsmt Unf SF, which has a correlation value of 0.18.

Yr Sold represents the year in which the house was sold. And Bsmt Unf SF represents the unfinished basement area in square feet.

Let's start with Yr Sold:

import matplotlib.ticker as ticker

fig, ax = plt.subplots(figsize=(14,8))
ax.scatter(x=df['Yr Sold'], y=df['SalePrice'], color="#b71c1c", 
           edgecolors="#000000", linewidths=0.1, alpha=0.5);
# Display the years on the x-axis as plain integers
ax.xaxis.set_major_formatter(
  ticker.FuncFormatter(func=lambda x, pos: str(int(x))))
plt.show()

low correlation scatter plot

The correlation here is so weak that it's fair to say there's essentially no relationship between these two variables; the prices of properties in this dataset didn't change much between 2006 and 2010.
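You can check this directly by looking at the typical sale price per year of sale:

# Median sale price for each year in which houses were sold
print(df.groupby('Yr Sold')['SalePrice'].median())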

Let's also make a plot for Bsmt Unf SF:

fig, ax = plt.subplots(figsize=(14,8))
ax.scatter(x=df['Bsmt Unf SF'], y=df['SalePrice'], color="#b71c1c", 
           edgecolors="#000000", linewidths=0.1, alpha=0.5);
plt.show()

low correlation scatter plot

Here, we can see some properties with a small unfinished basement area selling for more than ones with a large one. Then again, this could be due to pure chance, as there isn't an apparent correlation between the two variables.

It's safe to assume that Bsmt Unf SF doesn't have much to do with the SalePrice.

Conclusion

In this article, we've taken the first steps of most machine learning projects. We started off by downloading and loading a dataset that we're interested in.

Then, we performed Exploratory Data Analysis on the data to get a good understanding of what we're dealing with. A machine learning model is only as good as its training data, and understanding that data is a prerequisite to understanding the model.

Finally, we've chosen a few variables and checked for their correlation with the main variable we're eyeing - the SalePrice variable.


