Introduction
Deep learning is currently one of the most interesting and promising areas of artificial intelligence (AI) and machine learning. With great advances in technology and algorithms in recent years, deep learning has opened the door to a new era of AI applications.
In many of these applications, deep learning algorithms have performed on par with human experts, and have sometimes surpassed them.
Python has become the go-to language for machine learning, and many of the most popular and powerful deep learning libraries and frameworks, such as TensorFlow, Keras, and PyTorch, offer Python as their primary interface.
In this series, we'll be using Keras to perform Exploratory Data Analysis (EDA) and data preprocessing, and finally, we'll build a deep learning model and evaluate it.
If you haven't already, check out our first article - Deep Learning Models in Keras - Exploratory Data Analysis (EDA).
Data Preprocessing
In the preprocessing stage, we'll prepare the data to be fed to the Keras model. The first step is clearing the dataset of null values. Then, we'll use one-hot encoding to convert categorical variables to numerical variables. Neural Nets work with numerical data, not categorical.
We'll also split the data into training and testing sets. Finally, we'll standardize the data so that most values fall roughly between -1 and 1. Standardization helps the model train better and converge more easily.
Dealing with Missing Values
Let's find out the number and percentage of missing values in each variable in the dataset:
missing_values = pd.DataFrame({
    'Column': df.columns.values,
    '# of missing values': df.isna().sum().values,
    '% of missing values': 100 * df.isna().sum().values / len(df),
})

missing_values = missing_values[missing_values['# of missing values'] > 0]

print(missing_values.sort_values(by='# of missing values', ascending=False)
      .reset_index(drop=True))
This code will produce the following table which shows us variables that contain missing values and how many missing values they contain:
 | Column | # of missing values | % of missing values
0 | Pool QC | 2917 | 99.5563 |
1 | Misc Feature | 2824 | 96.3823 |
2 | Alley | 2732 | 93.2423 |
3 | Fence | 2358 | 80.4778 |
4 | Fireplace Qu | 1422 | 48.5324 |
5 | Lot Frontage | 490 | 16.7235 |
6 | Garage Cond | 159 | 5.42662 |
7 | Garage Qual | 159 | 5.42662 |
8 | Garage Finish | 159 | 5.42662 |
9 | Garage Yr Blt | 159 | 5.42662 |
10 | Garage Type | 157 | 5.35836 |
11 | Bsmt Exposure | 83 | 2.83276 |
12 | BsmtFin Type 2 | 81 | 2.76451 |
13 | BsmtFin Type 1 | 80 | 2.73038 |
14 | Bsmt Qual | 80 | 2.73038 |
15 | Bsmt Cond | 80 | 2.73038 |
16 | Mas Vnr Area | 23 | 0.784983 |
17 | Mas Vnr Type | 23 | 0.784983 |
18 | Bsmt Half Bath | 2 | 0.0682594 |
19 | Bsmt Full Bath | 2 | 0.0682594 |
20 | Total Bsmt SF | 1 | 0.0341297 |
Since the Pool QC, Misc Feature, Alley, Fence, and Fireplace Qu variables contain a high percentage of missing values, as shown in the table, we will simply remove them, as they probably won't affect the results much at all:
df.drop(['Pool QC', 'Misc Feature', 'Alley', 'Fence', 'Fireplace Qu'],
        axis=1, inplace=True)
For other variables that contain missing values, we will replace these missing values depending on the data type of the variable: whether it is numerical or categorical.
If it is numerical, we will replace the missing values with the variable's mean. If it is categorical, we will replace them with the variable's mode. Imputing in this way fills the gaps in a neutral manner and avoids the false bias that missing values can otherwise introduce.
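To make the idea concrete, here's a minimal, made-up illustration of the two imputation strategies (the series below are toy data, not from our dataset):
import pandas as pd
import numpy as np

numeric = pd.Series([10.0, np.nan, 30.0])
categorical = pd.Series(['A', 'B', np.nan, 'B'])

# The missing numeric value becomes 20.0 (the mean of 10 and 30)
print(numeric.fillna(numeric.mean()))
# The missing categorical value becomes 'B' (the most frequent category)
print(categorical.fillna(categorical.mode()[0]))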
To know which variables are numerical and which are categorical, we will print out 5 unique items for each of the variables that contain missing values using this code:
cols_with_missing_values = df.columns[df.isna().sum() > 0]

for col in cols_with_missing_values:
    print(col)
    print(df[col].unique()[:5])
    print('*' * 30)
And we get the following results:
Lot Frontage
[141. 80. 81. 93. 74.]
******************************
Mas Vnr Type
['Stone' 'None' 'BrkFace' nan 'BrkCmn']
******************************
...
Let's replace the missing values of the numerical variables with the mean:
num_with_missing = ['Lot Frontage', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2',
                    'Bsmt Unf SF', 'Total Bsmt SF', 'Bsmt Full Bath', 'Bsmt Half Bath',
                    'Garage Yr Blt', 'Garage Cars', 'Garage Area']

for n_col in num_with_missing:
    df[n_col] = df[n_col].fillna(df[n_col].mean())
Here, we simply listed the numerical columns with missing values and filled each one with its mean. Next, let's replace the missing values for the categorical variables:
cat_with_missing = [x for x in cols_with_missing_values if x not in num_with_missing]

for c_col in cat_with_missing:
    df[c_col] = df[c_col].fillna(df[c_col].mode().to_numpy()[0])
After this step, our dataset will have no missing values in it.
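We can quickly sanity-check that with a one-liner:
# The total count of missing values across all columns should now be 0
print(df.isna().sum().sum())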
One-Hot Encoding of Categorical Variables
Keras models, like all machine learning models, fundamentally work with numerical data. Categorical data has no meaning to a computer, though it does to us. We need to convert these categorical variables into numerical representations in order for the dataset to be usable.
The technique that we will use to do that conversion is One-Hot Encoding. Pandas provides us with a simple way to automatically perform One-Hot encoding on all categorical variables in the data.
Before that though, we must ensure that no categorical variable in our data is represented as a numerical variable by accident.
Checking Variables Data Types
When we read a CSV dataset using Pandas as we did, Pandas automatically tries to determine the type of each variable in the dataset.
Sometimes, Pandas can determine this incorrectly - if a categorical variable is represented with numbers, it can wrongfully infer that it's a numerical variable.
Let's check if there are any data type discrepancies in the DataFrame:
data_types = pd.DataFrame({
    'Column': df.select_dtypes(exclude='object').columns.values,
    'Data type': df.select_dtypes(exclude='object').dtypes.values
})
print(data_types)
 | Column | Data type
0 | MS SubClass | int64 |
1 | Lot Frontage | float64 |
2 | Lot Area | int64 |
3 | Overall Qual | int64 |
4 | Overall Cond | int64 |
5 | Year Built | int64 |
6 | Year Remod/Add | int64 |
Based on this table and the variable descriptions from Kaggle, we can spot which variables Pandas incorrectly treated as numerical.
For example, MS SubClass was detected as a numerical variable with a data type of int64. However, based on the description of this variable, it specifies the type of the unit being sold.
If we take a look at the unique values of this variable:
df['MS SubClass'].unique().tolist()
We get this output:
[20, 60, 120, 50, 85, 160, 80, 30, 90, 190, 45, 70, 75, 40, 180, 150]
This variable represents different unit types as numbers, like 20 (one-story dwellings built in 1946 and newer), 60 (two-story dwellings built in 1946 and newer), etc.
This actually isn't a numerical variable but a categorical one. Let's convert it back into a categorical variable by reassigning it as a string:
df['MS SubClass'] = df['MS SubClass'].astype(str)
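We can confirm the conversion took effect by printing the column's data type, which should now be object (Pandas' type for strings):
print(df['MS SubClass'].dtype)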
Performing One-Hot Encoding
Before performing One-Hot Encoding, we want to select a subset of features from our data to use from now on. We do this because our dataset contains 2,930 records and 75 features.
Many of these features are categorical. So if we keep all the features and perform One-Hot Encoding, the resulting number of features will be large and the model might suffer from the curse of dimensionality as a result.
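If you're curious how many columns we'd end up with, a quick check like the one below gives a rough estimate, since get_dummies() creates one indicator column per unique category of each categorical variable:
# Rough estimate of how many dummy columns encoding every categorical variable would produce
print(df.select_dtypes(include='object').nunique().sum())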
Let's make a list of the variables we want to keep in a subset and trim the DataFrame so we only use these:
selected_vars = ['MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
                 'Neighborhood', 'Overall Qual', 'Overall Cond',
                 'Year Built', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF',
                 'Gr Liv Area', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
                 'Kitchen AbvGr', 'TotRms AbvGrd', 'Garage Area',
                 'Pool Area', 'SalePrice']
df = df[selected_vars]
Now we can perform One-Hot Encoding easily by using Pandas' get_dummies() function:
df = pd.get_dummies(df)
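We can confirm the new dimensions of the encoded DataFrame with a quick shape check:
# The number of rows stays the same; the number of columns grows because each
# categorical variable is expanded into one indicator column per category
print(df.shape)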
After one-hot encoding, the dataset will have 67 variables. Here are the first few rows, truncated - there are many more variables than shown:
 | Lot Frontage | Lot Area | Overall Qual | Overall Cond | Year Built | Total Bsmt SF | 1st Flr SF | 2nd Flr SF | Gr Liv Area
0 | 141 | 31770 | 6 | 5 | 1960 | 1080 | 1656 | 0 | 1656 |
1 | 80 | 11622 | 5 | 6 | 1961 | 882 | 896 | 0 | 896 |
2 | 81 | 14267 | 6 | 6 | 1958 | 1329 | 1329 | 0 | 1329 |
Splitting Data into Training and Testing Sets
One of the last steps in data preprocessing is splitting the data into training and testing subsets. We'll train the model on the training subset and evaluate it on an unseen test set.
We will split the data randomly so that the training set holds 80% of the data and the testing set holds 20%. In general, the training set gets anywhere between 70-80% of the data, while the remaining 20-30% is used for validation.
This is made really simple with Pandas' sample() and drop() functions:
train_df = df.sample(frac=0.8, random_state=9)
test_df = df.drop(train_df.index)
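As an aside, if you prefer scikit-learn, its train_test_split() function achieves the same split; here's a roughly equivalent sketch, assuming scikit-learn is installed (note that with a different random mechanism it won't reproduce exactly the same rows as sample()/drop() above):
from sklearn.model_selection import train_test_split

# 80% training, 20% testing, with a fixed seed for reproducibility
train_df, test_df = train_test_split(df, test_size=0.2, random_state=9)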
Now train_df holds our training data and test_df holds our testing data.
Next, we will store the target variable SalePrice separately for each of the training and testing sets:
train_labels = train_df.pop('SalePrice')
test_labels = test_df.pop('SalePrice')
We're removing the SalePrice value because, well, we want to predict it. There's no point predicting something we already know and have fed to the model. We'll be using the actual values to verify if our predictions are correct.
After this step, train_df will contain the predictor variables of our training data (i.e. all variables excluding the target variable), and train_labels will contain the target variable values for train_df. The same applies to test_df and test_labels.
We perform this operation to prepare for the next step of data scaling.
Note that Pandas' pop() function returns the specified column (in our case, SalePrice) from the dataframe (train_df, for example) while removing that column from the dataframe.
At the end of this step, here are the numbers of records (rows) and features (columns) for each of train_df and test_df:
Set | Number of records | Number of features |
`train_df` | 2344 | 67 |
`test_df` | 586 | 67 |
Moreover, train_labels has 2,344 labels for the 2,344 records of train_df, and test_labels has 586 labels for the 586 records in test_df.
Without preprocessing this data, we would have a much messier dataset to work with.
Data Scaling: Standardization
Finally, we will standardize each variable - except the target variable, of course - in our data.
For the training data, which is now stored in train_df, we will calculate the mean and standard deviation of each variable. After that, we will subtract the mean from the values of each variable and then divide the resulting values by the standard deviation.
For testing data, we will subtract the training data mean from the values of each variable and then divide the resulting values by the training data standard deviation.
If you'd like to read up on Calculating Mean, Median and Mode in Python or Calculating Variance and Standard Deviation in Python, we've got you covered!
We use values calculated from the training data because of a general principle: anything the model learns must be learned from its training data. Everything in the test set should be completely unknown to the model before testing.
Let's perform the standardization now:
predictor_vars = train_df.columns

for col in predictor_vars:
    # Calculate the variable's mean and std from the training data only
    col_mean = train_df[col].mean()
    col_std = train_df[col].std()
    if col_std == 0:
        col_std = 1e-20
    train_df[col] = (train_df[col] - col_mean) / col_std
    test_df[col] = (test_df[col] - col_mean) / col_std
In this code, we first get the names of the predictor variables in our data. These names are the same for training and testing sets because these two sets contain the same variables but different data values.
Then, for each predictor variable, we calculate the mean and standard deviation using the training data (train_df), subtract the calculated mean, and divide by the calculated standard deviation.
Note that sometimes, the standard deviation is equal to 0 for some variables. In that case, we make the standard deviation equal to an extremely small amount because if we keep it equal to 0, we will get a division-by-zero error when we use it for division later.
This nets us standardized data, with most values falling roughly between -1 and 1.
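As an alternative to the manual loop above, scikit-learn's StandardScaler can do the same job; a minimal sketch, assuming scikit-learn is available (StandardScaler uses the population standard deviation rather than Pandas' sample standard deviation, so the values may differ very slightly, and it handles zero-variance columns internally):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit on the training data only, then apply the same transformation to both sets
train_df[predictor_vars] = scaler.fit_transform(train_df[predictor_vars])
test_df[predictor_vars] = scaler.transform(test_df[predictor_vars])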
With that done, our dataset is ready to be used to train and evaluate a model. We'll be building a deep neural network in the next article.
Conclusion
Data preprocessing is a crucial step in a Machine Learning pipeline. Without dropping certain variables, dealing with missing values, encoding categorical values and standardization - we'd be feeding a messy (or impossible) dataset into a model.
The model will only be as good as the data we feed it and in this article - we've prepped a dataset to fit a model.