Wednesday, July 29, 2020

Stack Abuse: Deep Learning in Keras - Data Preprocessing

Introduction

Deep learning is currently one of the most interesting and promising areas of artificial intelligence (AI) and machine learning. Thanks to great advances in technology and algorithms in recent years, deep learning has opened the door to a new era of AI applications.

In many of these applications, deep learning algorithms have performed on par with human experts, and have sometimes surpassed them.

Python has become the go-to language for Machine Learning, and many of the most popular and powerful deep learning libraries and frameworks, like TensorFlow, Keras, and PyTorch, are built around it.

In this series, we'll be using Keras to perform Exploratory Data Analysis (EDA) and Data Preprocessing, and finally, build a Deep Learning Model and evaluate it.

If you haven't already, check out our first article - Deep Learning Models in Keras - Exploratory Data Analysis (EDA).

Data Preprocessing

In the preprocessing stage, we'll prepare the data to be fed to the Keras model. The first step is clearing the dataset of null values. Then, we'll use one-hot encoding to convert categorical variables to numerical variables. Neural Nets work with numerical data, not categorical.

We'll also split the data into a training and testing set. Finally, we'll standardize the data so that each feature has a mean of 0 and a standard deviation of 1, which puts most values roughly in the range of -1 to 1. This standardization both helps the model train better and allows it to converge more easily.

Dealing with Missing Values

Let's find out the number and percentage of missing values in each variable in the dataset:

missing_values = pd.DataFrame({
    'Column': df.columns.values,
    '# of missing values': df.isna().sum().values,
    '% of missing values': 100 * df.isna().sum().values / len(df),
})

missing_values = missing_values[missing_values['# of missing values'] > 0]
print(missing_values.sort_values(by='# of missing values', 
                                 ascending=False
                                ).reset_index(drop=True))

This code will produce the following table which shows us variables that contain missing values and how many missing values they contain:

Column # of missing values % of missing values
0 Pool QC 2917 99.5563
1 Misc Feature 2824 96.3823
2 Alley 2732 93.2423
3 Fence 2358 80.4778
4 Fireplace Qu 1422 48.5324
5 Lot Frontage 490 16.7235
6 Garage Cond 159 5.42662
7 Garage Qual 159 5.42662
8 Garage Finish 159 5.42662
9 Garage Yr Blt 159 5.42662
10 Garage Type 157 5.35836
11 Bsmt Exposure 83 2.83276
12 BsmtFin Type 2 81 2.76451
13 BsmtFin Type 1 80 2.73038
14 Bsmt Qual 80 2.73038
15 Bsmt Cond 80 2.73038
16 Mas Vnr Area 23 0.784983
17 Mas Vnr Type 23 0.784983
18 Bsmt Half Bath 2 0.0682594
19 Bsmt Full Bath 2 0.0682594
20 Total Bsmt SF 1 0.0341297

Since Pool QC, Misc Feature, Alley, Fence, and Fireplace Qu variables contain a high percentage of missing values as shown in the table, we will simply remove them as they probably won't affect the results much at all:

df.drop(['Pool QC', 'Misc Feature', 'Alley', 'Fence', 'Fireplace Qu'], 
        axis=1, inplace=True)

For other variables that contain missing values, we will replace these missing values depending on the data type of the variable: whether it is numerical or categorical.

If it is numerical, we will replace the missing values with the variable mean. If it is categorical, we will replace them with the variable mode. This fills the gaps in a neutral way and avoids the false bias that missing values can otherwise introduce.

To know which variables are numerical and which are categorical, we will print out 5 unique items for each of the variables that contain missing values using this code:

cols_with_missing_values = df.columns[df.isna().sum() > 0]
for col in cols_with_missing_values:
    print(col)
    print(df[col].unique()[:5])
    print('*'*30)

And we get the following results:

Lot Frontage
[141.  80.  81.  93.  74.]
******************************
Mas Vnr Type
['Stone' 'None' 'BrkFace' nan 'BrkCmn']
******************************
...

Let's replace the missing values of the numerical variables with their respective means:

num_with_missing = ['Lot Frontage', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2', 
                    'Bsmt Unf SF', 'Total Bsmt SF', 'Bsmt Full Bath', 'Bsmt Half Bath', 
                    'Garage Yr Blt', 'Garage Cars', 'Garage Area']

for n_col in num_with_missing:
    df[n_col] = df[n_col].fillna(df[n_col].mean())

Here, we just put their names in a list and filled in the missing values column by column. Next, let's replace the missing values for the categorical variables:

cat_with_missing = [x for x in cols_with_missing_values if x not in num_with_missing]

for c_col in cat_with_missing:
    df[c_col] = df[c_col].fillna(df[c_col].mode().to_numpy()[0])

After this step, our dataset will have no missing values in it.
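
As a quick sanity check, we can confirm that nothing slipped through by summing the remaining missing-value counts across all columns:

print(df.isna().sum().sum())

If everything went as planned, this prints 0.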

One-Hot Encoding of Categorical Variables

Keras models, like all machine learning models, fundamentally work with numerical data. Categorical data has no meaning to a computer, but it does to us. We need to convert these categorical variables into numerical representations in order for the dataset to be usable.

The technique that we will use to do that conversion is One-Hot Encoding. Pandas provides us with a simple way to automatically perform One-Hot encoding on all categorical variables in the data.
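
To get an intuition for what One-Hot Encoding does, here's a small standalone sketch with a made-up Color column (not part of our housing data):

import pandas as pd

# A toy categorical column with three categories
toy = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})
print(pd.get_dummies(toy))

Each category becomes its own indicator column (Color_Blue, Color_Green, Color_Red), and each row gets a 1 in the column matching its original category and 0 elsewhere (newer Pandas versions may show these as True/False).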

Before that though, we must ensure that no categorical variable in our data is represented as a numerical variable by accident.

Checking Variables Data Types

When we read a CSV dataset using Pandas as we did, Pandas automatically tries to determine the type of each variable in the dataset.

Sometimes, Pandas can determine this incorrectly - if a categorical variable is represented with numbers, it can wrongfully infer that it's a numerical variable.

Let's check if there are any data type discrepancies in the DataFrame:

data_types = pd.DataFrame({
    'Column': df.select_dtypes(exclude='object').columns.values,
    'Data type': df.select_dtypes(exclude='object').dtypes.values
})

print(data_types)

This gives us:

Column Data type
0 MS SubClass int64
1 Lot Frontage float64
2 Lot Area int64
3 Overall Qual int64
4 Overall Cond int64
5 Year Built int64
6 Year Remod/Add int64

Based on this table and the variable descriptions from Kaggle, we can spot which variables Pandas falsely considered numerical.

For example, MS SubClass was detected as a numerical variable with a data type of int64. However, based on the description of this variable, it specifies the type of the unit being sold.

If we take a look at the unique values of this variable:

df['MS SubClass'].unique().tolist()

We get this output:

[20, 60, 120, 50, 85, 160, 80, 30, 90, 190, 45, 70, 75, 40, 180, 150]

This variable represents different dwelling types as numbers, like 20 (one-story dwellings built in 1946 and newer), 60 (two-story dwellings built in 1946 and newer), etc.

This actually isn't a numerical variable but a categorical one. Let's convert it back into a categorical variable by reassigning it as a string:

df['MS SubClass'] = df['MS SubClass'].astype(str)
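
We can verify that the conversion took effect by checking the column's data type again:

print(df['MS SubClass'].dtype)

This should now report object instead of int64.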

Performing One-Hot Encoding

Before performing One-Hot Encoding, we want to select a subset of the features from our data to use from now on. We'll want to do so because our dataset contains 2,930 records and 75 features.

Many of these features are categorical. So if we keep all the features and perform One-Hot Encoding, the resulting number of features will be large and the model might suffer from the curse of dimensionality as a result.
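
If you're curious how large the encoded dataset could get, a rough estimate is the total number of unique categories across all categorical columns, since each one becomes its own dummy column. This quick check is optional and not needed for the rest of the pipeline:

# Each unique category in every object (categorical) column becomes a separate dummy column
cat_cols = df.select_dtypes(include='object').columns
print(df[cat_cols].nunique().sum())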

Let's make a list of the variables we want to keep in a subset and trim the DataFrame so we only use these:

selected_vars = ['MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
                 'Neighborhood', 'Overall Qual', 'Overall Cond',
                 'Year Built', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF',
                 'Gr Liv Area', 'Full Bath', 'Half Bath', 'Bedroom AbvGr', 
                 'Kitchen AbvGr', 'TotRms AbvGrd', 'Garage Area', 
                 'Pool Area', 'SalePrice']

df = df[selected_vars]

Now we can perform One-Hot Encoding easily by using Pandas' get_dummies() function:

df = pd.get_dummies(df)

After one-hot encoding, the dataset will have 67 variables. Here are the first few rows and columns - there are many more variables than this:

Lot Frontage Lot Area Overall Qual Overall Cond Year Built Total Bsmt SF 1st Flr SF 2nd Flr SF Gr Liv Area
0 141 31770 6 5 1960 1080 1656 0 1656
1 80 11622 5 6 1961 882 896 0 896
2 81 14267 6 6 1958 1329 1329 0 1329
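
We can also confirm the new shape of the DataFrame:

print(df.shape)

This should report 2,930 records and 67 columns.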

Splitting Data into Training and Testing Sets

One of the last steps in data preprocessing is to split the data into training and testing subsets. We'll be training the model on the training subset and evaluating it with an unseen test set.

We will split the data randomly so that the training set gets 80% of the data and the testing set gets 20%. Generally, the training set holds anywhere between 70-80% of the data, while the remaining 20-30% is used for testing and validation.

This is made really simple with Pandas' sample() and drop() functions:

train_df = df.sample(frac=0.8, random_state=9)
test_df = df.drop(train_df.index)

Now train_df holds our training data and test_df holds our testing data.
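
As an aside, the same 80/20 split could also be done with scikit-learn's train_test_split, if you prefer it over the Pandas approach; we'll stick with the Pandas version for the rest of this article:

from sklearn.model_selection import train_test_split

# random_state fixes the shuffle, so the split is reproducible
train_df, test_df = train_test_split(df, test_size=0.2, random_state=9)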

Next, we will store the target variable SalePrice separately for each of the training and testing sets:

train_labels = train_df.pop('SalePrice')
test_labels = test_df.pop('SalePrice')

We're removing the SalePrice value because, well, we want to predict it. There's no point predicting something we already know and have fed to the model. We'll be using the actual values to verify if our predictions are correct.

After this step, train_df will contain the predictor variables of our training data (i.e. all variables excluding the target variable), and train_labels will contain the target variable values for train_df. The same applies to test_df and test_labels.

We perform this operation to prepare for the next step of data scaling.

Note that Pandas' pop() function returns the specified column (in our case, SalePrice) from the DataFrame (train_df, for example), while also removing that column from the DataFrame.

At the end of this step, here are the number of records (rows) and features (columns) for each of train_df and test_df:

Set Number of records Number of features
train_df 2344 67
test_df 586 67

Moreover, train_labels has 2,344 labels for the 2,344 records of train_df and test_labels has 586 labels for the 586 records in test_df.
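
These counts are easy to double-check with the shape attribute:

print(train_df.shape, test_df.shape)          # 2344 and 586 rows respectively
print(train_labels.shape, test_labels.shape)  # (2344,) and (586,)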

Without preprocessing this data, we would have a much messier dataset to work with.

Data Scaling: Standardization

Finally, we will standardize each variable - except the target variable, of course - in our data.

For training data which is stored now in train_df, we will calculate the mean and standard deviation of each variable. After that, we will subtract the mean from the values of each variable and then divide the resulting values by the standard deviation.

For testing data, we will subtract the training data mean from the values of each variable and then divide the resulting values by the training data standard deviation.

If you'd like to read up on Calculating Mean, Median and Mode in Python or Calculating Variance and Standard Deviation in Python, we've got you covered!

We use values calculated from the training data because of a general principle: anything the model learns must be learned from its training data. Everything in the test dataset should be completely unknown to the model before testing.

Let's perform the standardization now:

predictor_vars = train_df.columns

for col in predictor_vars:
    # Calculating variable mean and std from training data
    col_mean = train_df[col].mean()
    col_std = train_df[col].std()
    if col_std == 0:
        col_std = 1e-20
    train_df[col] = (train_df[col] - col_mean) / col_std
    test_df[col] = (test_df[col] - col_mean) / col_std    

In this code, we first get the names of the predictor variables in our data. These names are the same for training and testing sets because these two sets contain the same variables but different data values.

Then for each predictor variable, we calculate the mean and standard deviation using the training data (train_df), subtract the calculated mean and divide by the calculated standard deviation.

Note that sometimes, the standard deviation is equal to 0 for some variables. In that case, we make the standard deviation equal to an extremely small amount because if we keep it equal to 0, we will get a division-by-zero error when we use it for division later.

This nets us standardized data in which each feature has a mean of 0 and a standard deviation of 1, with most values falling roughly between -1 and 1.
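
If you'd rather not write the loop yourself, scikit-learn's StandardScaler achieves the same thing - it also learns its statistics from the training data only. Here's a sketch of that alternative:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only,
# then apply the same transformation to the test data
train_scaled = scaler.fit_transform(train_df)
test_scaled = scaler.transform(test_df)

Note that this returns NumPy arrays rather than DataFrames, and that StandardScaler uses the population standard deviation, so the results can differ very slightly from the manual Pandas version above.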

With that done, our dataset is ready to be used to train and evaluate a model. We'll be building a deep neural network in the next article.

Conclusion

Data preprocessing is a crucial step in a Machine Learning pipeline. Without dropping certain variables, dealing with missing values, encoding categorical values, and standardizing - we'd be feeding a messy (or even unusable) dataset into a model.

The model will only be as good as the data we feed it and in this article - we've prepped a dataset to fit a model.


