Monday, February 1, 2021

Stack Abuse: Python: How to Handle Missing Data in Pandas DataFrame

Introduction

Pandas is a Python library for data analysis and manipulation. Almost all operations in pandas revolve around DataFrames, an abstract data structure tailor-made for handling a metric ton of data.

In that metric ton of data, some values are bound to be missing for various reasons, resulting in missing (null/None/NaN) values in our DataFrame.

That's why, in this article, we'll be discussing how to handle missing data in a Pandas DataFrame.

Data Inspection

Real-world datasets are rarely perfect. They may contain missing values, wrong data types, unreadable characters, erroneous lines, etc.

The first step to any proper data analysis is cleaning and organizing the data we'll later be using. We will discuss a few common data-related problems that might occur in a dataset.

We will be working with a small employees dataset for this. The .csv file looks like this:

First Name,Gender,Salary,Bonus %,Senior Management,Team
Douglas,Male,97308,6.945,TRUE,Marketing
Thomas,Male,61933,NaN,TRUE,
Jerry,Male,NA,9.34,TRUE,Finance
Dennis,n.a.,115163,10.125,FALSE,Legal
,Female,0,11.598,,Finance
Angela,,,18.523,TRUE,Engineering
Shawn,Male,111737,6.414,FALSE,na
Rachel,Female,142032,12.599,FALSE,Business Development
Linda,Female,57427,9.557,TRUE,Client Services
Stephanie,Female,36844,5.574,TRUE,Business Development
,,,,,

Let's import it into a DataFrame:

import pandas as pd

df = pd.read_csv('out.csv')
df

This results in:

[Image: the original DataFrame, with missing entries shown as NaN]

Taking a closer look at the dataset, we note that Pandas automatically assigns NaN if the value for a particular cell is an empty string '', NA or NaN. However, there are cases where missing values are represented by a custom value, for example, the string 'na' or 0 in a numeric column.

For example, the row with index 6 (Shawn) has a value of na for the Team column, while the row with index 4 has a value of 0 for the Salary column.
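Before deciding how to treat such values, it helps to quantify how much data is actually missing. A minimal sketch (using a tiny frame with hypothetical values standing in for the employees dataset):

```python
import pandas as pd
import numpy as np

# A tiny frame mimicking the employees dataset (hypothetical values)
df = pd.DataFrame({
    "First Name": ["Douglas", None, "Jerry"],
    "Salary": [97308, 61933, np.nan],
})

# isnull() marks each cell as True/False; sum() counts the NaNs per column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

Note that this only counts values pandas already recognizes as missing - the custom markers like 'na' and 0 discussed below would not be counted yet.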

Customizing Missing Data Values

In our dataset, we want to consider these as missing values:

  1. A 0 value in the Salary column
  2. An na value in the Team column

This can be achieved by using the na_values argument to set custom missing values. Here we pass it a dictionary, where each key is a column name and the corresponding value lists the data values that should be treated as missing in that column:

# This means that in Salary column, 0 is also considered a missing value.
# And 'na' is considered a missing value in the Team column.
df = pd.read_csv('out.csv', na_values={"Salary" : [0], "Team" : ['na']})
df

This results in:

[Image: the DataFrame with the customized values marked as NaN]

On the other hand, if we want a list of values to be treated as missing in all columns, we can pass that list of candidates directly to the na_values parameter:

missing_values = ["n.a.","NA","n/a", "na", 0]
df = pd.read_csv('out.csv', na_values = missing_values)
df

This results in:

[Image: the DataFrame with the global missing values marked as NaN]

Note that here, the value of Gender in the row with index 3 (Dennis) is also NaN now, since we defined n.a. as a missing value above.

Which implementation to choose depends on the nature of the dataset.

For example, some of the numeric columns in the dataset might need to treat 0 as a missing value while other columns may not. Therefore, you can use the first approach where you customize missing values based on columns.

Likewise, if we want to treat 0 for example as a missing value globally, we can utilize the second method and just pass an array of such values to the na_values argument.

Once we have identified all the missing values in the DataFrame and annotated them correctly, there are several ways we can handle missing data.

Removing Rows With Missing Values

One approach would be removing all the rows which contain missing values. This can easily be done with the dropna() function, specifically dedicated to this:

# Drops all rows with NaN values
df.dropna(axis=0,inplace=True)

This results in:

[Image: the DataFrame after dropping all rows containing NaN values]

inplace = True makes all the changes in the existing DataFrame without returning a new one. Without it, you'd have to re-assign the DataFrame to itself.

The axis argument specifies if you're working with rows or columns - 0 being rows, and 1 being columns.

You can control whether you want to remove the rows containing at least 1 NaN or all NaN values by setting the how parameter in the dropna method.

how : {'any', 'all'}

  • any: if any NA values are present, drop that label
  • all: if all values are NA, drop that label

df.dropna(axis=0, inplace=True, how='all')

This would only remove the last row from the dataset since how=all would only drop a row if all of the values are missing from the row.

Similarly, to drop columns containing missing values, just set axis=1 in the dropna method.
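As a sketch of both variations on a small, hypothetical frame - dropna also accepts a thresh parameter, which keeps only the rows that have at least that many non-missing values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, np.nan],  # entirely missing
    "C": [7.0, 8.0, 9.0],
})

# axis=1 drops every column that contains any NaN
only_full_columns = df.dropna(axis=1)

# thresh=2 keeps only rows with at least 2 non-missing values
rows_with_enough_data = df.dropna(thresh=2)

print(only_full_columns.columns.tolist())
print(len(rows_with_enough_data))
```

Here only column C survives the column-wise drop, and the middle row (with a single non-missing value) is removed by the thresh filter.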

Filling out Missing Values

It might not be the best approach to remove the rows containing missing values if such rows are abundant. They might contain valuable data in other columns and we don't want to skew the data towards an inaccurate state.

In this case, we have several options to assign appropriate values. The most common ones are listed below:

  • Fill NA with Mean, Median or Mode of the data
  • Fill NA with a constant value
  • Forward Fill or Backward Fill NA
  • Interpolate Data and Fill NA

Let's go through one by one.

Fill Missing DataFrame Values with Column Mean, Median and Mode

Let's start out with the fillna() method. It fills the NA-marked values with values you supply the method with.

For example, you can use the .median(), .mode() and .mean() functions on a column, and supply those as the fill value:

# Using median
df['Salary'].fillna(df['Salary'].median(), inplace=True)

# Using mean
df['Salary'].fillna(int(df['Salary'].mean()), inplace=True)

# Using mode - mode() returns a Series, so take its first value
df['Salary'].fillna(df['Salary'].mode()[0], inplace=True)

Now, if a salary is missing from a person's row, the mean, mode or median is used to fill that value. This way, you're not dropping these people from the dataset, and you're also not skewing the salary values.

While not perfect, this method allows you to introduce values that don't impact the overall dataset, since no matter how many averages you add, the average stays the same.
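We can verify that property on a small, hypothetical salary series - filling the gaps with the mean leaves the mean unchanged:

```python
import pandas as pd
import numpy as np

salaries = pd.Series([100.0, 200.0, np.nan, 300.0])

mean_before = salaries.mean()          # NaN is skipped: (100+200+300)/3 = 200
filled = salaries.fillna(mean_before)  # the gap becomes 200
mean_after = filled.mean()             # still 200 - the average is preserved

print(mean_before, mean_after)
```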

Fill Missing DataFrame Values with a Constant

You could also decide to fill the NA-marked values with a constant value. For example, you can put in a special string or numerical value:

df['Salary'].fillna(0, inplace=True)

At the very least, these values are now actual values, instead of na or NaN.
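Constant fills are also handy for categorical columns, where a sentinel label can make "missing" explicit. A minimal sketch with hypothetical team values:

```python
import pandas as pd

teams = pd.Series(["Marketing", None, "Finance", None])

# Replace missing categories with an explicit sentinel label
labeled = teams.fillna("Unknown")
```

Whether "Unknown" (or 0 for numbers) is an acceptable stand-in depends on how the column is used downstream - a constant can bias statistics like the mean.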

Forward Fill Missing DataFrame Values

This method would fill the missing values with first non-missing value that occurs before it:

df['Salary'].fillna(method='ffill', inplace=True)

Backward Fill Missing DataFrame Values

This method would fill the missing values with first non-missing value that occurs after it:

df['Salary'].fillna(method='bfill', inplace=True)
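Side by side, the two directions look like this on a small, hypothetical series (newer pandas versions also expose these directly as the .ffill() and .bfill() methods, which is the form sketched here):

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, np.nan, 40.0])

forward = s.ffill()   # each NaN copies the previous value: 10, 10, 10, 40
backward = s.bfill()  # each NaN copies the next value:     10, 40, 40, 40
```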

Fill Missing DataFrame Values with Interpolation

Finally, this method uses mathematical interpolation to determine what value would have been in the place of a missing value:

# 'polynomial' interpolation requires an order; interpolate() returns a new Series
df['Salary'] = df['Salary'].interpolate(method='polynomial', order=2)
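The default method='linear' is the easiest to see at work - each gap is estimated from its neighbors, as in this sketch on a hypothetical series:

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

# Linear interpolation estimates each NaN from the surrounding values
interpolated = s.interpolate(method='linear')
```

Each missing value lands exactly halfway between its neighbors (20 and 40 here), which tends to suit data with an underlying trend better than a single constant would.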

Conclusion

Data cleaning and preprocessing is a very important part of every data analysis and each data science project.

In this article, we went over several techniques for handling missing data, including customizing the missing-data values and imputing them using the mean, median, mode, a constant value, forward fill, backward fill and interpolation.



from Planet Python