Monday, January 6, 2020

Erik Marsja: Pandas drop_duplicates(): How to Drop Duplicated Rows

The post Pandas drop_duplicates(): How to Drop Duplicated Rows appeared first on Erik Marsja.

In this post, we will learn how to use Pandas drop_duplicates() to remove duplicate rows, rows that are duplicated on a combination of columns, and duplicated columns from a Pandas dataframe. That is, we will delete duplicate data and only keep the unique values.

This Pandas tutorial will cover the following: what’s needed to follow the tutorial, importing Pandas, and how to create a dataframe from a dictionary. After this, we will get into how to use Pandas drop_duplicates() to drop duplicate rows and duplicate columns.

Note that we will drop duplicates using Pandas and Pyjanitor, which is a Python package that extends Pandas with an API based on verbs. It’s much like working with the Tidyverse packages in R. See this post for more about working with Pyjanitor. However, we will only use Pyjanitor to drop duplicate columns from a Pandas dataframe.

Prerequisites

As usual in the Python tutorials focusing on Pandas, we need to have both Python 3 and Pandas installed. Python can be installed by downloading it here or by installing a Python distribution such as Anaconda or Canopy. If we install Anaconda, for instance, we’ll also get Pandas. Obviously, if we want to use Pyjanitor, we also need to install it: pip install pyjanitor. Note, this post explains how to install, use, and upgrade Python packages using pip or conda. That said, now we can continue the Pandas drop duplicates tutorial.
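
If we are unsure whether everything is installed, a quick way to check is to import the packages and print their versions. This is just a minimal sketch; the exact version numbers will, of course, depend on our installation.

import pandas as pd
import janitor  # Pyjanitor; importing it also registers extra dataframe methods
from importlib.metadata import version  # available from Python 3.8

print(pd.__version__)
print(version('pyjanitor'))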

Creating a Dataframe from a Dictionary

In this section of the Pandas drop_duplicates() tutorial, we are going to create a Pandas dataframe from a dictionary. First, we are going to create a dictionary and then we use pd.DataFrame() to create a Pandas dataframe.

import pandas as pd

data = {'FirstName':['Steve', 'Steve', 'Erica',
                      'John', 'Brody', 'Lisa', 'Lisa'],
        'SurName':['Johnson', 'Johnson', 'Ericson',
                  'Peterson', 'Stephenson', 'Bond', 'Bond'],
        'Age':[34, 34, 40, 
                44, 66, 51, 51],
        'Sex':['M', 'M', 'F', 'M',
               'M', 'F', 'F']}

df = pd.DataFrame(data)

Now, before removing duplicate rows, we are going to print the dataframe with the duplicated rows highlighted. Here, we use the Styling API, which enables us to apply conditional styling. In the code chunk below, we create a function that checks for duplicate rows and then sets the background color of those rows to red.

def highlight_dupes(x):
    df = x.copy()
    
    # Mark every row that occurs more than once in the dataframe
    df['Dup'] = df.duplicated(keep=False)
    mask = df['Dup'] == True
    
    # Return a dataframe of CSS strings with the same shape as the data
    df.loc[mask, :] = 'background-color: red'
    df.loc[~mask, :] = ''
    return df.drop('Dup', axis=1)

df.style.apply(highlight_dupes, axis=None)
[Image: Duplicated rows highlighted in red]

Note, the above code was found on Stack Overflow and adapted to the problem of the current post.

Finally, before going on and deleting duplicate rows we can use Pandas groupby() and size() to count the duplicated rows:

df.groupby(df.columns.tolist(),as_index=False).size()
[Image: Duplicated rows counted]
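
Another way to inspect the duplicated rows before we drop them is the duplicated() method. The code below is a small sketch of this approach; with keep=False, every copy of a duplicated row is marked, not only the later ones.

# Count how many rows are exact copies of an earlier row
print(df.duplicated().sum())

# Show every row that has at least one exact copy in the dataframe
df[df.duplicated(keep=False)]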

Learn more about grouping categorical data and describing the data using the groupby() method.

Pandas drop_duplicates(): Deleting Duplicate Rows

In this section, we are going to drop rows that are the same using Pandas drop_duplicates(). This is very simple; we just run the code example below.

df_new = df.drop_duplicates()
df_new

Now, in the output above we can see that the duplicate rows were removed from the Pandas dataframe, but we can count the rows again to double-check:

df_new.groupby(df_new.columns.tolist(), as_index=False).size()
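
If we only want to double-check the number of rows, comparing the shape attribute before and after is a quick alternative to grouping. With the example data above, two duplicated rows should have been removed (a small sketch):

# 7 rows before dropping duplicates, 5 rows after
print(df.shape)
print(df_new.shape)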

Pandas Drop Duplicates with Subset

If we want to remove duplicates from a Pandas dataframe where only one column, or a subset of columns, contains the same data, we can use the subset argument. When using the subset argument with Pandas drop_duplicates(), we tell the method which column, or list of columns, we want to be unique. Note, the original dataframe df still contains the duplicated rows; drop_duplicates() returned a new dataframe and left df unchanged:

df
[Image: Pandas dataframe with duplicated rows]

Furthermore, we can also specify which duplicated row we want to keep with the keep argument. If we don’t change it, it defaults to ‘first’ and removes all but the first occurrence of each duplicated combination:

df_new = df.drop_duplicates(subset=['FirstName', 'SurName'])
df_new

If we, on the other hand, set it to “last”, we will see a different result:

df_new = df.drop_duplicates(keep='last', subset=['FirstName', 'SurName'])
df_new
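
Note, the keep argument also accepts False, which removes every copy of a duplicated combination instead of keeping one of them. A minimal sketch of this variant:

# Drop all rows whose FirstName and SurName combination occurs more than once,
# keeping none of the copies
df_new = df.drop_duplicates(keep=False, subset=['FirstName', 'SurName'])
df_new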

Pandas Drop Duplicated Columns using Pyjanitor

In the last section of this Pandas drop duplicated data tutorial, we will work with Pyjanitor to remove duplicated columns. This is very easy: we do this with the drop_duplicate_columns() method and the column_name argument. Note, importing janitor registers this method on Pandas dataframes. Now, before dropping duplicated columns, we create a dataframe from a dictionary:

code class="lang-py">import pandas as pd

data = {'FirstName':['Steve', 'Steve', 'Erica',
                      'John', 'Brody', 'Lisa', 'Lisa'],
        'SurName':['Johnson', 'Johnson', 'Ericson',
                  'Peterson', 'Stephenson', 'Bond', 'Bond'],
        'Age':[34, 34, 40, 
                44, 66, 51, 51],
        'Sex':['M', 'M', 'F', 'M',
               'M', 'F', 'F'],
        'S':['M', 'M', 'F', 'M',
               'M', 'F', 'F']}

df = pd.DataFrame(data)

df

As can be seen above, we have the duplicated columns “Sex” and “S”, and we now remove the duplicated column:

code class="lang-py>df = df.drop_duplicate_columns(column_name='S')

Conclusion: Using Pandas drop_duplicates()

In conclusion, using Pandas drop_duplicates() is very easy, and in this post we have learned how to:

  • Delete duplicated rows
  • Delete duplicated rows using a subset of columns
  • Delete duplicated columns

The first two tasks were accomplished using drop_duplicates() in Pandas, and the third was done with Pyjanitor.
