Daily Python: Updated: Using Pandas To Create an Excel Diff

Introduction

Several years ago, I wrote an article about using pandas to creating a diff of two excel files. Over the years, the pandas API has changed and the diff script no longer works with the latest pandas releases. Through the magic of search engines, people are still discovering the article and are asking for help in getting it to work with more recent versions of pandas. Since pandas is closing in on a 1.0 release, I think this is a good time to get an updated version out there.

Get Started

I encourage you to read the the earlier article in order to learn more about the goals and potential limitations of this approach. This article would not have been possible without all the comentors that provided fixes and updates to the code. To be perfectly honest, there are a couple of tricky operations in this code that I probably would not have figured out if not for the kindess of the people that read this blog.

I am going to go through this fairly quickly. If there is interest, I may cover a couple of these concepts in a future post. Feel free to follow along in the notebook. The data files are on github as well.

The first step is to import pandas and read in the files:

import pandas as pd

# Read in the two files but call the data old and new and create columns to track
old = pd.read_excel('sample-address-1.xlsx', 'Sheet1', na_values=['NA'])
new = pd.read_excel('sample-address-2.xlsx', 'Sheet1', na_values=['NA'])
old['version'] = "old"
new['version'] = "new"

Here’s what the top of the new dataframe looks like:

	account number	name	street	city	state	postal code	version
0	935480	Bruen Group	5131 Nienow Viaduct Apt. 290	Port Arlie	Alabama	14118	new
1	371770	Cruickshank-Boyer	839 Lana Expressway Suite 234	South Viviana	Alabama	57838	new
2	548367	Spencer, Grady and Herman	65387 Lang Circle Apt. 516	Greenholtbury	Alaska	58394	new
3	132971	Williamson, Schumm and Hettinger	89403 Casimer Spring	Jeremieburgh	Arkansas	62785	new
4	985603	Bosco-Upton	03369 Moe Way	Port Casandra	Arkansas	86014	new

Our diff process is looking for three types of changes:

What are the new accounts?
What are the removed/dropped accounts?
What accounts have changed information?

Since everything is keyed on the account number we can use python sets to make the process simpler to understand. This change from the original approach makes the overall code a little easier to understand in my opinion. I have not tested this on large datasets to evaluate the performance.

old_accts_all = set(old['account number'])
new_accts_all = set(new['account number'])

dropped_accts = old_accts_all - new_accts_all
added_accts = new_accts_all - old_accts_all

Next we join all the data together and get a clean list of unique data and keep all changed rows by using drop_duplicates

all_data = pd.concat([old,new],ignore_index=True)
changes = all_data.drop_duplicates(subset=["account number",
                                           "name", "street",
                                           "city","state",
                                           "postal code"], keep='last')

Next, we need to figure out which account numbers have duplicate entries. A duplicate account number is an indication that they have changed values in a field that we need to flag. We can use the duplicated function to get a list of all those account numbers and filter out just those duplicated accounts:

dupe_accts = changes[changes['account number'].duplicated() == True]['account number'].tolist()
dupes = changes[changes["account number"].isin(dupe_accts)]

Here is what the dupes look like now:

	account number	name	street	city	state	postal code	version
24	595932	Kuhic, Eichmann and West	4059 Tobias Inlet	New Rylanfurt	Illinois	89271	old
30	558879	Watsica Group	95616 Enos Grove Suite 139	West Atlas	Iowa	47419	old
96	880043	Beatty Inc	3641 Schaefer Isle Suite 171	North Gardnertown	Wyoming	64318	old
123	595932	Kuhic, Eichmann and West	4059 Tobias St	New Rylanfurt	Illinois	89271	new
129	558879	Watsica Group	829 Big street	Smithtown	Ohio	47919	new
195	880043	Beatty Inc	3641 Schaefer Isle Suite 171	North Gardnertown	Wyoming	64918	new

Now we break out the old and new data, remove the unnecesary version column and set the account number as the index. These steps set up the data for the final comparison.

# Pull out the old and new data into separate dataframes
change_new = dupes[(dupes["version"] == "new")]
change_old = dupes[(dupes["version"] == "old")]

# Drop the temp columns - we don't need them now
change_new = change_new.drop(['version'], axis=1)
change_old = change_old.drop(['version'], axis=1)

# Index on the account numbers
change_new.set_index('account number', inplace=True)
change_old.set_index('account number', inplace=True)

# Combine all the changes together
df_all_changes = pd.concat([change_old, change_new],
                            axis='columns',
                            keys=['old', 'new'],
                            join='outer')

Now we have a dataframe that looks like this:

	old					new
	name	street	city	state	postal code	name	street	city	state	postal code
account number
595932	Kuhic, Eichmann and West	4059 Tobias Inlet	New Rylanfurt	Illinois	89271	Kuhic, Eichmann and West	4059 Tobias St	New Rylanfurt	Illinois	89271
558879	Watsica Group	95616 Enos Grove Suite 139	West Atlas	Iowa	47419	Watsica Group	829 Big street	Smithtown	Ohio	47919
880043	Beatty Inc	3641 Schaefer Isle Suite 171	North Gardnertown	Wyoming	64318	Beatty Inc	3641 Schaefer Isle Suite 171	North Gardnertown	Wyoming	64918

Before we do our final big combine, we need to define a function that will show us what has changed from column to column:

# Define the diff function to show the changes in each field
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

We now use the swaplevel function to get the old and new columns next to each other:

df_all_changes = df_all_changes.swaplevel(axis='columns')[change_new.columns[0:]]

Which makes this change to the dataframe:

	name		street		city		state		postal code
	old	new	old	new	old	new	old	new	old	new
account number
595932	Kuhic, Eichmann and West	Kuhic, Eichmann and West	4059 Tobias Inlet	4059 Tobias St	New Rylanfurt	New Rylanfurt	Illinois	Illinois	89271	89271
558879	Watsica Group	Watsica Group	95616 Enos Grove Suite 139	829 Big street	West Atlas	Smithtown	Iowa	Ohio	47419	47919
880043	Beatty Inc	Beatty Inc	3641 Schaefer Isle Suite 171	3641 Schaefer Isle Suite 171	North Gardnertown	North Gardnertown	Wyoming	Wyoming	64318	64918

The final tricky command is to use a groupby on the columns then apply, our custom report_diff function to compare the two corresponding columns to each other.

df_changed = df_all_changes.groupby(level=0, axis=1).apply(lambda frame: frame.apply(report_diff, axis=1))
df_changed = df_changed.reset_index()

Which yields a nice summary of the changed columns:

	account number	city	name	postal code	state	street
0	595932	New Rylanfurt	Kuhic, Eichmann and West	89271	Illinois	4059 Tobias Inlet —-> 4059 Tobias St
1	558879	West Atlas —-> Smithtown	Watsica Group	47419 —-> 47919	Iowa —-> Ohio	95616 Enos Grove Suite 139 —-> 829 Big street
2	880043	North Gardnertown	Beatty Inc	64318 —-> 64918	Wyoming	3641 Schaefer Isle Suite 171

The final analysis step is to figure out what has been removed and added:

df_removed = changes[changes["account number"].isin(dropped_accts)]
df_added = changes[changes["account number"].isin(added_accts)]

We can output everything to an Excel file with a separate tab for changes, additions and removals:

output_columns = ["account number", "name", "street", "city", "state", "postal code"]
writer = pd.ExcelWriter("my-diff.xlsx")
df_changed.to_excel(writer,"changed", index=False, columns=output_columns)
df_removed.to_excel(writer,"removed",index=False, columns=output_columns)
df_added.to_excel(writer,"added",index=False, columns=output_columns)
writer.save()

Here is what it looks like:

Conclusion

Thanks again to those that commented on the original article and provided ideas on how to improve. I hope some may find this useful. As always, if you find a great use for this script, let me know in the comments.

from Practical Business Python
read more

Daily Python

Monday, January 28, 2019

Updated: Using Pandas To Create an Excel Diff

Introduction

Get Started

Conclusion

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

Search This Blog