Monday, November 9, 2020

Chris Moffitt: Comprehensive Guide to Grouping and Aggregating with Pandas

Introduction

One of the most basic analysis functions is grouping and aggregating data. In some cases, this level of analysis may be sufficient to answer business questions. In other instances, this activity might be the first step in a more complex data science analysis. In pandas, the groupby function can be combined with one or more aggregation functions to quickly and easily summarize data. This concept is deceptively simple and most new pandas users will understand this concept. However, they might be surprised at how useful complex aggregation functions can be for supporting sophisticated analysis.

This article will quickly summarize the basic pandas aggregation functions and show examples of more complex custom aggregations. Whether you are a new or more experienced pandas user, I think you will learn a few things from this article.

Aggregating

In the context of this article, an aggregation function is one which takes multiple individual values and returns a summary. In the majority of the cases, this summary is a single value.

The most common aggregation functions are a simple average or summation of values. As of pandas 0.20, you may call an aggregation function on one or more columns of a DataFrame.

Here’s a quick example of calculating the total and average fare using the Titanic dataset (loaded from seaborn):

import pandas as pd
import seaborn as sns

df = sns.load_dataset('titanic')

df['fare'].agg(['sum', 'mean'])
sum     28693.949300
mean       32.204208
Name: fare, dtype: float64

This simple concept is a necessary building block for more complex analysis.

One area that needs to be discussed is that there are multiple ways to call an aggregation function. As shown above, you may pass a list of functions to apply to one or more columns of data.

What if you want to perform the analysis on only a subset of columns? There are two other options for aggregations: using a dictionary or a named aggregation.

Here is a comparison of the the three options:

Pandas aggregation options

It is important to be aware of these options and know which one to use when.

Choosing an aggregation approach
As a general rule, I prefer to use dictionaries for aggregations.

The tuple approach is limited by only being able to apply one aggregation at a time to a specific column. If I need to rename columns, then I will use the rename function after the aggregations are complete. In some specific instances, the list approach is a useful shortcut. I will reiterate though, that I think the dictionary approach provides the most robust approach for the majority of situations.

Groupby

Now that we know how to use aggregations, we can combine this with groupby to summarize data.

Basic math

The most common built in aggregation functions are basic math functions including sum, mean, median, minimum, maximum, standard deviation, variance, mean absolute deviation and product.

We can apply all these functions to the fare while grouping by the embark_town :

agg_func_math = {
    'fare':
    ['sum', 'mean', 'median', 'min', 'max', 'std', 'var', 'mad', 'prod']
}
df.groupby(['embark_town']).agg(agg_func_math).round(2)
Basic math functions

This is all relatively straightforward math.

As an aside, I have not found a good usage for the prod function which computes the product of all the values in a group. For the sake of completeness, I am including it.

One other useful shortcut is to use describe to run multiple built-in aggregations at one time:

agg_func_describe = {'fare': ['describe']}
df.groupby(['embark_town']).agg(agg_func_describe).round(2)
Basic math functions

Counting

After basic math, counting is the next most common aggregation I perform on grouped data. In some ways, this can be a little more tricky than the basic math. Here are three examples of counting:

agg_func_count = {'embark_town': ['count', 'nunique', 'size']}
df.groupby(['deck']).agg(agg_func_count)
Basic math functions

The major distinction to keep in mind is that count will not include NaN values whereas size will. Depending on the data set, this may or may not be a useful distinction. In addition, the nunique function will exclude NaN values in the unique counts. Keep reading for an example of how to include NaN in the unique value counts.

First and last

In this example, we can select the highest and lowest fare by embarked town. One important point to remember is that you must sort the data first if you want first and last to pick the max and min values.

agg_func_selection = {'fare': ['first', 'last']}
df.sort_values(by=['fare'],
            ascending=False).groupby(['embark_town'
                                        ]).agg(agg_func_selection)
Basic math functions

In the example above, I would recommend using max and min but I am including first and last for the sake of completeness. In other applications (such as time series analysis) you may want to select the first and last values for further analysis.

Another selection approach is to use idxmax and idxmin to select the index value that corresponds to the maximum or minimum value.

agg_func_max_min = {'fare': ['idxmax', 'idxmin']}
df.groupby(['embark_town']).agg(agg_func_max_min)
Max and Min index

We can check the results:

df.loc[[258, 378]]
Idxmax

Here’s another shortcut trick you can use to see the rows with the max fare :

df.loc[df.groupby('class')['fare'].idxmax()]
Idxmax

The above example is one of those places where the list-based aggregation is a useful shortcut.

Other libraries

You are not limited to the aggregation functions in pandas. For instance, you could use stats functions from scipy or numpy.

Here is an example of calculating the mode and skew of the fare data.

from scipy.stats import skew, mode
agg_func_stats = {'fare': [skew, mode, pd.Series.mode]}
df.groupby(['embark_town']).agg(agg_func_stats)
Stats functions

The mode results are interesting. The scipy.stats mode function returns the most frequent value as well as the count of occurrences. If you just want the most frequent value, use pd.Series.mode.

The key point is that you can use any function you want as long as it knows how to interpret the array of pandas values and returns a single value.

Working with text

When working with text, the counting functions will work as expected. You can also use scipy’s mode function on text data.

One interesting application is that if you a have small number of distinct values, you can use python’s set function to display the full list of unique values.

This summary of the class and deck shows how this approach can be useful for some data sets.

agg_func_text = {'deck': [ 'nunique', mode, set]}
df.groupby(['class']).agg(agg_func_text)
Stats functions

Custom functions

The pandas standard aggregation functions and pre-built functions from the python ecosystem will meet many of your analysis needs. However, you will likely want to create your own custom aggregation functions. There are four methods for creating your own functions.

To illustrate the differences, let’s calculate the 25th percentile of the data using four approaches:

First, we can use a partial function:

from functools import partial
# Use partial
q_25 = partial(pd.Series.quantile, q=0.25)
q_25.__name__ = '25%'

Next, we define our own function (which is a small wrapper around quantile ):

# Define a function
def percentile_25(x):
    return x.quantile(.25)

We can define a lambda function and give it a name:

# Define a lambda function
lambda_25 = lambda x: x.quantile(.25)
lambda_25.__name__ = 'lambda_25%'

Or, define the lambda inline:

# Use a lambda function inline
agg_func = {
    'fare': [q_25, percentile_25, lambda_25, lambda x: x.quantile(.25)]
}

df.groupby(['embark_town']).agg(agg_func).round(2)
Custom agg functions

As you can see, the results are the same but the labels of the column are all a little different. This is an area of programmer preference but I encourage you to be familiar with the options since you will encounter most of these in online solutions.

Choosing an custom function style
I prefer to use custom functions or inline lambdas.

Like many other areas of programming, this is an element of style and preference but I encourage you to pick one or two approaches and stick with them for consistency.

Custom function examples

As shown above, there are multiple approaches to developing custom aggregation functions. I will go through a few specific useful examples to highlight how they are frequently used.

In most cases, the functions are lightweight wrappers around built in pandas functions. Part of the reason you need to do this is that there is no way to pass arguments to aggregations. Some examples should clarify this point.

If you want to count the number of null values, you could use this function:

def count_nulls(s):
    return s.size - s.count()

If you want to include NaN values in your unique counts, you need to pass dropna=False to the nunique function.

def unique_nan(s):
    return s.nunique(dropna=False)

Here is a summary of all the values together:

agg_func_custom_count = {
    'embark_town': ['count', 'nunique', 'size', unique_nan, count_nulls, set]
}
df.groupby(['deck']).agg(agg_func_custom_count)
Custom agg functions

If you want to calculate the 90th percentile, use quantile :

def percentile_90(x):
    return x.quantile(.9)

If you want to calculate a trimmed mean where the lowest 10th percent is excluded, use the scipy stats function trim_mean :

def trim_mean_10(x):
    return trim_mean(x, 0.1)

If you want the largest value, regardless of the sort order (see notes above about first and last :

def largest(x):
    return x.nlargest(1)

This is equivalent to max but I will show another example of nlargest below to highlight the difference.

I wrote about sparklines before. Refer to that article for install instructions. Here’s how to incorporate them into an aggregate function for a unique view of the data:

def sparkline_str(x):
    bins=np.histogram(x)[0]
    sl = ''.join(sparklines(bins))
    return sl

Here they are all put together:

agg_func_largest = {
    'fare': [percentile_90, trim_mean_10, largest, sparkline_str]
}
df.groupby(['class', 'embark_town']).agg(agg_func_largest)
Sparkline function

The nlargest and nsmallest functions can be useful for summarizing the data in various scenarios. Here is code to show the total fares for the top 10 and bottom 10 individuals:

def top_10_sum(x):
    return x.nlargest(10).sum()

def bottom_10_sum(x):
    return x.nsmallest(10).sum()


agg_func_top_bottom_sum = {
    'fare': [top_10_sum, bottom_10_sum]
}
df.groupby('class').agg(agg_func_top_bottom_sum)
Custom agg functions

Using this approach can be useful when applying the Pareto principle to your own data.

Custom functions with multiple columns

If you have a scenario where you want to run multiple aggregations across columns, then you may want to use the groupby combined with apply as described in this stack overflow answer.

Using this method, you will have access to all of the columns of the data and can choose the appropriate aggregation approach to build up your resulting DataFrame (including the column labels):

def summary(x):
    result = {
        'fare_sum': x['fare'].sum(),
        'fare_mean': x['fare'].mean(),
        'fare_range': x['fare'].max() - x['fare'].min()
    }
    return pd.Series(result).round(0)

df.groupby(['class']).apply(summary)
Custom agg functions

Using apply with groupy gives maximum flexibility over all aspects of the results. However, there is a downside. The apply function is slow so this approach should be used sparingly.

Working with group objects

Once you group and aggregate the data, you can do additional calculations on the grouped objects.

For the first example, we can figure out what percentage of the total fares sold can be attributed to each embark_town and class combination. We use assign and a lambda function to add a pct_total column:

df.groupby(['embark_town', 'class']).agg({
    'fare': 'sum'
}).assign(pct_total=lambda x: x / x.sum())
Percent of total

One important thing to keep in mind is that you can actually do this more simply using a pd.crosstab as described in my previous article:

pd.crosstab(df['embark_town'],
            df['class'],
            values=df['fare'],
            aggfunc='sum',
            normalize=True)
Crosstab example

While we are talking about crosstab , a useful concept to keep in mind is that agg functions can be combined with pivot tables too.

Here’s a quick example:

pd.pivot_table(data=df,
            index=['embark_town'],
            columns=['class'],
            aggfunc=agg_func_top_bottom_sum)
Custom agg functions with a pivot table

Sometimes you will need to do multiple groupby’s to answer your question. For instance, if we wanted to see a cumulative total of the fares, we can group and aggregate by town and class then group the resulting object and calculate a cumulative sum:

fare_group = df.groupby(['embark_town', 'class']).agg({'fare': 'sum'})
fare_group.groupby(level=0).cumsum()
Custom agg functions with cumulative sum

This may be a little tricky to understand. Here’s a summary of what we are doing:

Multiple groupby with cumulative sums

Here’s another example where we want to summarize daily sales data and convert it to a cumulative daily and quarterly view. Refer to the Grouper article if you are not familiar with using pd.Grouper() :

In the first example, we want to include a total daily sales as well as cumulative quarter amount:

sales = pd.read_excel('https://github.com/chris1610/pbpython/blob/master/data/2018_Sales_Total_v2.xlsx?raw=True')

daily_sales = sales.groupby([pd.Grouper(key='date', freq='D')
                            ]).agg(daily_sales=('ext price',
                                                'sum')).reset_index()
daily_sales['quarter_sales'] = daily_sales.groupby(
    pd.Grouper(key='date', freq='Q')).agg({'daily_sales': 'cumsum'})

To understand this, you need to look at the quarter boundary (end of March through start of April) to get a good sense of what is going on.

Cumulative total

If you want to just get a cumulative quarterly total, you can chain multiple groupby functions.

First, group the daily results, then group those results by quarter and use a cumulative sum:

sales.groupby([pd.Grouper(key='date', freq='D')
            ]).agg(daily_sales=('ext price', 'sum')).groupby(
                pd.Grouper(freq='Q')).agg({
                    'daily_sales': 'cumsum'
                }).rename(columns={'daily_sales': 'quarterly_sales'})
Cumulative quarterly total

In this example, I included the named aggregation approach to rename the variable to clarify that it is now daily sales. I then group again and use the cumulative sum to get a running sum for the quarter. Finally, I rename the column to quarterly sales.

Admittedly this is a bit tricky to understand. However, if you take it step by step and build out the function and inspect the results at each step, you will start to get the hang of it. Don’t be discouraged!

Flattening Hierarchical Column Indices

By default, pandas creates a hierarchical column index on the summary DataFrame. Here is what I am referring to:

df.groupby(['embark_town', 'class']).agg({'fare': ['sum', 'mean']}).round(0)
Hierarchical index

At some point in the analysis process you will likely want to “flatten” the columns so that there is a single row of names.

I have found that the following approach works best for me. I use the parameter as_index=False when grouping, then build a new collapsed column name.

Here is the code:

multi_df = df.groupby(['embark_town', 'class'],
                    as_index=False).agg({'fare': ['sum', 'mean']})

multi_df.columns = [
'_'.join(col).rstrip('_') for col in multi_df.columns.values
]

Here is a picture showing what the flattened frame looks like:

Flatten hierarchical columns

I prefer to use _ as my separator but you could use other values. Just keep in mind that it will be easier for your subsequent analysis if the resulting column names do not have spaces.

Subtotals

One process that is not straightforward with grouping and aggregating in pandas is adding a subtotal. If you want to add subtotals, I recommend the sidetable package. Here is how you can summarize fares by class , embark_town and sex with a subtotal at each level as well as a grand total at the bottom:

import sidetable
df.groupby(['class', 'embark_town', 'sex']).agg({'fare': 'sum'}).stb.subtotal()
Subtotal

sidetable also allows customization of the subtotal levels and resulting labels. Refer to the package documentation for more examples of how sidetable can summarize your data.

Summary

Thanks for reading this article. There is a lot of detail here but that is due to how many different uses there are for grouping and aggregating data with pandas. My hope is that this post becomes a useful resource that you can bookmark and come back to when you get stuck with a challenging problem of your own.

If you have other common techniques you use frequently please let me know in the comments. If I get some broadly useful ones, I will include in this post or as an updated article.

image credit: Herman Traub



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...