Monday, November 18, 2019

Pandas GroupBy: Your Guide to Grouping Data in Python

Whether you’ve just started working with Pandas and want to master one of its core facilities, or you’re looking to fill in some gaps in your understanding about .groupby(), this tutorial will help you to break down and visualize a Pandas GroupBy operation from start to finish.

This tutorial is meant to complement the official documentation, where you’ll see self-contained, bite-sized examples. Here, however, you’ll focus on three more involved walk-throughs that use real-world datasets.

In this tutorial, you’ll cover:

  • How to use Pandas GroupBy operations on real-world data
  • How the split-apply-combine chain of operations works
  • How to decompose the split-apply-combine chain into steps
  • How methods of a Pandas GroupBy object can be placed into different categories based on their intent and result

This tutorial assumes you have some experience with Pandas itself, including how to read CSV files into memory as Pandas objects with read_csv(). If you need a refresher, then check out Reading CSVs With Pandas.

You can download the source code for all the examples in this tutorial by clicking on the link below:

Housekeeping

All code in this tutorial was generated in a CPython 3.7.2 shell using Pandas 0.25.0. Before you proceed, make sure that you have the latest version of Pandas available within a new virtual environment:

$ python -m venv pandas-gb-tut
$ source ./pandas-gb-tut/bin/activate
$ python -m pip install pandas

The examples here also use a few tweaked Pandas options for friendlier output:

import pandas as pd

# Use 3 decimal places in output display
pd.set_option("display.precision", 3)

# Don't wrap repr(DataFrame) across additional lines
pd.set_option("display.expand_frame_repr", False)

# Set max rows displayed in output to 25
pd.set_option("display.max_rows", 25)

You can add these to a startup file to set them automatically each time you start up your interpreter.

In this tutorial, you’ll focus on three datasets:

  1. The U.S. Congress dataset contains public information on historical members of Congress and illustrates several fundamental capabilities of .groupby().
  2. The air quality dataset contains periodic gas sensor readings. This will allow you to work with floats and time series data.
  3. The news aggregator dataset holds metadata on several hundred thousand news articles. You’ll be working with strings and doing groupby-based text munging.


Once you’ve downloaded the .zip, you can unzip it to your current directory:

$ unzip -q -d groupby-data groupby-data.zip

The -d option lets you extract the contents to a new folder:

./
│
└── groupby-data/
    │
    ├── legislators-historical.csv
    ├── airqual.csv
    └── news.csv

With that set up, you’re ready to jump in!

Example 1: U.S. Congress Dataset

You’ll jump right into things by dissecting a dataset of historical members of Congress. You can read the CSV file into a Pandas DataFrame with read_csv():

import pandas as pd

dtypes = {
    "first_name": "category",
    "gender": "category",
    "type": "category",
    "state": "category",
    "party": "category",
}
df = pd.read_csv(
    "groupby-data/legislators-historical.csv",
    dtype=dtypes,
    usecols=list(dtypes) + ["birthday", "last_name"],
    parse_dates=["birthday"]
)

The dataset contains members’ first and last names, birth date, gender, type ("rep" for House of Representatives or "sen" for Senate), U.S. state, and political party. You can use df.tail() to view the last few rows of the dataset:

>>> df.tail()
      last_name first_name   birthday gender type state       party
11970   Garrett     Thomas 1972-03-27      M  rep    VA  Republican
11971    Handel      Karen 1962-04-18      F  rep    GA  Republican
11972     Jones     Brenda 1959-10-24      F  rep    MI    Democrat
11973    Marino        Tom 1952-08-15      M  rep    PA  Republican
11974     Jones     Walter 1943-02-10      M  rep    NC  Republican

The DataFrame uses categorical dtypes for space efficiency:

>>> df.dtypes
last_name             object
first_name          category
birthday      datetime64[ns]
gender              category
type                category
state               category
party               category
dtype: object

You can see that most columns of the dataset have the type category, which reduces the memory load on your machine.
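You can verify the savings yourself with .memory_usage(). Here’s a minimal sketch using a made-up column of repeated state codes:

```python
import pandas as pd

# Made-up column of repeated state codes
states = ["PA", "CA", "NY"] * 1000

as_object = pd.Series(states, dtype="object")
as_category = pd.Series(states, dtype="category")

# The categorical version stores each unique string once,
# plus a small integer code per row
assert as_category.memory_usage(deep=True) < as_object.memory_usage(deep=True)
```

The fewer unique values a column has relative to its length, the bigger the payoff from the category dtype.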

The “Hello, World!” of Pandas GroupBy

Now that you’re familiar with the dataset, you’ll start with a “Hello, World!” for the Pandas GroupBy operation. What is the count of Congressional members, on a state-by-state basis, over the entire history of the dataset? In SQL, you could find this answer with a SELECT statement:

SELECT state, count(name)
FROM df
GROUP BY state
ORDER BY state;

Here’s the near-equivalent in Pandas:

>>> n_by_state = df.groupby("state")["last_name"].count()
>>> n_by_state.head(10)
state
AK     16
AL    206
AR    117
AS      2
AZ     48
CA    361
CO     90
CT    240
DC      2
DE     97
Name: last_name, dtype: int64

You call .groupby() and pass the name of the column you want to group on, which is "state". Then, you use ["last_name"] to specify the columns on which you want to perform the actual aggregation.

You can pass a lot more than just a single column name to .groupby() as the first argument. You can also specify any of the following:

  • A list of multiple column names
  • A dict or Pandas Series
  • A NumPy array or Pandas Index, or an array-like iterable of these
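To preview the non-string options, here’s a minimal sketch with made-up data that groups a DataFrame by a dict. The dict maps row labels to group labels, and its values become the group names:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "TX", "CA", "NY"],
    "last_name": ["Lee", "Cruz", "Kim", "Diaz"],
})

# A dict maps each row label (here, 0 through 3) to a group label
region = {0: "west", 1: "south", 2: "west", 3: "east"}
counts = df.groupby(region)["last_name"].count()
# east: 1, south: 1, west: 2
```

A Pandas Series built from the same mapping behaves identically, because its values are aligned to the DataFrame’s index before grouping.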

Here’s an example of grouping jointly on two columns, which finds the count of Congressional members broken out by state and then by gender:

>>> df.groupby(["state", "gender"])["last_name"].count()
state  gender
AK     M          16
AL     F           3
       M         203
AR     F           5
       M         112
                ...
WI     M         196
WV     F           1
       M         119
WY     F           2
       M          38
Name: last_name, Length: 104, dtype: int64

The analogous SQL query would look like this:

SELECT state, gender, count(name)
FROM df
GROUP BY state, gender
ORDER BY state, gender;

As you’ll see next, .groupby() and the comparable SQL statements are close cousins, but they’re often not functionally identical.

Pandas GroupBy vs SQL

This is a good time to introduce one prominent difference between the Pandas GroupBy operation and the SQL query above. The result set of the SQL query contains three columns:

  1. state
  2. gender
  3. count

In the Pandas version, the grouped-on columns are pushed into the MultiIndex of the resulting Series by default:

>>> n_by_state_gender = df.groupby(["state", "gender"])["last_name"].count()
>>> type(n_by_state_gender)
<class 'pandas.core.series.Series'>
>>> n_by_state_gender.index[:5]
MultiIndex([('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M'),
            ('AR', 'F'),
            ('AR', 'M')],
           names=['state', 'gender'])

To more closely emulate the SQL result and push the grouped-on columns back into columns in the result, you can use as_index=False:

>>> df.groupby(["state", "gender"], as_index=False)["last_name"].count()
    state gender  last_name
0      AK      F        NaN
1      AK      M       16.0
2      AL      F        3.0
3      AL      M      203.0
4      AR      F        5.0
..    ...    ...        ...
111    WI      M      196.0
112    WV      F        1.0
113    WV      M      119.0
114    WY      F        2.0
115    WY      M       38.0

[116 rows x 3 columns]

This produces a DataFrame with three columns and a RangeIndex, rather than a Series with a MultiIndex. In short, using as_index=False will make your result more closely mimic the default SQL output for a similar operation.
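An alternative to as_index=False is to call .reset_index() on the grouped result after the fact. Here’s a sketch with a tiny made-up roster:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["AK", "AK", "AL"],
    "gender": ["M", "M", "F"],
    "last_name": ["Young", "Begich", "Hall"],
})

# The grouped result carries a MultiIndex of (state, gender)...
ser = df.groupby(["state", "gender"])["last_name"].count()

# ...which .reset_index() pushes back out into ordinary columns
flat = ser.reset_index()
```

With plain (non-categorical) key columns like these, only the observed state-gender combinations show up in the result.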

Also note that the SQL queries above explicitly use ORDER BY, whereas .groupby() does not. That’s because .groupby() sorts the output by the group keys by default through its sort parameter, which is True unless you tell it otherwise:

>>> # Don't sort results by the sort keys
>>> df.groupby("state", sort=False)["last_name"].count()
state
DE      97
VA     432
SC     251
MD     305
PA    1053
      ...
AK      16
PI      13
VI       4
GU       4
AS       2
Name: last_name, Length: 58, dtype: int64

Next, you’ll dive into the object that .groupby() actually produces.

How Pandas GroupBy Works

Before you get any further into the details, take a step back to look at .groupby() itself:

>>> by_state = df.groupby("state")
>>> print(by_state)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x107293278>

What is that DataFrameGroupBy thing? Its .__str__() doesn’t give you much insight into what it actually is or how it works. The reason that a DataFrameGroupBy object can be difficult to wrap your head around is that it’s lazy in nature: it doesn’t really do any operations to produce a useful result until you say so.

One term that’s frequently used alongside .groupby() is split-apply-combine. This refers to a chain of three steps:

  1. Split a table into groups
  2. Apply some operations to each of those smaller tables
  3. Combine the results

It can be difficult to inspect df.groupby("state") because it does virtually none of these things until you do something with the resulting object. Again, a Pandas GroupBy object is lazy. It delays virtually every part of the split-apply-combine process until you invoke a method on it.

So, how can you mentally separate the split, apply, and combine stages if you can’t see any of them happening in isolation? One useful way to inspect a Pandas GroupBy object and see the splitting in action is to iterate over it. This is implemented in DataFrameGroupBy.__iter__() and produces an iterator of (group, DataFrame) pairs for DataFrames:

>>> for state, frame in by_state:
...     print(f"First 2 entries for {state!r}")
...     print("------------------------")
...     print(frame.head(2), end="\n\n")
...
First 2 entries for 'AK'
------------------------
     last_name first_name   birthday gender type state        party
6619    Waskey      Frank 1875-04-20      M  rep    AK     Democrat
6647      Cale     Thomas 1848-09-17      M  rep    AK  Independent

First 2 entries for 'AL'
------------------------
    last_name first_name   birthday gender type state       party
912   Crowell       John 1780-09-18      M  rep    AL  Republican
991    Walker       John 1783-08-12      M  sen    AL  Republican

If you’re working on a challenging aggregation problem, then iterating over the Pandas GroupBy object can be a great way to visualize the split part of split-apply-combine.

There are a few other methods and properties that let you look into the individual groups and their splits. The .groups attribute will give you a dictionary of {group name: row labels} pairs. For example, by_state.groups is a dict with states as keys. Here’s the value for the "PA" key:

>>> by_state.groups["PA"]
Int64Index([    4,    19,    21,    27,    38,    57,    69,    76,    84,
               88,
            ...
            11842, 11866, 11875, 11877, 11887, 11891, 11932, 11945, 11959,
            11973],
           dtype='int64', length=1053)

Each value is a sequence of the index locations for the rows belonging to that particular group. In the output above, 4, 19, and 21 are the first indices in df at which the state equals “PA.”

You can also use .get_group() as a way to drill down to the sub-table from a single group:

>>> by_state.get_group("PA")
      last_name first_name   birthday gender type state                party
4        Clymer     George 1739-03-16      M  rep    PA                  NaN
19       Maclay    William 1737-07-20      M  sen    PA  Anti-Administration
21       Morris     Robert 1734-01-20      M  sen    PA   Pro-Administration
27      Wynkoop      Henry 1737-03-02      M  rep    PA                  NaN
38       Jacobs     Israel 1726-06-09      M  rep    PA                  NaN
...         ...        ...        ...    ...  ...   ...                  ...
11891     Brady     Robert 1945-04-07      M  rep    PA             Democrat
11932   Shuster       Bill 1961-01-10      M  rep    PA           Republican
11945   Rothfus      Keith 1962-04-25      M  rep    PA           Republican
11959  Costello       Ryan 1976-09-07      M  rep    PA           Republican
11973    Marino        Tom 1952-08-15      M  rep    PA           Republican

This is virtually equivalent to using .loc[]. You could get the same output with something like df.loc[df["state"] == "PA"].
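You can check that equivalence directly with .equals(). A minimal sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["PA", "AK", "PA"],
    "last_name": ["Clymer", "Waskey", "Maclay"],
})
by_state = df.groupby("state")

# Both selections return the same rows, in the same order
sub_group = by_state.get_group("PA")
sub_loc = df.loc[df["state"] == "PA"]
assert sub_group.equals(sub_loc)
```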

Next, what about the apply part? You can think of this step of the process as applying the same operation (or callable) to every “sub-table” that is produced by the splitting stage. (I don’t know if “sub-table” is the technical term, but I haven’t found a better one 🤷‍♂️)

From the Pandas GroupBy object by_state, you can grab the initial U.S. state and DataFrame with next(). When you iterate over a Pandas GroupBy object, you’ll get pairs that you can unpack into two variables:

>>> state, frame = next(iter(by_state))  # First tuple from iterator
>>> state
'AK'
>>> frame.head(3)
     last_name first_name   birthday gender type state        party
6619    Waskey      Frank 1875-04-20      M  rep    AK     Democrat
6647      Cale     Thomas 1848-09-17      M  rep    AK  Independent
7442   Grigsby     George 1874-12-02      M  rep    AK          NaN

Now, think back to your original, full operation:

>>> df.groupby("state")["last_name"].count()
state
AK      16
AL     206
AR     117
AS       2
AZ      48
...

The apply stage, when applied to your single, subsetted DataFrame, would look like this:

>>> frame["last_name"].count()  # Count for state == 'AK'
16

You can see that the result, 16, matches the value for AK in the combined result.

The last step, combine, is the most self-explanatory. It simply takes the results of all of the applied operations on all of the sub-tables and combines them back together in an intuitive way.
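To make the three stages concrete, here’s a sketch that spells them out by hand on a toy DataFrame and checks the result against the usual one-liner:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["AK", "AL", "AK", "AL", "AL"],
    "last_name": ["Waskey", "Crowell", "Cale", "Walker", "Hall"],
})

# Split: one sub-table per state
pieces = dict(iter(df.groupby("state")))

# Apply: run the same reduction on each sub-table
applied = {state: frame["last_name"].count() for state, frame in pieces.items()}

# Combine: stitch the per-group results back into one Series
combined = pd.Series(applied)

# Same numbers as the one-liner
assert combined.to_dict() == df.groupby("state")["last_name"].count().to_dict()
```

Of course, .groupby() does all of this for you in one pass; spelling it out is only useful for building a mental model.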

Example 2: Air Quality Dataset

The air quality dataset contains hourly readings from a gas sensor device in Italy. Missing values are denoted with -200 in the CSV file. You can use read_csv() to combine two columns into a timestamp while using a subset of the other columns:

import pandas as pd

df = pd.read_csv(
    "groupby-data/airqual.csv",
    parse_dates=[["Date", "Time"]],
    na_values=[-200],
    usecols=["Date", "Time", "CO(GT)", "T", "RH", "AH"]
).rename(
    columns={
        "CO(GT)": "co",
        "Date_Time": "tstamp",
        "T": "temp_c",
        "RH": "rel_hum",
        "AH": "abs_hum",
    }
).set_index("tstamp")

This produces a DataFrame with a DatetimeIndex and four float columns:

>>> df.head()
                      co  temp_c  rel_hum  abs_hum
tstamp
2004-03-10 18:00:00  2.6    13.6     48.9    0.758
2004-03-10 19:00:00  2.0    13.3     47.7    0.726
2004-03-10 20:00:00  2.2    11.9     54.0    0.750
2004-03-10 21:00:00  2.2    11.0     60.0    0.787
2004-03-10 22:00:00  1.6    11.2     59.6    0.789

Here, co is that hour’s average carbon monoxide reading, while temp_c, rel_hum, and abs_hum are the average temperature in Celsius, relative humidity, and absolute humidity over that hour, respectively. The observations run from March 2004 through April 2005:

>>> df.index.min()
Timestamp('2004-03-10 18:00:00')
>>> df.index.max()
Timestamp('2005-04-04 14:00:00')

So far, you’ve grouped on columns by specifying their names as str, such as df.groupby("state"). But .groupby() is a whole lot more flexible than this! You’ll see how next.

Grouping on Derived Arrays

Earlier you saw that the first parameter to .groupby() can accept several different arguments:

  • A column or list of columns
  • A dict or Pandas Series
  • A NumPy array or Pandas Index, or an array-like iterable of these

You can take advantage of the last option in order to group by the day of the week. You can use the index’s .day_name() to produce a Pandas Index of strings. Here are the first ten observations:

>>> day_names = df.index.day_name()
>>> type(day_names)
<class 'pandas.core.indexes.base.Index'>
>>> day_names[:10]
Index(['Wednesday', 'Wednesday', 'Wednesday', 'Wednesday', 'Wednesday',
       'Wednesday', 'Thursday', 'Thursday', 'Thursday', 'Thursday'],
      dtype='object', name='tstamp')

You can then take this object and use it as the .groupby() key. In Pandas-speak, day_names is array-like. It’s a one-dimensional sequence of labels.

Now, pass that object to .groupby() to find the average carbon monoxide (co) reading by day of the week:

>>> df.groupby(day_names)["co"].mean()
tstamp
Friday       2.543
Monday       2.017
Saturday     1.861
Sunday       1.438
Thursday     2.456
Tuesday      2.382
Wednesday    2.401
Name: co, dtype: float64

The split-apply-combine process behaves largely the same as before, except that the splitting this time is done on an artificially-created column. This column doesn’t exist in the DataFrame itself, but rather is derived from it.
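Here’s a self-contained sketch of the same idea, using a few hypothetical hourly readings (2004-03-10 fell on a Wednesday):

```python
import pandas as pd

# Hypothetical readings; 2004-03-10 was a Wednesday
idx = pd.to_datetime([
    "2004-03-10 18:00", "2004-03-10 19:00", "2004-03-11 10:00",
])
df = pd.DataFrame({"co": [2.0, 3.0, 5.0]}, index=idx)

# The grouping key is derived from the index, not taken from a column
means = df.groupby(idx.day_name())["co"].mean()
# Thursday: 5.0, Wednesday: 2.5
```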

What if you wanted to group not just by day of the week, but by hour of the day? That result should have 7 * 24 = 168 observations. To accomplish that, you can pass a list of array-like objects. In this case, you’ll pass Pandas Int64Index objects:

>>> hr = df.index.hour
>>> df.groupby([day_names, hr])["co"].mean().rename_axis(["dow", "hr"])
dow        hr
Friday     0     1.936
           1     1.609
           2     1.172
           3     0.887
           4     0.823
                 ...
Wednesday  19    4.147
           20    3.845
           21    2.898
           22    2.102
           23    1.938
Name: co, Length: 168, dtype: float64

Here’s one more similar case that uses .cut() to bin the temperature values into discrete intervals:

>>> bins = pd.cut(df["temp_c"], bins=3, labels=("cool", "warm", "hot"))
>>> df[["rel_hum", "abs_hum"]].groupby(bins).agg(["mean", "median"])
       rel_hum        abs_hum
          mean median    mean median
temp_c
cool    57.651   59.2   0.666  0.658
warm    49.383   49.3   1.183  1.145
hot     24.994   24.1   1.293  1.274

In this case, bins is actually a Series:

>>> type(bins)
<class 'pandas.core.series.Series'>
>>> bins.head()
tstamp
2004-03-10 18:00:00    cool
2004-03-10 19:00:00    cool
2004-03-10 20:00:00    cool
2004-03-10 21:00:00    cool
2004-03-10 22:00:00    cool
Name: temp_c, dtype: category
Categories (3, object): [cool < warm < hot]

Whether it’s a Series, NumPy array, or list doesn’t matter. What’s important is that bins still serves as a sequence of labels, one of cool, warm, or hot. If you really wanted to, then you could also use a Categorical array or even a plain-old list:

  • Native Python list: df.groupby(bins.tolist())
  • Pandas Categorical array: df.groupby(bins.values)

As you can see, .groupby() is smart and can handle a lot of different input types. Any of these would produce the same result because all of them function as a sequence of labels on which to perform the grouping and splitting.
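A quick sketch with made-up numbers confirms that the container type of the labels doesn’t matter:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"co": [1.0, 2.0, 3.0, 4.0]})
labels = ["low", "high", "low", "high"]

# A plain list, a NumPy array, and a Series all serve as the same labels
r_list = df.groupby(labels)["co"].mean()
r_array = df.groupby(np.array(labels))["co"].mean()
r_series = df.groupby(pd.Series(labels))["co"].mean()

assert r_list.to_dict() == r_array.to_dict() == r_series.to_dict()
```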

Resampling

You’ve grouped df by the day of the week with df.groupby(day_names)["co"].mean(). Now consider something different. What if you wanted to group by an observation’s year and quarter? Here’s one way to accomplish that:

>>> # See an easier alternative below
>>> df.groupby([df.index.year, df.index.quarter])["co"].agg(
...     ["max", "min"]
... ).rename_axis(["year", "quarter"])
               max  min
year quarter
2004 1         8.1  0.3
     2         7.3  0.1
     3         7.5  0.1
     4        11.9  0.1
2005 1         8.7  0.1
     2         5.0  0.3

This whole operation can, alternatively, be expressed through resampling. One of the uses of resampling is as a time-based groupby. All that you need to do is pass a frequency string, such as "Q" for "quarterly", and Pandas will do the rest:

>>> df.resample("Q")["co"].agg(["max", "min"])
             max  min
tstamp
2004-03-31   8.1  0.3
2004-06-30   7.3  0.1
2004-09-30   7.5  0.1
2004-12-31  11.9  0.1
2005-03-31   8.7  0.1
2005-06-30   5.0  0.3

Often, when you use .resample() you can express time-based grouping operations in a much more succinct manner. The result may be a tiny bit different than the more verbose .groupby() equivalent, but you’ll often find that .resample() gives you exactly what you’re looking for.
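If you’d rather stay within .groupby() while still bucketing by time, pd.Grouper offers a middle ground. Here’s a sketch with a few hypothetical readings, one per quarter:

```python
import pandas as pd

# Hypothetical readings, one per quarter
idx = pd.to_datetime([
    "2004-03-31", "2004-04-01", "2004-07-01", "2004-10-01",
])
df = pd.DataFrame({"co": [8.1, 7.3, 7.5, 11.9]}, index=idx)

# pd.Grouper keeps time-based bucketing inside a regular .groupby() call
# (newer Pandas versions may prefer the "QE" alias over "Q")
out = df.groupby(pd.Grouper(freq="Q"))["co"].max()
```

Since each reading falls in a different quarter here, every row becomes its own group.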

Example 3: News Aggregator Dataset

Now you’ll work with the third and final dataset, which holds metadata on several hundred thousand news articles and groups them into topic clusters:

import datetime as dt
import pandas as pd

def parse_millisecond_timestamp(ts: int) -> dt.datetime:
    """Convert ms since Unix epoch to UTC datetime instance."""
    return dt.datetime.fromtimestamp(ts / 1000, tz=dt.timezone.utc)

df = pd.read_csv(
    "groupby-data/news.csv",
    sep="\t",
    header=None,
    index_col=0,
    names=["title", "url", "outlet", "category", "cluster", "host", "tstamp"],
    parse_dates=["tstamp"],
    date_parser=parse_millisecond_timestamp,
    dtype={
        "outlet": "category",
        "category": "category",
        "cluster": "category",
        "host": "category",
    },
)

To read it into memory with the proper dtypes, you need a helper function to parse the timestamp column. This is because it’s expressed as the number of milliseconds since the Unix epoch, rather than the conventional fractional seconds. Similar to what you did before, you can use the Categorical dtype to efficiently encode columns that have a relatively small number of unique values relative to the column length.

Each row of the dataset contains the title, URL, publishing outlet’s name, and domain, as well as the publish timestamp. cluster is a random ID for the topic cluster to which an article belongs. category is the news category and contains the following options:

  • b for business
  • t for science and technology
  • e for entertainment
  • m for health

Here’s the first row:

>>> df.iloc[0]
title       Fed official says wea...
url         http://www.latimes.co...
outlet             Los Angeles Times
category                           b
cluster     ddUyU0VZz0BRneMioxUPQ...
host                 www.latimes.com
tstamp      2014-03-10 16:52:50.6...
Name: 1, dtype: object

Now that you’ve had a glimpse of the data, you can begin to ask more complex questions about it.

Using Lambda Functions in .groupby()

This dataset invites a lot more potentially involved questions. I’ll throw a random but meaningful one out there: which outlets talk most about the Federal Reserve? Let’s assume for simplicity that this entails searching for case-sensitive mentions of "Fed". Bear in mind that this may generate some false positives with terms like “Federal Government.”

To count mentions by outlet, you can call .groupby() on the outlet, and then quite literally .apply() a function on each group:

>>> df.groupby("outlet", sort=False)["title"].apply(
...     lambda ser: ser.str.contains("Fed").sum()
... ).nlargest(10)
outlet
Reuters                         161
NASDAQ                          103
Businessweek                     93
Investing.com                    66
Wall Street Journal \(blog\)     61
MarketWatch                      56
Moneynews                        55
Bloomberg                        53
GlobalPost                       51
Economic Times                   44
Name: title, dtype: int64

Let’s break this down since there are several method calls made in succession. Like before, you can pull out the first group and its corresponding Pandas object by taking the first tuple from the Pandas GroupBy iterator:

>>> title, ser = next(iter(df.groupby("outlet", sort=False)["title"]))
>>> title
'Los Angeles Times'
>>> ser.head()
1       Fed official says weak data caused by weather,...
486            Stocks fall on discouraging news from Asia
1124    Clues to Genghis Khan's rise, written in the r...
1146    Elephants distinguish human voices by sex, age...
1237    Honda splits Acura into its own division to re...
Name: title, dtype: object

In this case, ser is a Pandas Series rather than a DataFrame. That’s because you followed up the .groupby() call with ["title"]. This effectively selects that single column from each sub-table.

Next comes .str.contains("Fed"). This returns a Boolean Series that is True when an article title registers a match on the search. Sure enough, the first row starts with "Fed official says weak data caused by weather,..." and lights up as True:

>>> ser.str.contains("Fed")
1          True
486       False
1124      False
1146      False
1237      False
          ...
421547    False
421584    False
421972    False
422226    False
422905    False
Name: title, Length: 1976, dtype: bool

The next step is to .sum() this Series. Since bool is technically just a specialized type of int, you can sum a Series of True and False just as you would sum a sequence of 1 and 0:

>>> ser.str.contains("Fed").sum()
17

The result is the number of mentions of "Fed" by the Los Angeles Times in the dataset. The same routine gets applied for Reuters, NASDAQ, Businessweek, and the rest of the lot.

Improving the Performance of .groupby()

Let’s backtrack again to .groupby(...).apply() to see why this pattern can be suboptimal. To get some background information, check out How to Speed Up Your Pandas Projects. What may happen with .apply() is that it will effectively perform a Python loop over each group. While the .groupby(...).apply() pattern can provide some flexibility, it can also inhibit Pandas from otherwise using its Cython-based optimizations.

All that is to say that whenever you find yourself thinking about using .apply(), ask yourself if there’s a way to express the operation in a vectorized way. In that case, you can take advantage of the fact that .groupby() accepts not just one or more column names, but also many array-like structures:

  • A 1-dimensional NumPy array
  • A list
  • A Pandas Series or Index

Also note that .groupby() is a valid instance method for a Series, not just a DataFrame, so you can essentially invert the splitting logic. With that in mind, you can first construct a Series of Booleans that indicates whether or not the title contains "Fed":

>>> mentions_fed = df["title"].str.contains("Fed")
>>> type(mentions_fed)
<class 'pandas.core.series.Series'>

Now, .groupby() is also a method of Series, so you can group one Series on another:

>>> import numpy as np
>>> mentions_fed.groupby(
...     df["outlet"], sort=False
... ).sum().nlargest(10).astype(np.uintc)
outlet
Reuters                         161
NASDAQ                          103
Businessweek                     93
Investing.com                    66
Wall Street Journal \(blog\)     61
MarketWatch                      56
Moneynews                        55
Bloomberg                        53
GlobalPost                       51
Economic Times                   44
Name: title, dtype: uint32

The two Series don’t need to be columns of the same DataFrame object. They just need to be of the same shape:

>>> mentions_fed.shape
(422419,)
>>> df["outlet"].shape
(422419,)

Finally, you can cast the result back to an unsigned integer with np.uintc if you’re determined to get the most compact result possible. Here’s a head-to-head comparison of the two versions that will produce the same result:

# Version 1: using `.apply()`
df.groupby("outlet", sort=False)["title"].apply(
    lambda ser: ser.str.contains("Fed").sum()
).nlargest(10)

# Version 2: using vectorization
mentions_fed.groupby(
    df["outlet"], sort=False
).sum().nlargest(10).astype(np.uintc)

On my laptop, Version 1 takes 4.01 seconds, while Version 2 takes just 292 milliseconds. This is an impressive 14x difference in CPU time for a few hundred thousand rows. Consider how dramatic the difference becomes when your dataset grows to a few million rows!
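You can reproduce a comparison like this yourself with timeit. Here’s a sketch on a small made-up dataset, so the absolute numbers will differ from those above:

```python
import timeit

import pandas as pd

# Small stand-in dataset; the real comparison used several hundred thousand rows
df = pd.DataFrame({
    "outlet": ["A", "B"] * 1000,
    "title": ["Fed hikes rates", "Other news"] * 1000,
})
mentions_fed = df["title"].str.contains("Fed")

t_apply = timeit.timeit(
    lambda: df.groupby("outlet", sort=False)["title"].apply(
        lambda ser: ser.str.contains("Fed").sum()
    ),
    number=20,
)
t_vector = timeit.timeit(
    lambda: mentions_fed.groupby(df["outlet"], sort=False).sum(),
    number=20,
)
```

Both routes return the same per-outlet counts; only the amount of Python-level looping differs.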

Pandas GroupBy: Putting It All Together

If you call dir() on a Pandas GroupBy object, then you’ll see enough methods there to make your head spin! It can be hard to keep track of all of the functionality of a Pandas GroupBy object. One way to clear the fog is to compartmentalize the different methods into what they do and how they behave.

Broadly, methods of a Pandas GroupBy object fall into a handful of categories:

  1. Aggregation methods (also called reduction methods) “smush” many data points into an aggregated statistic about those data points. An example is to take the sum, mean, or median of 10 numbers, where the result is just a single number.

  2. Filter methods come back to you with a subset of the original DataFrame. This most commonly means using .filter() to drop entire groups based on some comparative statistic about that group and its sub-table. It also makes sense to include under this definition a number of methods that exclude particular rows from each group.

  3. Transformation methods return a DataFrame with the same shape and indices as the original, but with different values. With both aggregation and filter methods, the resulting DataFrame will commonly be smaller in size than the input DataFrame. This is not true of a transformation, which transforms individual values themselves but retains the shape of the original DataFrame.

  4. Meta methods are less concerned with the original object on which you called .groupby(), and more focused on giving you high-level information such as the number of groups and indices of those groups.

  5. Plotting methods mimic the API of plotting for a Pandas Series or DataFrame, but typically break the output into multiple subplots.

The official documentation has its own explanation of these categories. They are, to some degree, open to interpretation, and this tutorial might diverge in slight ways in classifying which method falls where.
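A toy DataFrame makes the first three categories concrete (the column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 30]})
grouped = df.groupby("key")["val"]

# Aggregation: one number per group
totals = grouped.sum()                           # a -> 3, b -> 30

# Transformation: same shape and index as the input
demeaned = grouped.transform(lambda s: s - s.mean())

# Filter: drop whole groups that fail a test on the group itself
big_groups = df.groupby("key").filter(lambda sub: sub["val"].sum() > 10)
```

Meta and plotting methods, by contrast, describe or visualize the groups rather than producing a new table.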


There are a few methods of Pandas GroupBy objects that don’t fall nicely into the categories above. These methods usually produce an intermediate object that is not a DataFrame or Series. For instance, df.groupby(...).rolling(...) produces a RollingGroupby object, which you can then call aggregation, filter, or transformation methods on:

  • .expanding()
  • .pipe()
  • .resample()
  • .rolling()

Conclusion

In this tutorial, you’ve covered a ton of ground on .groupby(), including its design, its API, and how to chain methods together to get data in an output that suits your purpose.

You’ve learned:

  • How to use Pandas GroupBy operations on real-world data
  • How the split-apply-combine chain of operations works and how you can decompose it into steps
  • How methods of a Pandas GroupBy can be placed into different categories based on their intent and result

There is much more to .groupby() than you can cover in one tutorial. Check out the resources below and use the example datasets here as a starting point for further exploration!


More Resources on Pandas GroupBy

Pandas documentation guides are user-friendly walk-throughs to different aspects of Pandas. Here are some portions of the documentation that you can check out to learn more about Pandas GroupBy:

The API documentation is a fuller technical reference to methods and objects:





from Real Python