Introduction
Working with time-series data is often a challenge in its own right. It is a special kind of data, where data points depend on each other across time. When analyzing it, how quickly you gain insights depends to a large extent on how well you can manipulate the time dimension.
Very often, time-series data are collected over long periods, especially when they come from hardware devices or represent sequences of, for example, financial transactions. Furthermore, even when no field in the dataset is null, the data may still be problematic if the timestamps are irregularly spaced, shifted, missing, or otherwise inconsistent.
One of the key skills for extracting useful information from time-dependent data is performing aggregations efficiently. Not only does it greatly reduce the total volume of the data, it also helps to spot interesting patterns faster.
In this article, I would like to present a few ways Pandas, the most popular Python library for data analysis, can help you perform these aggregations, and what is special about working with time. For reference, I will also show the equivalent syntax in SQL.
The data
For demonstration, I use the credit card transaction dataset from Kaggle. For simplicity, I focus on the `Amount` column and filter the data down to a single user, although the aggregations can always be extended to include more criteria. Information about time is spread across the `Year`, `Month`, `Day`, and `Time` columns, so it makes sense to combine it into a single column instead.
Since the whole dataset weighs around 2.35 GB, let’s transform the data on the fly using smaller batches.
```python
import pandas as pd
from tqdm import tqdm
from pathlib import Path

SRC = Path("data/credit_card_transactions-ibm_v2.csv")
DST = Path("data/transactions.csv")
USER = 0


def load(filepath=SRC):
    data = pd.read_csv(
        filepath,
        iterator=True,
        chunksize=10000,
        usecols=["User", "Year", "Month", "Day", "Time", "Amount"],
    )
    for df in tqdm(data):
        yield df


def process(df):
    _df = df.query("User == @USER").copy()
    ts = _df.apply(
        lambda x: f"{x['Year']}{x['Month']:02d}{x['Day']:02d} {x['Time']}",
        axis=1,
    )
    _df["timestamp"] = pd.to_datetime(ts)
    _df["amount"] = _df["Amount"].str.strip("$").astype(float)
    return _df.get(["timestamp", "amount"])


def main():
    for i, df in enumerate(load()):
        df = process(df)
        df.to_csv(
            DST,
            mode="a" if i else "w",
            header=not bool(i),
            index=False,
        )


if __name__ == "__main__":
    main()
```
| timestamp | amount |
|---|---|
| 2002-09-01 06:21:00 | 134.09 |
| 2002-09-01 06:42:00 | 38.48 |
| 2002-09-02 06:22:00 | 120.34 |
| 2002-09-02 17:45:00 | 128.95 |
| 2002-09-03 06:23:00 | 104.71 |
The “head” of this frame gives us the table above. For a single user (here `USER = 0`), we have almost 20k timestamps that mark transactions between 2002 and 2020 with one-minute resolution.
Thanks to `pd.to_datetime` in the `process` function, we convert the data concatenated from four columns and store it as an `np.datetime64` value, a single unified data type for describing time.
What is np.datetime64?
The `np.datetime64` type (doc) is a NumPy version of the pythonic `datetime.datetime` object. It is vectorized, which makes it possible to perform operations over entire arrays quickly. At the same time, the object supports the typical `datetime` methods (doc) that make manipulating the values feel natural.
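As a quick illustration (a minimal sketch using two made-up timestamps in the style of this dataset), arithmetic on `np.datetime64` arrays is vectorized and differences come back as `np.timedelta64` values:

```python
import numpy as np

# two timestamps at minute resolution
ts = np.array(["2002-09-01T06:21", "2002-09-01T06:42"], dtype="datetime64[m]")

# vectorized arithmetic: shift every timestamp by one day at once
shifted = ts + np.timedelta64(1, "D")  # → 2002-09-02T06:21, 2002-09-02T06:42

# subtracting timestamps yields np.timedelta64 values
delta = ts[1] - ts[0]  # → 21 minutes
```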
On the pandas side, the relevant objects are `Timestamp`, `Timedelta`, and `Period` (with the corresponding `DatetimeIndex`, `TimedeltaIndex`, and `PeriodIndex`), which describe moments in time, time shifts, and time spans, respectively. Underneath, however, they are still `np.datetime64`s (and the analogous `np.timedelta64`s) with all their handy properties.
Converting time-related values to these objects is the best starting point for any time-series analysis. It is convenient, and it is fast.
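For instance (a toy example built on one of the timestamps from the table above), the three pandas objects compose naturally:

```python
import pandas as pd

ts = pd.Timestamp("2002-09-01 06:21:00")  # a moment in time
td = pd.Timedelta("90min")                # a time shift
per = ts.to_period("D")                   # the containing day as a time span

shifted = ts + td           # Timestamp shifted by 90 minutes
day_start = per.start_time  # first instant of that day
contained = per.start_time <= ts <= per.end_time  # the span covers ts
```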
Basic resampling
The simplest form of time-series aggregation is to feed values into evenly spaced bins using an aggregating function. It helps adjust both the resolution and the volume of the data.
The following snippet shows an example of resampling to days using two functions: sum and count:
```sql
SELECT
    SUM(amount),
    COUNT(amount),
    DATE(timestamp) AS dt
FROM transactions
GROUP BY dt;
```
Pandas provides at least two ways to achieve the same result:
```python
# option 1
df["amount"].resample("D").agg(["sum", "count"])

# option 2
df["amount"].groupby(pd.Grouper(level=0, freq="D")).agg(["sum", "count"])
```
Both options are equivalent. The first is simpler and relies on the fact that the `timestamp` column has been set as the dataframe’s index, although it is also possible to use the optional `on` argument to point to a particular column. The second uses a more generic aggregation object, `pd.Grouper`, in combination with the `.groupby` method. It is highly customizable through many optional arguments. Here, I am using `level` as opposed to `key`, because `timestamp` is an index. Also, `freq="D"` stands for days. There are other frequency codes too, although the analogous SQL statement may be more complicated.
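To illustrate a few other frequency codes, here is a toy series standing in for the transaction amounts (not the actual dataset):

```python
import pandas as pd

# six fake transactions, two per day, at 12-hour spacing
idx = pd.date_range("2002-09-01", periods=6, freq="12h")
s = pd.Series([134.09, 38.48, 120.34, 128.95, 104.71, 50.00], index=idx)

# "D" = calendar days; other codes include "h" (hours), "W" (weeks),
# and month-end resampling ("M" or "ME", depending on the pandas version)
daily = s.resample("D").agg(["sum", "count"])
```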
Aggregations over several time spans
Say you want to aggregate data over multiple parts of the timestamp, such as `(year, week)` or `(month, day-of-week, hour)`. Because `timestamp` is of `np.datetime64` type, it is possible to refer to its methods using the so-called `.dt` accessor and use them as aggregation instructions.
In SQL, you would do:
```sql
SELECT
    AVG(amount),
    STRFTIME('%Y %W', timestamp) AS yearweek
FROM transactions
GROUP BY yearweek;
```
Here are two ways to do it in Pandas:
```python
# if we want `timestamp` to be a column
df = df.reset_index()
df["amount"].groupby(by=[
    df["timestamp"].dt.year,
    df["timestamp"].dt.isocalendar().week,
]).mean()

# if we want `timestamp` to be the index
df = df.set_index("timestamp")
df["amount"].groupby(by=[
    df.index.year,
    df.index.isocalendar().week,
]).mean()
```
They do the same thing.
| (year, week) | amount |
|---|---|
| (2002, 1) | 40.7375 |
| (2002, 35) | 86.285 |
| (2002, 36) | 82.3733 |
| (2002, 37) | 72.2048 |
| (2002, 38) | 91.8647 |
It is also worth mentioning that the `.groupby` method does not enforce using an aggregating function. All it does is slice the frame into a series of sub-frames. You may just as well take the individual “sub-frames” and perform some transformations on them directly. If that is the case, just iterate:
```python
for key, group in df.groupby(by=[df.index.year, df.index.isocalendar().week]):
    pass
```
Here, `key` will be a `(year, week)` tuple and `group` will be a sub-frame.
Remark
It is important to mention that the boundaries of the time windows may be defined differently in different flavors of SQL and in Pandas. When I used SQLite for comparison, each gave a slightly different result.
SQL:
```sql
SELECT
    timestamp,
    STRFTIME('%Y', timestamp) AS year,
    STRFTIME('%W', timestamp) AS week,
    STRFTIME('%w', timestamp) AS day
FROM transactions
LIMIT 5;
```
| timestamp | year | week | day |
|---|---|---|---|
| 2002-09-01 06:21:00 | 2002 | 34 | 0 |
| 2002-09-01 06:42:00 | 2002 | 34 | 0 |
| 2002-09-02 06:22:00 | 2002 | 35 | 1 |
| 2002-09-02 17:45:00 | 2002 | 35 | 1 |
| 2002-09-03 06:23:00 | 2002 | 35 | 2 |
Pandas:
```python
df.index.isocalendar().head()
```
| timestamp | year | week | day |
|---|---|---|---|
| 2002-09-01 06:21:00 | 2002 | 35 | 7 |
| 2002-09-01 06:42:00 | 2002 | 35 | 7 |
| 2002-09-02 06:22:00 | 2002 | 36 | 1 |
| 2002-09-02 17:45:00 | 2002 | 36 | 1 |
| 2002-09-03 06:23:00 | 2002 | 36 | 2 |
The concept is the same, but the reference is different.
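The difference can be checked on a single value: SQLite’s `%W` counts weeks from the year’s first Monday (starting at week 0), while `isocalendar()` follows the ISO-8601 convention, in which a Sunday belongs to the week of the preceding Monday. A minimal sketch, assuming a pandas version where `Timestamp` supports both methods:

```python
import pandas as pd

ts = pd.Timestamp("2002-09-01 06:21:00")  # a Sunday

# SQLite-style week number, as produced by STRFTIME('%W', ...)
sqlite_week = ts.strftime("%W")  # → "34"

# ISO-8601 week number, as produced by isocalendar()
iso_week = ts.isocalendar()[1]   # → 35
```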
Window functions
The last type of aggregation commonly used with time data is the rolling window. As opposed to grouping rows by the values of specific columns, this method defines an interval of rows, picks them out as a sub-table, applies a function, then shifts the window and repeats.
Let’s see an example of calculating a moving average of five consecutive rows (the current plus four into the past). In SQL, the syntax is the following:
```sql
SELECT
    timestamp,
    AVG(amount) OVER (
        ORDER BY timestamp
        ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
    ) rolling_avg
FROM transactions;
```
Pandas offers a much simpler syntax:
```python
# applying mean immediately
df["amount"].rolling(5).mean()

# accessing the chunks directly
for chunk in df["amount"].rolling(5):
    pass
```
Again, in Pandas, different adjustments can be made using optional arguments. The size of the window is dictated by the `window` argument, which in SQL takes a whole `ROWS BETWEEN ...` clause. In addition, we may want to center the window, use a different window type (e.g. weighted averaging), or perform optional data cleaning. The `Rolling` object returned by the `.rolling` method is, in a sense, used similarly to the `DataFrameGroupBy` objects.
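A short sketch of those optional arguments on toy data (the `win_type` line requires SciPy, so it is left commented out):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# center the label in the middle of the window instead of at its right edge;
# only index 2 has a full 5-row window here
centered = s.rolling(5, center=True).mean()

# relax the "full window" requirement at the edges
partial = s.rolling(5, min_periods=1).mean()  # partial[0] is just 1.0

# weighted averaging via a window shape (requires scipy):
# weighted = s.rolling(5, win_type="triang").mean()
```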
Conclusions
Here, I presented three types of aggregations I frequently use when working with time-series data. Although not all data that contains information about time is a time-series, for time-series it is almost always beneficial to convert the time information into `pd.Timestamp` or similar objects that build on NumPy’s `np.datetime64` underneath. As shown, it makes aggregating across different time properties convenient, intuitive, and even fun.
from Planet Python