Introduction
Working with time-series data is often a challenge in its own right. It is a special kind of data, where data points depend on each other across time. When analyzing it, how quickly you gain insights depends to a large extent on how well you can manipulate the time dimension.
Very often, time-series data are collected over long periods, especially when they come from hardware devices or represent sequences of, for example, financial transactions. Furthermore, even when no field in the dataset is null, the data may still be problematic if the timestamps are irregularly spaced, shifted, missing, or otherwise inconsistent.
One of the key skills for extracting useful information from time-dependent data is performing aggregations efficiently. Not only does it greatly reduce the total volume of the data, it also helps to spot interesting patterns faster.
In this article, I would like to present a few ways Pandas, the most popular Python library for data analysis, can help you perform these aggregations, and what is special about working with time. For reference, I will also show the equivalent syntax in SQL.
The data
For demonstration, I use the credit card transaction dataset from Kaggle. For simplicity, I focus on the `Amount` column and filter the data down to a single user, although the aggregations can always be extended to include more criteria. Information about time is spread across the `Year`, `Month`, `Day`, and `Time` columns, so it makes sense to combine it into a single column instead.
Since the whole dataset weighs around 2.35 GB, let’s transform the data on the fly using smaller batches.
```python
import pandas as pd
from tqdm import tqdm
from pathlib import Path

SRC = Path("data/credit_card_transactions-ibm_v2.csv")
DST = Path("data/transactions.csv")
USER = 0


def load(filepath=SRC):
    data = pd.read_csv(
        filepath,
        iterator=True,
        chunksize=10000,
        usecols=["User", "Year", "Month", "Day", "Time", "Amount"],
    )
    for df in tqdm(data):
        yield df


def process(df):
    _df = df.query("User == @USER").copy()
    ts = _df.apply(
        lambda x: f"{x['Year']}{x['Month']:02d}{x['Day']:02d} {x['Time']}",
        axis=1,
    )
    _df["timestamp"] = pd.to_datetime(ts)
    _df["amount"] = _df["Amount"].str.strip("$").astype(float)
    return _df.get(["timestamp", "amount"])


def main():
    for i, df in enumerate(load()):
        df = process(df)
        df.to_csv(
            DST,
            mode="a" if i else "w",
            header=not bool(i),
            index=False,
        )


if __name__ == "__main__":
    main()
```
| timestamp | amount |
|---|---|
| 2002-09-01 06:21:00 | 134.09 |
| 2002-09-01 06:42:00 | 38.48 |
| 2002-09-02 06:22:00 | 120.34 |
| 2002-09-02 17:45:00 | 128.95 |
| 2002-09-03 06:23:00 | 104.71 |
The “head” of this frame gives us the table above. For a single user (here `USER = 0`), we have almost 20k timestamps that mark transactions between 2002 and 2020 with one-minute resolution.
Thanks to `pd.to_datetime` in the `process` function, we convert the data concatenated from four columns and store it as an `np.datetime64` value, a single unified data type for describing time.
What is np.datetime64?
The `np.datetime64` type (doc) is a NumPy version of the pythonic `datetime.datetime` object. It is vectorized, which makes it possible to perform operations over entire arrays quickly. At the same time, the object supports the typical `datetime` methods (doc) that make manipulating the values feel natural.
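As a quick illustration (a minimal sketch using two made-up timestamps in the style of this dataset), arithmetic on `np.datetime64` arrays is vectorized and differences come back as `np.timedelta64` values:

```python
import numpy as np

# two timestamps at minute resolution
ts = np.array(["2002-09-01T06:21", "2002-09-01T06:42"], dtype="datetime64[m]")

# vectorized arithmetic: shift every timestamp by one day at once
shifted = ts + np.timedelta64(1, "D")  # → 2002-09-02T06:21, 2002-09-02T06:42

# subtracting timestamps yields np.timedelta64 values
delta = ts[1] - ts[0]  # → 21 minutes
```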
On the pandas side, the relevant objects are `Timestamp`, `Timedelta`, and `Period` (with the corresponding `DatetimeIndex`, `TimedeltaIndex`, and `PeriodIndex`), which describe moments in time, time shifts, and time spans, respectively. Underneath, however, they are still `np.datetime64`s (and the analogous `np.timedelta64`s) with all their handy properties.
Converting time-related values to these objects is the best starting point for any time-series analysis. It is convenient, and it is fast.
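For instance (a toy example built on one of the timestamps from the table above), the three pandas objects compose naturally:

```python
import pandas as pd

ts = pd.Timestamp("2002-09-01 06:21:00")  # a moment in time
td = pd.Timedelta("90min")                # a time shift
per = ts.to_period("D")                   # the containing day as a time span

shifted = ts + td           # Timestamp shifted by 90 minutes
day_start = per.start_time  # first instant of that day
contained = per.start_time <= ts <= per.end_time  # the span covers ts
```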
Basic resampling
The simplest form of time-series aggregation is to feed values into evenly spaced bins using an aggregating function. It helps adjust both the resolution and the volume of the data.
The following snippet shows an example of resampling to days using two functions: sum and count:
```sql
SELECT
    SUM(amount),
    COUNT(amount),
    DATE(timestamp) AS dt
FROM transactions
GROUP BY dt;
```
Pandas provides at least two ways to achieve the same result:
```python
# option 1
df["amount"].resample("D").agg(["sum", "count"])

# option 2
df["amount"].groupby(pd.Grouper(level=0, freq="D")).agg(["sum", "count"])
```
Both options are equivalent. The first is simpler and relies on the fact that the `timestamp` column has been set as the dataframe’s index, although it is also possible to use the optional `on` argument to point to a particular column. The second uses a more generic aggregation object, `pd.Grouper`, in combination with the `.groupby` method. It is highly customizable through many optional arguments. Here, I am using `level` as opposed to `key`, because `timestamp` is an index. Also, `freq="D"` stands for days. There are other frequency codes too, although the analogous SQL statement may be more complicated.
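To illustrate a few other frequency codes, here is a toy series standing in for the transaction amounts (not the actual dataset):

```python
import pandas as pd

# six fake transactions, two per day, at 12-hour spacing
idx = pd.date_range("2002-09-01", periods=6, freq="12h")
s = pd.Series([134.09, 38.48, 120.34, 128.95, 104.71, 50.00], index=idx)

# "D" = calendar days; other codes include "h" (hours), "W" (weeks),
# and month-end resampling ("M" or "ME", depending on the pandas version)
daily = s.resample("D").agg(["sum", "count"])
```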
Aggregations over several time spans
Say you want to aggregate data over multiple parts of the timestamp, such as `(year, week)` or `(month, day-of-week, hour)`. Because `timestamp` is of `np.datetime64` type, it is possible to refer to its methods using the so-called `.dt` accessor and use them as aggregation instructions.
In SQL, you would do:
```sql
SELECT
    AVG(amount),
    STRFTIME('%Y %W', timestamp) AS yearweek
FROM transactions
GROUP BY yearweek;
```
Here are two ways to do it in Pandas:
```python
# if we want `timestamp` to be a column
df = df.reset_index()
df["amount"].groupby(by=[
    df["timestamp"].dt.year,
    df["timestamp"].dt.isocalendar().week,
]).mean()

# if we want `timestamp` to be the index
df = df.set_index("timestamp")
df["amount"].groupby(by=[
    df.index.year,
    df.index.isocalendar().week,
]).mean()
```
They do the same thing.
| (year, week) | amount |
|---|---|
| (2002, 1) | 40.7375 |
| (2002, 35) | 86.285 |
| (2002, 36) | 82.3733 |
| (2002, 37) | 72.2048 |
| (2002, 38) | 91.8647 |
It is also worth mentioning that the `.groupby` method does not enforce using an aggregating function. All it does is slice the frame into a series of sub-frames. You may just as well take the individual “sub-frames” and perform some transformations on them directly. If that is the case, just iterate:
```python
for key, group in df.groupby(by=[df.index.year, df.index.isocalendar().week]):
    pass
```
Here, `key` will be a `(year, week)` tuple and `group` will be a sub-frame.
Remark
It is important to mention that the boundaries of the time windows may be defined differently in different flavors of SQL and in Pandas. When I used SQLite for comparison, each gave a slightly different result.
SQL:
```sql
SELECT
    timestamp,
    STRFTIME('%Y', timestamp) AS year,
    STRFTIME('%W', timestamp) AS week,
    STRFTIME('%w', timestamp) AS day
FROM transactions
LIMIT 5;
```
| timestamp | year | week | day |
|---|---|---|---|
| 2002-09-01 06:21:00 | 2002 | 34 | 0 |
| 2002-09-01 06:42:00 | 2002 | 34 | 0 |
| 2002-09-02 06:22:00 | 2002 | 35 | 1 |
| 2002-09-02 17:45:00 | 2002 | 35 | 1 |
| 2002-09-03 06:23:00 | 2002 | 35 | 2 |
Pandas:
```python
df.index.isocalendar().head()
```
| timestamp | year | week | day |
|---|---|---|---|
| 2002-09-01 06:21:00 | 2002 | 35 | 7 |
| 2002-09-01 06:42:00 | 2002 | 35 | 7 |
| 2002-09-02 06:22:00 | 2002 | 36 | 1 |
| 2002-09-02 17:45:00 | 2002 | 36 | 1 |
| 2002-09-03 06:23:00 | 2002 | 36 | 2 |
The concept is the same, but the reference is different.
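The difference can be checked on a single value: SQLite’s `%W` counts weeks from the year’s first Monday (starting at week 0), while `isocalendar()` follows the ISO-8601 convention, in which a Sunday belongs to the week of the preceding Monday. A minimal sketch, assuming a pandas version where `Timestamp` supports both methods:

```python
import pandas as pd

ts = pd.Timestamp("2002-09-01 06:21:00")  # a Sunday

# SQLite-style week number, as produced by STRFTIME('%W', ...)
sqlite_week = ts.strftime("%W")  # → "34"

# ISO-8601 week number, as produced by isocalendar()
iso_week = ts.isocalendar()[1]   # → 35
```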
Window functions
The last type of aggregation commonly used with time data is the rolling window. As opposed to grouping rows by the values of specific columns, this method defines an interval of rows, picks them out as a sub-table, applies a function, then shifts the window and repeats.
Let’s see an example of calculating a moving average of five consecutive rows (the current plus four into the past). In SQL, the syntax is the following:
```sql
SELECT
    timestamp,
    AVG(amount) OVER (
        ORDER BY timestamp
        ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
    ) rolling_avg
FROM transactions;
```
Pandas offers a much simpler syntax:
```python
# applying mean immediately
df["amount"].rolling(5).mean()

# accessing the chunks directly
for chunk in df["amount"].rolling(5):
    pass
```
Again, in Pandas, different adjustments can be made using optional arguments. The size of the window is dictated by the `window` argument, which in SQL takes a whole `ROWS BETWEEN ...` clause. In addition, we may want to center the window, use a different window type (e.g. weighted averaging), or perform optional data cleaning. The `Rolling` object returned by the `.rolling` method is, in a sense, used similarly to the `DataFrameGroupBy` objects.
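A short sketch of those optional arguments on toy data (the `win_type` line requires SciPy, so it is left commented out):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# center the label in the middle of the window instead of at its right edge;
# only index 2 has a full 5-row window here
centered = s.rolling(5, center=True).mean()

# relax the "full window" requirement at the edges
partial = s.rolling(5, min_periods=1).mean()  # partial[0] is just 1.0

# weighted averaging via a window shape (requires scipy):
# weighted = s.rolling(5, win_type="triang").mean()
```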
Conclusions
Here, I presented three types of aggregations I frequently use when working with time-series data. Although not all data that contains information about time is a time-series, for time-series it is almost always beneficial to convert the time information into `pd.Timestamp` or similar objects that build on NumPy’s `np.datetime64` underneath. As shown, it makes aggregating across different time properties convenient, intuitive, and even fun.
from Planet Python