Every weekday, I share a new "pandas trick" on social media. Each trick takes only a minute to read, yet you'll learn something new that will save you time and energy in the future!
Here's my latest trick:
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) September 4, 2019
Want to combine the output of an aggregation with the original DataFrame?
Instead of: df.groupby('col1').col2.func()
Use: df.groupby('col1').col2.transform(func)
"transform" changes the output shape
See example 👇 #Python #DataScience #pandas #pandastricks
Want to read the 59 tricks that I've already posted? See below 👇
Want to see the daily trick in your social media feed? Follow me on Twitter, Facebook, LinkedIn, and YouTube
Want to watch a live demo of my top 25 tricks? Watch this video 🎥
Want to support daily pandas tricks? Become a Data School Insider!
Categories
- Reading files
- Creating example DataFrames
- Renaming columns
- Selecting rows and columns
- Filtering rows by condition
- Manipulating strings
- Working with data types
- Encoding data
- Extracting data from lists
- Working with time series data
- Handling missing values
- Using aggregation functions
- Random sampling
- Merging DataFrames
- Styling DataFrames
- Exploring a dataset
- Other
Reading files
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 19, 2019
5 useful "read_csv" parameters that are often overlooked:
➡️ names: specify column names
➡️ usecols: which columns to keep
➡️ dtype: specify data types
➡️ nrows: # of rows to read
➡️ na_values: strings to recognize as NaN #Python #DataScience #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) September 3, 2019
⚠️ Got bad data (or empty rows) at the top of your CSV file? Use these read_csv parameters:
➡️ header = row number of header (start counting at 0)
➡️ skiprows = list of row numbers to skip
See example 👇 #Python #DataScience #pandas #pandastricks
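The pictured example isn't included here, so here is a minimal sketch of the idea with a made-up in-memory CSV (the file contents are hypothetical):

```python
import io
import pandas as pd

# Hypothetical CSV with two junk lines above the real header
raw = "junk,junk\nmore junk,junk\nA,B\n1,2\n3,4\n"

# header=2: use row 2 (counting from 0) as the header
df = pd.read_csv(io.StringIO(raw), header=2)

# Equivalent here: skiprows=[0, 1] skips the junk rows explicitly
df2 = pd.read_csv(io.StringIO(raw), skiprows=[0, 1])
```

Both calls produce the same DataFrame with columns `A` and `B` and two data rows.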
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) June 21, 2019
Two easy ways to reduce DataFrame memory usage:
1. Only read in columns you need
2. Use 'category' data type with categorical data.
Example:
df = pd.read_csv('file.csv', usecols=['A', 'C', 'D'], dtype={'D':'category'}) #Python #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 4, 2019
You can read directly from a compressed file:
df = pd.read_csv('file.csv.zip')
Or write to a compressed file:
df.to_csv('file.csv.zip')
Also supported: .gz, .bz2, .xz #Python #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) June 20, 2019
Are your dataset rows spread across multiple files, but you need a single DataFrame?
Solution:
1. Use glob() to list your files
2. Use a generator expression to read files and concat() to combine them
3. 🥳
See example 👇 #Python #DataScience #pandastricks
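The tweeted example image isn't shown here; a self-contained sketch of the same steps, writing two throwaway CSV files so glob() has something to find (the file names are made up):

```python
import glob
import os
import tempfile
import pandas as pd

# Stand-in for a dataset split across multiple CSV files
tmpdir = tempfile.mkdtemp()
pd.DataFrame({'A': [1, 2]}).to_csv(os.path.join(tmpdir, 'data1.csv'), index=False)
pd.DataFrame({'A': [3, 4]}).to_csv(os.path.join(tmpdir, 'data2.csv'), index=False)

# 1. glob() lists the files (sorted for a deterministic row order)
files = sorted(glob.glob(os.path.join(tmpdir, 'data*.csv')))

# 2. a generator expression reads each file; concat() combines them
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
```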
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 15, 2019
Need to quickly get data from Excel or Google Sheets into pandas?
1. Copy data to clipboard
2. df = pd.read_clipboard()
3. 🥳
See example 👇
Learn 25 more tips & tricks: https://t.co/6akbxXG6SI #Python #DataScience #pandas #pandastricks
Creating example DataFrames
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) June 28, 2019
Need to create an example DataFrame? Here are 3 easy options:
pd.DataFrame({'col_one':[10, 20], 'col_two':[30, 40]})
pd.DataFrame(np.random.rand(2, 3), columns=list('abc'))
pd.util.testing.makeMixedDataFrame()
See output 👇 #Python #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 10, 2019
Need to create a DataFrame for testing?
pd.util.testing.makeDataFrame() ➡️ contains random values
.makeMissingDataframe() ➡️ some values missing
.makeTimeDataFrame() ➡️ has DateTimeIndex
.makeMixedDataFrame() ➡️ mixed data types #Python #pandas #pandastricks
Renaming columns
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 16, 2019
3 ways to rename columns:
1. Most flexible option:
df = df.rename({'A':'a', 'B':'b'}, axis='columns')
2. Overwrite all column names:
df.columns = ['a', 'b']
3. Apply string method:
df.columns = df.columns.str.lower() #Python #DataScience #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) June 11, 2019
Add a prefix to all of your column names:
df.add_prefix('X_')
Add a suffix to all of your column names:
df.add_suffix('_Y') #Python #DataScience
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) June 25, 2019
Need to rename all of your columns in the same way? Use a string method:
Replace spaces with _:
df.columns = df.columns.str.replace(' ', '_')
Make lowercase & remove trailing whitespace:
df.columns = df.columns.str.lower().str.rstrip() #Python #pandastricks
Selecting rows and columns
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 3, 2019
Need to select multiple rows/columns? "loc" is usually the solution:
select a slice (inclusive):
df.loc[0:4, 'col_A':'col_D']
select a list:
df.loc[[0, 3], ['col_A', 'col_C']]
select by condition:
df.loc[df.col_A=='val', 'col_D'] #Python #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 1, 2019
"loc" selects by label, and "iloc" selects by position.
But what if you need to select by label *and* position? You can still use loc or iloc!
See example 👇
P.S. Don't use "ix"; it has been deprecated since 2017. #Python #DataScience #pandas #pandastricks
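The example image isn't reproduced here; one plausible sketch of mixing label- and position-based selection (the DataFrame is made up):

```python
import pandas as pd

df = pd.DataFrame({'city': ['NY', 'LA', 'SF'], 'pop': [8, 4, 1]},
                  index=['a', 'b', 'c'])

# loc: convert the row positions to labels via df.index
by_loc = df.loc[df.index[0:2], 'city']

# iloc: convert the column label to a position via get_loc()
by_iloc = df.iloc[0:2, df.columns.get_loc('city')]
```

Both return the first two rows of the `city` column, so you never need `ix`.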
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) June 12, 2019
Reverse column order in a DataFrame:
df.loc[:, ::-1]
Reverse row order:
df.loc[::-1]
Reverse row order and reset the index:
df.loc[::-1].reset_index(drop=True)
Want more #pandastricks? Working on a video right now, stay tuned... 🎥 #Python #DataScience
Filtering rows by condition
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) June 13, 2019
Filter DataFrame by multiple OR conditions:
df[(df.color == 'red') | (df.color == 'green') | (df.color == 'blue')]
Shorter way:
df[df.color.isin(['red', 'green', 'blue'])]
Invert the filter:
df[~df.color.isin(['red', 'green', 'blue'])] #Python #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 28, 2019
Are you trying to filter a DataFrame using lots of criteria? It can be hard to write ✏️ and to read!
Instead, save the criteria as objects and use them to filter. Or, use reduce() to combine the criteria!
See example 👇 #Python #DataScience #pandastricks
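The linked example isn't included; a minimal sketch of both variants with invented data:

```python
from functools import reduce
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green'], 'qty': [1, 5, 9]})

# Save each criterion as its own object, then combine them
crit1 = df.color.isin(['red', 'green'])
crit2 = df.qty > 2
result = df[crit1 & crit2]

# Or combine any number of criteria with reduce()
criteria = [crit1, crit2]
mask = reduce(lambda a, b: a & b, criteria)
result2 = df[mask]
```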
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 25, 2019
Want to filter a DataFrame that doesn't have a name?
Use the query() method to avoid creating an intermediate variable!
See example 👇 #Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 13, 2019
Need to refer to a local variable within a query() string? Just prefix it with the @ symbol!
See example 👇 #Python #DataScience #pandas #pandastricks
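The pictured example isn't shown; a tiny sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'price': [5, 15, 25]})
threshold = 10  # an ordinary local Python variable

# '@threshold' refers to the local variable inside the query string
cheap = df.query('price < @threshold')
```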
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 30, 2019
If you want to use query() on a column name containing a space, just surround it with backticks! (New in pandas 0.25)
See example 👇 #Python #DataScience #pandas #pandastricks
Manipulating strings
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 22, 2019
Want to concatenate two string columns?
Option 1: Use a string method
Option 2: Use plus signs ➕
See example 👇
Which option do you prefer, and why? #Python #DataScience #pandas #pandastricks
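The example image isn't reproduced; both options sketched with invented columns:

```python
import pandas as pd

df = pd.DataFrame({'first': ['Jane', 'John'], 'last': ['Doe', 'Smith']})

# Option 1: a string method
df['name'] = df['first'].str.cat(df['last'], sep=' ')

# Option 2: plus signs
df['name2'] = df['first'] + ' ' + df['last']
```

Both produce identical results; `str.cat` is arguably clearer when joining many columns with one separator.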
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 9, 2019
Need to split a string into multiple columns? Use the str.split() method with expand=True to return a DataFrame, and assign the result to the original DataFrame.
See example 👇 #Python #DataScience #pandas #pandastricks
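The linked image isn't shown; the steps above sketched with a made-up column:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Jane Doe', 'John Smith']})

# expand=True makes str.split() return a DataFrame,
# which can be assigned back to new columns
df[['first', 'last']] = df.name.str.split(' ', expand=True)
```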
Working with data types
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) June 17, 2019
Numbers stored as strings? Try astype():
df.astype({'col1':'int', 'col2':'float'})
But it will fail if you have any invalid input. Better way:
df.apply(pd.to_numeric, errors='coerce')
Converts invalid input to NaN #Python #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) June 14, 2019
Select columns by data type:
df.select_dtypes(include='number')
df.select_dtypes(include=['number', 'category', 'object'])
df.select_dtypes(exclude=['datetime', 'timedelta']) #Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 8, 2019
Two useful properties of ordered categories:
1️⃣ You can sort the values in logical (not alphabetical) order
2️⃣ Comparison operators also work logically
See example 👇 #Python #DataScience #pandas #pandastricks
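The example image isn't included; both properties sketched with an invented `quality` column:

```python
import pandas as pd

df = pd.DataFrame({'quality': ['good', 'very good', 'poor']})

# Define the logical order of the categories
qtype = pd.api.types.CategoricalDtype(['poor', 'good', 'very good'],
                                      ordered=True)
df['quality'] = df.quality.astype(qtype)

# 1️⃣ sort_values() now sorts logically, not alphabetically
logical = df.sort_values('quality')

# 2️⃣ comparison operators respect the same order
better_than_poor = df[df.quality > 'poor']
```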
Encoding data
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 2, 2019
Need to convert a column from continuous to categorical? Use cut():
df['age_groups'] = pd.cut(df.age, bins=[0, 18, 65, 99], labels=['child', 'adult', 'elderly'])
0 to 18 ➡️ 'child'
18 to 65 ➡️ 'adult'
65 to 99 ➡️ 'elderly' #Python #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 5, 2019
Want to dummy encode (or "one hot encode") your DataFrame? Use pd.get_dummies(df) to encode all object & category columns.
Want to drop the first level since it provides redundant info? Set drop_first=True.
See example & read thread 👇 #Python #pandastricks
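The linked example isn't reproduced; a minimal sketch with a made-up column:

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red']})

dummies = pd.get_dummies(df)                   # one column per level
reduced = pd.get_dummies(df, drop_first=True)  # redundant first level dropped
```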
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 30, 2019
Need to apply the same mapping to multiple columns at once? Use "applymap" (DataFrame method) with "get" (dictionary method).
See example 👇 #Python #DataScience #pandas #pandastricks
Extracting data from lists
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) June 27, 2019
Has your data ever been TRAPPED in a Series of Python lists?
Expand the Series into a DataFrame by using apply() and passing it the Series constructor.
See example 👇 #Python #DataScience #pandas #pandastricks
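The example image isn't included; the one-liner sketched with invented lists:

```python
import pandas as pd

s = pd.Series([[10, 20], [30, 40]])

# apply() with the Series constructor expands each list into columns
df = s.apply(pd.Series)
```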
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 12, 2019
Do you have a Series containing lists of items? Create one row for each item using the "explode" method 💥
New in pandas 0.25! See example 👇
🤯 #Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 14, 2019
Does your Series contain comma-separated items? Create one row for each item:
✂️ "str.split" creates a list of strings
⬅️ "assign" overwrites the existing column
💥 "explode" creates the rows (new in pandas 0.25)
See example 👇 #Python #pandas #pandastricks
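The pictured example isn't shown; the three steps chained together with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'tags': ['a,b', 'c']})

result = (df.assign(tags=df.tags.str.split(','))  # ✂️ lists of strings
            .explode('tags'))                     # 💥 one row per item
```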
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 16, 2019
💥 "explode" takes a list of items and creates one row for each item (new in pandas 0.25)
You can also do the reverse! See example 👇
Thanks to @EForEndeavour for this tip! #Python #DataScience #pandas #pandastricks
Working with time series data
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 8, 2019
If you need to create a single datetime column from multiple columns, you can use to_datetime()!
See example 👇
You must include: month, day, year
You can also include: hour, minute, second #Python #DataScience #pandas #pandastricks
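The example image isn't reproduced; a minimal sketch with invented values:

```python
import pandas as pd

df = pd.DataFrame({'month': [1, 2], 'day': [15, 28], 'year': [2019, 2019]})

# to_datetime() recognizes the month/day/year column names automatically
df['date'] = pd.to_datetime(df[['month', 'day', 'year']])
```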
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 2, 2019
One reason to use the datetime data type is that you can access many useful attributes via "dt", like:
df.column.dt.hour
Other attributes include: year, month, day, dayofyear, week, weekday, quarter, days_in_month...
See full list 👇 #Python #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 18, 2019
Need to perform an aggregation (sum, mean, etc) with a given frequency (monthly, yearly, etc)?
Use resample! It's like a "groupby" for time series data. See example 👇
"Y" means yearly. See list of frequencies: https://t.co/oPDx85yqFT #Python #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 27, 2019
Want to calculate the difference between each row and the previous row? Use df.col_name.diff()
Want to calculate the percentage change instead? Use df.col_name.pct_change()
See example 👇 #Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 31, 2019
Need to convert a datetime Series from UTC to another time zone?
1. Set current time zone ➡️ tz_localize('UTC')
2. Convert ➡️ tz_convert('America/Chicago')
Automatically handles Daylight Saving Time!
See example 👇 #Python #DataScience #pandastricks
Handling missing values
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) June 19, 2019
Calculate % of missing values in each column:
df.isna().mean()
Drop columns with any missing values:
df.dropna(axis='columns')
Drop columns in which more than 10% of values are missing:
df.dropna(thresh=len(df)*0.9, axis='columns') #Python #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 12, 2019
Need to fill missing values in your time series data? Use df.interpolate()
Defaults to linear interpolation, but many other methods are supported!
Want more pandas tricks? Watch this:
🎥 https://t.co/6akbxXXHKg #Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 15, 2019
Do you need to store missing values ("NaN") in an integer Series? Use the "Int64" data type!
See example 👇
(New in v0.24; the API is experimental and subject to change) #Python #DataScience #pandas #pandastricks
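The pictured example isn't shown; the nullable dtype in one line:

```python
import pandas as pd

# A regular int column can't hold NaN, but the nullable "Int64"
# dtype (note the capital I) can
s = pd.Series([1, 2, None], dtype='Int64')
```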
Using aggregation functions
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 19, 2019
Instead of aggregating by a single function (such as 'mean'), you can aggregate by multiple functions by using 'agg' (and passing it a list of functions) or by using 'describe' (for summary statistics).
See example 👇 #Python #DataScience #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 9, 2019
Did you know that "last" is an aggregation function, just like "sum" and "mean"?
It can be used with a groupby to extract the last value in each group. See example 👇
P.S. You can also use the "first" and "nth" functions! #Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 21, 2019
Are you applying multiple aggregations after a groupby? Try "named aggregation":
✅ Allows you to name the output columns
❌ Avoids a column MultiIndex
New in pandas 0.25! See example 👇 #Python #DataScience #pandas #pandastricks
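The example image isn't included; named aggregation sketched with invented data:

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 3]})

# Named aggregation: keyword = (column, function)
# The keywords become the output column names; no MultiIndex
result = df.groupby('g').agg(x_mean=('x', 'mean'), x_max=('x', 'max'))
```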
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) September 4, 2019
Want to combine the output of an aggregation with the original DataFrame?
Instead of: df.groupby('col1').col2.func()
Use: df.groupby('col1').col2.transform(func)
"transform" changes the output shape
See example 👇 #Python #DataScience #pandas #pandastricks
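The pictured example isn't reproduced; the shape difference sketched with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b'], 'col2': [10, 20, 30]})

# mean() collapses to one row per group...
per_group = df.groupby('col1').col2.mean()

# ...transform('mean') returns one value per original row,
# so the result aligns with the original DataFrame
df['group_mean'] = df.groupby('col1').col2.transform('mean')
```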
Random sampling
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 20, 2019
Randomly sample rows from a DataFrame:
df.sample(n=10)
df.sample(frac=0.25)
Useful parameters:
➡️ random_state: use any integer for reproducibility
➡️ replace: sample with replacement
➡️ weights: weight based on values in a column #Python #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 26, 2019
Want to shuffle your DataFrame rows?
df.sample(frac=1, random_state=0)
Want to reset the index after shuffling?
df.sample(frac=1, random_state=0).reset_index(drop=True) #Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) June 18, 2019
Split a DataFrame into two random subsets:
df_1 = df.sample(frac=0.75, random_state=42)
df_2 = df.drop(df_1.index)
(Only works if df's index values are unique)
P.S. Working on a video of my 25 best #pandastricks, stay tuned! 📺 #Python #pandas #DataScience
Merging DataFrames
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 23, 2019
When you are merging DataFrames, you can identify the source of each row (left/right/both) by setting indicator=True.
See example 👇
P.S. Learn 25 more #pandastricks in 25 minutes: https://t.co/6akbxXG6SI #Python #DataScience #pandas
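The example image isn't shown; a minimal sketch with invented keys:

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2]})
right = pd.DataFrame({'key': [2, 3]})

# indicator=True adds a "_merge" column: left_only / right_only / both
merged = pd.merge(left, right, how='outer', indicator=True)
```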
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) June 26, 2019
Merging datasets? Check that merge keys are unique in BOTH datasets:
pd.merge(left, right, validate='one_to_one')
✅ Use 'one_to_many' to only check uniqueness in LEFT
✅ Use 'many_to_one' to only check uniqueness in RIGHT #Python #DataScience #pandastricks
Styling DataFrames
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 6, 2019
Two simple ways to style a DataFrame:
1️⃣ df.style.hide_index()
2️⃣ df.style.set_caption('My caption')
See example 👇
For more style options, watch trick #25: https://t.co/6akbxXG6SI 📺 #Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 17, 2019
Want to add formatting to your DataFrame? For example:
- hide the index
- add a caption
- format numbers & dates
- highlight min & max values
Watch 👇 to learn how!
Code: https://t.co/HKroWYVIEs
25 more tricks: https://t.co/6akbxXG6SI #Python #pandastricks
Exploring a dataset
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 29, 2019
Want to explore a new dataset without too much work?
1. Pick one:
➡️ pip install pandas-profiling
➡️ conda install -c conda-forge pandas-profiling
2. import pandas_profiling
3. df.profile_report()
4. 🥳
See example 👇 #Python #DataScience #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) June 24, 2019
Need to check if two Series contain the same elements?
❌ Don't do this:
df.A == df.B
✅ Do this:
df.A.equals(df.B)
✅ Also works for DataFrames:
df.equals(df2)
equals() properly handles NaNs, whereas == does not #Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 24, 2019
Want to examine the "head" of a wide DataFrame, but can't see all of the columns?
Solution #1: Change display options to show all columns
Solution #2: Transpose the head (swaps rows and columns)
See example 👇 #Python #DataScience #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) August 23, 2019
Want to plot a DataFrame? It's as easy as:
df.plot(kind='...')
You can use:
line 📈
bar 📊
barh
hist
box 📦
kde
area
scatter
hexbin
pie 🥧
Other plot types are available via pd.plotting!
Examples: https://t.co/fXYtPeVpZX #Python #dataviz #pandastricks
Other
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) September 2, 2019
If you've created a groupby object, you can access any of the groups (as a DataFrame) using the get_group() method.
See example 👇 #Python #DataScience #pandas #pandastricks
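The pictured example isn't included; a minimal sketch with made-up groups:

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 3]})
gb = df.groupby('g')

# get_group() returns one group's rows as a DataFrame
group_a = gb.get_group('a')
```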
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 1, 2019
Do you have a Series with a MultiIndex?
Reshape it into a DataFrame using the unstack() method. It's easier to read, plus you can interact with it using DataFrame methods!
See example 👇
P.S. Want a video with my top 25 #pandastricks? 📺 #Python #pandas
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 26, 2019
There are many display options you can change:
max_rows
max_columns
max_colwidth
precision
date_dayfirst
date_yearfirst
How to use:
pd.set_option('display.max_rows', 80)
pd.reset_option('display.max_rows')
See all:
pd.describe_option() #Python #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 5, 2019
Show total memory usage of a DataFrame:
df.info(memory_usage='deep')
Show memory used by each column:
df.memory_usage(deep=True)
Need to reduce? Drop unused columns, or convert object columns to 'category' type. #Python #pandas #pandastricks
🐼🤹‍♂️ pandas trick:
— Kevin Markham (@justmarkham) July 22, 2019
Want to use NumPy without importing it? You can access ALL of its functionality from within pandas! See example 👇
This is probably *not* a good idea since it breaks with a long-standing convention. But it's a neat trick! #Python #pandas #pandastricks
from Planet Python