Introduction
Pandas is an immensely popular data manipulation framework for Python. In a lot of cases, you might want to iterate over data - either to print it out, or perform some operations on it.
In this tutorial, we'll take a look at how to iterate over rows in a Pandas DataFrame.
If you're new to Pandas, you can read our beginner's tutorial. Once you're familiar, let's look at the three main ways to iterate over a DataFrame:
items()
iterrows()
itertuples()
Iterating DataFrames with items()
Let's set up a DataFrame with some data of fictional people:
import pandas as pd
df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
    'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
    'age': [34, 29, 37, 52, 26, 32]},
    index=['id001', 'id002', 'id003', 'id004', 'id005', 'id006'])
Note that we are using IDs as our DataFrame's index. Let's take a look at what the DataFrame looks like:
print(df.to_string())
      first_name last_name  age
id001       John     Smith   34
id002       Jane       Doe   29
id003      Marry   Jackson   37
id004   Victoria     Smith   52
id005    Gabriel     Brown   26
id006      Layla  Martinez   32
Now, to iterate over this DataFrame, we'll use the items() method:
df.items()
This returns a generator:
<generator object DataFrame.items at 0x7f3c064c1900>
We can use this to generate pairs of col_name and data. Each pair contains a column name and a Series holding every row of data for that column. Let's loop through the column names and their data:
for col_name, data in df.items():
    print("col_name:", col_name, "\ndata:", data)
This results in:
col_name: first_name
data:
id001 John
id002 Jane
id003 Marry
id004 Victoria
id005 Gabriel
id006 Layla
Name: first_name, dtype: object
col_name: last_name
data:
id001 Smith
id002 Doe
id003 Jackson
id004 Smith
id005 Brown
id006 Martinez
Name: last_name, dtype: object
col_name: age
data:
id001 34
id002 29
id003 37
id004 52
id005 26
id006 32
Name: age, dtype: int64
We've successfully iterated over all rows in each column. Notice that the index stays the same across iterations, since every column shares the same index labels. If you don't define an index, Pandas uses a default integer index that starts at 0.
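We can also print a particular row by passing its position to data.iloc, much as we would index a Python list:
for col_name, data in df.items():
    print("col_name:", col_name, "\ndata:", data.iloc[1])
Note that positions are zero-indexed, so data.iloc[1] refers to the second row. (Plain data[1] relies on a positional fallback that newer Pandas versions deprecate when the index holds labels such as 'id002'.) You will see this output: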
col_name: first_name
data: Jane
col_name: last_name
data: Doe
col_name: age
data: 29
We can also pass the index label to data:
for col_name, data in df.items():
    print("col_name:", col_name, "\ndata:", data['id002'])
The output would be the same as before:
col_name: first_name
data: Jane
col_name: last_name
data: Doe
col_name: age
data: 29
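The two kinds of lookup can be made explicit with the .iloc (position) and .loc (label) accessors. The following is a minimal sketch on a shortened copy of the DataFrame above; the second_row dictionary is a hypothetical name used only for illustration.

```python
import pandas as pd

# Shortened copy of the tutorial's DataFrame
df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Marry'],
    'age': [34, 29, 37]},
    index=['id001', 'id002', 'id003'])

second_row = {}
for col_name, data in df.items():
    by_position = data.iloc[1]     # second row, counted from zero
    by_label = data.loc['id002']   # row whose index label is 'id002'
    second_row[col_name] = (by_position, by_label)

print(second_row)
```

Both lookups return the same value here because 'id002' happens to be the label of the second row.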
Iterating DataFrames with iterrows()
While df.items() iterates column by column, running one cycle per column, iterrows() gives us the entire row of data for each index label.
Let's try iterating over the rows with iterrows():
for i, row in df.iterrows():
    print(f"Index: {i}")
    print(f"{row}\n")
In the for loop, i represents the index label (in our case, id001 for the first row) and row contains the data for that index across all columns. Our output would look like this:
Index: id001
first_name John
last_name Smith
age 34
Name: id001, dtype: object
Index: id002
first_name Jane
last_name Doe
age 29
Name: id002, dtype: object
Index: id003
first_name Marry
last_name Jackson
age 37
Name: id003, dtype: object
...
Likewise, we can pick out a single column from each row by passing either its position or its column name to row. For example, we can print the first column of each row by position:
for i, row in df.iterrows():
    print(f"Index: {i}")
    print(f"{row.iloc[0]}")
Or by column name:
for i, row in df.iterrows():
    print(f"Index: {i}")
    print(f"{row['first_name']}")
They both produce this output:
Index: id001
John
Index: id002
Jane
Index: id003
Marry
Index: id004
Victoria
Index: id005
Gabriel
Index: id006
Layla
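Because each row arrives as a Series keyed by column name, several columns can be combined in a single pass. A minimal sketch, assuming a shortened copy of the DataFrame above; full_names is a hypothetical name.

```python
import pandas as pd

# Shortened copy of the tutorial's DataFrame
df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Marry'],
    'last_name': ['Smith', 'Doe', 'Jackson'],
    'age': [34, 29, 37]},
    index=['id001', 'id002', 'id003'])

full_names = []
for i, row in df.iterrows():
    # i is the index label, row is a Series with one entry per column
    full_names.append(f"{i}: {row['first_name']} {row['last_name']}")

print(full_names)
```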
Iterating DataFrames with itertuples()
The itertuples() method also returns an iterator, which yields each row's values as a tuple. Let's try this out:
for row in df.itertuples():
    print(row)
You'll see this in your Python shell:
Pandas(Index='id001', first_name='John', last_name='Smith', age=34)
Pandas(Index='id002', first_name='Jane', last_name='Doe', age=29)
Pandas(Index='id003', first_name='Marry', last_name='Jackson', age=37)
Pandas(Index='id004', first_name='Victoria', last_name='Smith', age=52)
Pandas(Index='id005', first_name='Gabriel', last_name='Brown', age=26)
Pandas(Index='id006', first_name='Layla', last_name='Martinez', age=32)
The itertuples() method has two parameters: index and name.
We can leave out the index by setting the index parameter to False:
for row in df.itertuples(index=False):
    print(row)
Our tuples will no longer have the index displayed:
Pandas(first_name='John', last_name='Smith', age=34)
Pandas(first_name='Jane', last_name='Doe', age=29)
Pandas(first_name='Marry', last_name='Jackson', age=37)
Pandas(first_name='Victoria', last_name='Smith', age=52)
Pandas(first_name='Gabriel', last_name='Brown', age=26)
Pandas(first_name='Layla', last_name='Martinez', age=32)
As you've already noticed, this iterator yields namedtuples with the default name of Pandas. We can change this by passing People as the argument to the name parameter. You can choose any name you like, but it's always best to pick names relevant to your data:
for row in df.itertuples(index=False, name='People'):
    print(row)
Now our output would be:
People(first_name='John', last_name='Smith', age=34)
People(first_name='Jane', last_name='Doe', age=29)
People(first_name='Marry', last_name='Jackson', age=37)
People(first_name='Victoria', last_name='Smith', age=52)
People(first_name='Gabriel', last_name='Brown', age=26)
People(first_name='Layla', last_name='Martinez', age=32)
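Since the rows are namedtuples, their fields can be read either as attributes or by position. The sketch below assumes a shortened copy of the DataFrame above; Person, summaries and first are hypothetical names chosen for the example.

```python
import pandas as pd

# Shortened copy of the tutorial's DataFrame
df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Marry'],
    'last_name': ['Smith', 'Doe', 'Jackson'],
    'age': [34, 29, 37]},
    index=['id001', 'id002', 'id003'])

summaries = []
for person in df.itertuples(name='Person'):
    # Columns become attributes; the index is exposed as .Index
    summaries.append(f"{person.Index}: {person.first_name} is {person.age}")

# With index=False the first field is the first column
first = next(df.itertuples(index=False, name='Person'))

print(summaries)
print(first.first_name, first[0])
```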
Iteration Performance with Pandas
The official Pandas documentation warns that iteration is a slow process. If you're iterating over a DataFrame to modify the data, vectorization would be a quicker alternative. Also, it's discouraged to modify data while iterating over rows, as Pandas sometimes returns a copy of the row's data rather than a reference to it, which means that not all of the data will actually be changed.
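To illustrate the difference, here is a minimal sketch that derives a new value from the age column twice: once with a row loop and once with a vectorized column operation. The age_next_year column and the looped list are hypothetical names introduced for this example.

```python
import pandas as pd

# Shortened copy of the tutorial's DataFrame
df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Marry'],
    'age': [34, 29, 37]},
    index=['id001', 'id002', 'id003'])

# Loop version: collect results row by row, leaving df untouched inside the loop
looped = []
for row in df.itertuples():
    looped.append(row.age + 1)

# Vectorized version: one operation over the whole column
df['age_next_year'] = df['age'] + 1

print(df)
```

Both approaches compute the same values; the vectorized form avoids the Python-level loop and assigns the result in one step instead of changing rows during iteration.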
For small datasets you can use the to_string() method to display all the data. For larger datasets that have many columns and rows, you can use the head(n) or tail(n) methods to print out the first or last n rows of your DataFrame (the default value for n is 5).
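A short sketch of both methods, assuming a hypothetical 100-row DataFrame with a single value column:

```python
import pandas as pd

# Hypothetical larger DataFrame, used only for illustration
df = pd.DataFrame({'value': range(100)})

first_rows = df.head()    # n defaults to 5, so this is the first 5 rows
last_rows = df.tail(3)    # the last 3 rows

print(first_rows)
print(last_rows)
```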
Speed Comparison
To measure the speed of each method, we wrapped them in functions that execute them 1000 times and return the average execution time.
To test these methods, we will use both the print() and list.append() functions to provide better comparison data and to cover common use cases. To make the comparison fair, we will iterate over the DataFrame and print or append only one value per loop.
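The benchmark code itself is not shown, so the following is only a sketch of how such a measurement could be set up with the standard timeit module, covering the append() case. The function names, the average_time helper and the reduced run count are assumptions made for this example, not the setup behind the figures reported here.

```python
import timeit
import pandas as pd

# Shortened copy of the tutorial's DataFrame
df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Marry'],
    'last_name': ['Smith', 'Doe', 'Jackson'],
    'age': [34, 29, 37]},
    index=['id001', 'id002', 'id003'])

def append_with_items():
    values = []
    for col_name, data in df.items():
        values.append(col_name)      # one value per loop
    return values

def append_with_iterrows():
    values = []
    for i, row in df.iterrows():
        values.append(i)             # one value per loop
    return values

def append_with_itertuples():
    values = []
    for row in df.itertuples():
        values.append(row.Index)     # one value per loop
    return values

def average_time(func, runs=1000):
    # timeit.timeit returns the total time for `runs` calls
    return timeit.timeit(func, number=runs) / runs

# runs is lowered to 100 here (an assumption) to keep the sketch quick
timings = {
    func.__name__: average_time(func, runs=100)
    for func in (append_with_items, append_with_iterrows, append_with_itertuples)
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

Note that items() loops once per column while the other two loop once per row, so the number of iterations differs between the functions.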
Here's what the return values look like for each method.
For example, items() would cycle column by column:
('first_name',
id001 John
id002 Jane
id003 Marry
id004 Victoria
id005 Gabriel
id006 Layla
Name: first_name, dtype: object)
iterrows() would provide all column data for a particular row:
('id001',
first_name John
last_name Smith
age 34
Name: id001, dtype: object)
And finally, a single row from itertuples() would look like this:
Pandas(Index='id001', first_name='John', last_name='Smith', age=34)
Here are the average results in seconds:
Method        | Speed (s)            | Test Function
items()       | 1.349279541666571    | print()
iterrows()    | 3.4104003086661883   | print()
itertuples()  | 0.41232967500279     | print()

Method        | Speed (s)            | Test Function
items()       | 0.006637570998767235 | append()
iterrows()    | 0.5749766406661365   | append()
itertuples()  | 0.3058610513350383   | append()
Printing values generally takes more time and resources than appending, and our examples are no exception. While itertuples() performs best when combined with print(), the items() method dramatically outperforms the others when used with append(), and iterrows() comes last in each comparison.
Please note that these test results depend heavily on other factors such as the OS, environment, and computational resources. The size of your data will also have an impact on your results.
Conclusion
We've learned how to iterate over a DataFrame with three different Pandas methods: items(), iterrows(), and itertuples(). Depending on your data and preferences, you can use any of them in your projects.