Wednesday, October 21, 2020

Stack Abuse: How to Iterate over Rows in a Pandas DataFrame

Introduction

Pandas is an immensely popular data manipulation framework for Python. In a lot of cases, you might want to iterate over data, either to print it out or to perform some operations on it.

In this tutorial, we'll take a look at how to iterate over rows in a Pandas DataFrame.

If you're new to Pandas, you can read our beginner's tutorial. Once you're familiar, let's look at the three main ways to iterate over a DataFrame:

  • items()
  • iterrows()
  • itertuples()

Iterating DataFrames with items()

Let's set up a DataFrame with some data of fictional people:

import pandas as pd

df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
    'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
    'age': [34, 29, 37, 52, 26, 32]},
    index=['id001', 'id002', 'id003', 'id004', 'id005', 'id006'])

Note that we are using IDs as our DataFrame's index. Let's take a look at what the DataFrame looks like:

print(df.to_string())
      first_name last_name  age
id001       John     Smith   34
id002       Jane       Doe   29
id003      Marry   Jackson   37
id004   Victoria     Smith   52
id005    Gabriel     Brown   26
id006      Layla  Martinez   32

Now, to iterate over this DataFrame, we'll use the items() method:

df.items()

This returns a generator:

<generator object DataFrame.items at 0x7f3c064c1900>

We can use this to generate pairs of col_name and data. These pairs will contain a column name and every row of data for that column. Let's loop through column names and their data:

for col_name, data in df.items():
    print("col_name:", col_name, "\ndata:", data)

This results in:

col_name: first_name
data: 
id001        John
id002        Jane
id003       Marry
id004    Victoria
id005     Gabriel
id006       Layla
Name: first_name, dtype: object
col_name: last_name
data: 
id001       Smith
id002         Doe
id003     Jackson
id004       Smith
id005       Brown
id006    Martinez
Name: last_name, dtype: object
col_name: age
data: 
id001    34
id002    29
id003    37
id004    52
id005    26
id006    32
Name: age, dtype: int64

We've successfully iterated over every column and, within each one, all of its rows. Notice that the index stays the same across iterations, since it labels the values in every column. If you don't define an index, Pandas assigns a default integer index that starts at 0.
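To illustrate the default index, here is a minimal sketch. The df_default name and its two-row data are made up for this example and are not part of the original DataFrame:

```python
import pandas as pd

# Hypothetical two-row frame built WITHOUT an explicit index argument
df_default = pd.DataFrame({
    'first_name': ['John', 'Jane'],
    'age': [34, 29]})

# With no index given, the rows are labelled 0, 1, ...
index_labels = list(df_default.index)
print(index_labels)  # [0, 1]

# Every column yielded by items() carries that same index
for col_name, data in df_default.items():
    print(col_name, list(data.index))
```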

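We can also print a particular row by passing its position to the data, much as we would with a Python list. Because our index holds string labels, we use .iloc for positional access; plain data[1] relies on a positional fallback that newer Pandas versions deprecate:

for col_name, data in df.items():
    print("col_name:", col_name, "\ndata:", data.iloc[1])

Note that positions are zero-indexed, so data.iloc[1] refers to the second row. You will see this output: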

col_name: first_name 
data: Jane
col_name: last_name 
data: Doe
col_name: age 
data: 29

We can also pass the index label to data:

for col_name, data in df.items():
    print("col_name:", col_name, "\ndata:", data['id002'])

The output would be the same as before:

col_name: first_name
data: Jane
col_name: last_name
data: Doe
col_name: age
data: 29
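The difference between looking a value up by position and by label can be made explicit with the .iloc and .loc indexers. The following is a small sketch on a standalone Series; the ages variable is introduced here only for illustration:

```python
import pandas as pd

# Illustrative Series with the same kind of string labels as our index
ages = pd.Series([34, 29, 37],
                 index=['id001', 'id002', 'id003'],
                 name='age')

by_position = ages.iloc[1]    # second element, counting from zero
by_label = ages.loc['id002']  # element whose index label is 'id002'

print(by_position, by_label)
```

Both lookups reach the same value here, but only because 'id002' happens to be the label of the second row.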

Iterating DataFrames with iterrows()

While df.items() works column by column, running one cycle per column, iterrows() gives us the entire row of data for each index value.

Let's try iterating over the rows with iterrows():

for i, row in df.iterrows():
    print(f"Index: {i}")
    print(f"{row}\n")

In the for loop, i represents the index value (in our case, id001 for the first row) and row contains the data for that index across all columns. Our output would look like this:

Index: id001
first_name     John
last_name     Smith
age              34
Name: id001, dtype: object

Index: id002
first_name    Jane
last_name      Doe
age             29
Name: id002, dtype: object

Index: id003
first_name      Marry
last_name     Jackson
age                37
Name: id003, dtype: object

...

Likewise, we can pick out a single column from each row by passing either its position or the column name to the row. For example, we can print the first column of each row by position using .iloc:

for i, row in df.iterrows():
    print(f"Index: {i}")
    print(f"{row.iloc[0]}")

Or:

for i, row in df.iterrows():
    print(f"Index: {i}")
    print(f"{row['first_name']}")

They both produce this output:

Index: id001
John
Index: id002
Jane
Index: id003
Marry
Index: id004
Victoria
Index: id005
Gabriel
Index: id006
Layla
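One detail worth keeping in mind: each row that iterrows() yields is a single Series, so its values share one dtype rather than keeping their per-column dtypes. The sketch below uses a made-up numeric frame (numbers, count, and ratio are illustrative names) to show the effect:

```python
import pandas as pd

# Hypothetical frame with one integer column and one float column
numbers = pd.DataFrame({'count': [1, 2], 'ratio': [0.5, 1.5]})

# Within the DataFrame, the column keeps its integer dtype
column_dtype = numbers['count'].dtype

# The row is a single Series, so both values share a floating-point dtype
_, first_row = next(numbers.iterrows())
row_dtype = first_row.dtype

print(column_dtype, row_dtype)
```

If preserving column dtypes matters, itertuples() may be the better fit, since it yields plain tuples of values rather than a Series.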

Iterating DataFrames with itertuples()

The itertuples() method also returns an iterator, one that yields each row's values as a tuple. Let's try this out:

for row in df.itertuples():
    print(row)

You'll see this in your Python shell:

Pandas(Index='id001', first_name='John', last_name='Smith', age=34)
Pandas(Index='id002', first_name='Jane', last_name='Doe', age=29)
Pandas(Index='id003', first_name='Marry', last_name='Jackson', age=37)
Pandas(Index='id004', first_name='Victoria', last_name='Smith', age=52)
Pandas(Index='id005', first_name='Gabriel', last_name='Brown', age=26)
Pandas(Index='id006', first_name='Layla', last_name='Martinez', age=32)

The itertuples() method has two parameters: index and name.

We can omit the index from each tuple by setting the index parameter to False:

for row in df.itertuples(index=False):
    print(row)

Our tuples will no longer have the index displayed:

Pandas(first_name='John', last_name='Smith', age=34)
Pandas(first_name='Jane', last_name='Doe', age=29)
Pandas(first_name='Marry', last_name='Jackson', age=37)
Pandas(first_name='Victoria', last_name='Smith', age=52)
Pandas(first_name='Gabriel', last_name='Brown', age=26)
Pandas(first_name='Layla', last_name='Martinez', age=32)

As you've already noticed, the tuples are namedtuples with the default name Pandas. We can change this by passing a different value, such as 'People', to the name parameter. You can choose any name you like, but it's always best to pick names relevant to your data:

for row in df.itertuples(index=False, name='People'):
    print(row)

Now our output would be:

People(first_name='John', last_name='Smith', age=34)
People(first_name='Jane', last_name='Doe', age=29)
People(first_name='Marry', last_name='Jackson', age=37)
People(first_name='Victoria', last_name='Smith', age=52)
People(first_name='Gabriel', last_name='Brown', age=26)
People(first_name='Layla', last_name='Martinez', age=32)
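Because the rows are namedtuples, their fields can be read as attributes. Here is a brief sketch on a shortened copy of our data; the name 'Person' and the full_names list are choices made for this example:

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['John', 'Jane'],
    'last_name': ['Smith', 'Doe'],
    'age': [34, 29]},
    index=['id001', 'id002'])

full_names = []
for row in df.itertuples(name='Person'):
    # Each column is an attribute; the index is exposed as row.Index
    full_names.append(f"{row.Index}: {row.first_name} {row.last_name}")

print(full_names)  # ['id001: John Smith', 'id002: Jane Doe']
```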

Iteration Performance with Pandas

The official Pandas documentation warns that iteration is a slow process. If you're iterating over a DataFrame to modify the data, vectorization would be a quicker alternative. Also, it's discouraged to modify data while iterating over rows, as Pandas sometimes returns a copy of the row rather than a reference to it, which means that not all changes will actually be applied.
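As a rough illustration of the vectorized alternative, the sketch below computes the same values once with an explicit loop and once with a single column-wide operation. The age_next_year column and the looped list are hypothetical names used only here:

```python
import pandas as pd

df = pd.DataFrame({'age': [34, 29, 37]},
                  index=['id001', 'id002', 'id003'])

# Loop version: build the new values one row at a time
looped = []
for row in df.itertuples():
    looped.append(row.age + 1)

# Vectorized version: one operation applied to the whole column
df['age_next_year'] = df['age'] + 1

print(looped)                       # [35, 30, 38]
print(list(df['age_next_year']))    # [35, 30, 38]
```

The vectorized form also assigns the result back through the DataFrame itself, so it avoids the question of whether a row inside a loop is a copy.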

For small datasets you can use the to_string() method to display all the data. For larger datasets that have many columns and rows, you can use the head(n) or tail(n) methods to print out the first or last n rows of your DataFrame (the default value for n is 5).
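A quick sketch of head() and tail() on a made-up 100-row frame (the big name and the column n are illustrative):

```python
import pandas as pd

# Hypothetical larger frame: 100 rows holding the numbers 0..99
big = pd.DataFrame({'n': range(100)})

first_rows = big.head(3)    # first three rows
last_rows = big.tail(3)     # last three rows
default_rows = big.head()   # n defaults to 5

print(first_rows)
print(last_rows)
print(len(default_rows))    # 5
```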

Speed Comparison

To measure the speed of each method, we wrapped them in functions that executed each one 1000 times and returned the average execution time.

To test these methods, we will use both the print() function and the list.append() method to provide better comparison data and to cover common use cases. To decide on a fair winner, we will iterate over the DataFrame and print or append only one value per loop.
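The original timing harness is not shown, so the following is only a sketch of how such a measurement could be set up for the append() case with the standard timeit module. The function names and the average_time helper are assumptions made for illustration, and the run count is lowered to 100 to keep the example quick:

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
    'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
    'age': [34, 29, 37, 52, 26, 32]},
    index=['id001', 'id002', 'id003', 'id004', 'id005', 'id006'])


def append_with_items():
    result = []
    for col_name, data in df.items():
        result.append(col_name)      # one value per loop
    return result


def append_with_iterrows():
    result = []
    for i, row in df.iterrows():
        result.append(i)             # one value per loop
    return result


def append_with_itertuples():
    result = []
    for row in df.itertuples():
        result.append(row.Index)     # one value per loop
    return result


def average_time(func, runs=1000):
    """Hypothetical helper: mean time of one call to func, in seconds."""
    return timeit.timeit(func, number=runs) / runs


timings = {
    func.__name__: average_time(func, runs=100)
    for func in (append_with_items, append_with_iterrows, append_with_itertuples)
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

Note that items() loops once per column while the other two loop once per row, so the three functions do different amounts of work on the same DataFrame.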

Here's what the return values look like for each method.

The items() method cycles column by column:

('first_name', 
id001        John
id002        Jane
id003       Marry
id004    Victoria
id005     Gabriel
id006       Layla
Name: first_name, dtype: object)

iterrows() would provide all column data for a particular row:

('id001', 
first_name     John
last_name     Smith
age              34
Name: id001, dtype: object)

And finally, a single row from itertuples() would look like this:

Pandas(Index='id001', first_name='John', last_name='Smith', age=34)

Here are the average results in seconds:

Method        Time (s)               Test Function
items()       1.349279541666571      print()
iterrows()    3.4104003086661883     print()
itertuples()  0.41232967500279       print()

Method        Time (s)               Test Function
items()       0.006637570998767235   append()
iterrows()    0.5749766406661365     append()
itertuples()  0.3058610513350383     append()

Printing values generally takes more time and resources than appending, and our examples are no exception. While itertuples() performs best when combined with print(), items() dramatically outperforms the others when used with append(), and iterrows() comes last in each comparison.

Please note that these test results highly depend on other factors like OS, environment, computational resources, etc. The size of your data will also have an impact on your results.

Conclusion

We've learned how to iterate over a DataFrame with three different Pandas methods: items(), iterrows(), and itertuples(). Depending on your data and preferences, you can use any of them in your projects.


