Sunday, October 31, 2021

ItsMyCode: Python ValueError: cannot reindex from a duplicate axis


In pandas, you will usually get ValueError: cannot reindex from a duplicate axis when you assign values to, reindex, or resample a DataFrame whose index contains duplicate values, for example with the reindex method.

The error message “cannot reindex from a duplicate axis” means that the pandas DataFrame has duplicate index values. Operations such as concatenating, reindexing, or resampling a DataFrame whose index has duplicate values cannot match rows to labels unambiguously, so pandas raises a ValueError.
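A minimal sketch that reproduces the error may help. The DataFrame and its labels are made up for illustration, and the exact wording of the message can differ between pandas versions.

```python
import pandas as pd

# Hypothetical data: the label "a" appears twice in the index.
df = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "a", "b"])

try:
    # Reindexing needs one row per label, so "a" is ambiguous here.
    df.reindex(["a", "b", "c"])
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")
```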

Verify if your DataFrame Index contains Duplicate values

When you get this error, the first thing you need to do is to check the DataFrame index for duplicate values using the below code.

df.index.is_unique

index.is_unique is a property, not a method, so it is accessed without parentheses. It returns True if every index value is unique and False otherwise.
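As a quick illustration, the sketch below checks two small, made-up DataFrames. It also uses index.has_duplicates, which is simply the opposite of is_unique.

```python
import pandas as pd

# Hypothetical frames: one with unique labels, one with a repeated label.
unique_df = pd.DataFrame({"value": [1, 2]}, index=["a", "b"])
dup_df = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "a", "b"])

print(unique_df.index.is_unique)    # True
print(dup_df.index.is_unique)       # False
print(dup_df.index.has_duplicates)  # True
```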

Test which values in an index are duplicates

If you want to check which values in an index are duplicates, you can use the index.duplicated method as shown below.

df.index.duplicated()

The method returns an array of boolean values. By default (keep='first'), the first occurrence of each value is marked False and every later repeat is marked True.

import pandas as pd

idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama'])
idx.duplicated()

Output

array([False, False,  True, False,  True])
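The boolean mask alone does not name the offending labels. One possible way to list them, sketched here with the same example index, is to pass keep=False so that every occurrence is flagged and then select the labels with that mask.

```python
import pandas as pd

idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama'])

# keep=False marks every occurrence of a repeated label, including the first.
mask_all = idx.duplicated(keep=False)

# Labels that occur more than once, each listed a single time.
repeated_labels = idx[mask_all].unique()
print(list(repeated_labels))

# Number of times each label occurs in the index.
counts = idx.value_counts()
print(counts['lama'])
```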

Drop rows with duplicate index values

Using the same index.duplicated method, we can remove the rows whose index value is a duplicate with the following code.

The mask marks every repeated index value after its first occurrence, and negating it with ~ selects only the first row for each index value. The result is a new DataFrame with a unique index; the original is left unchanged.

df = df.loc[~df.index.duplicated(), :]
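Which row survives depends on the keep argument of index.duplicated. The sketch below uses made-up data to compare keeping the first and the last occurrence, and then shows that reindex works once the index is unique.

```python
import pandas as pd

# Hypothetical data: the label "a" appears at positions 0 and 2.
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "a", "c"])

# Keep the first row seen for each label.
first_kept = df.loc[~df.index.duplicated(keep="first")]

# Keep the last row seen for each label.
last_kept = df.loc[~df.index.duplicated(keep="last")]

# With a unique index, reindex no longer raises; "d" is new, so it gets NaN.
reindexed = first_kept.reindex(["a", "b", "c", "d"])
```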

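A related method is df.drop_duplicates(). Note that it compares column values and ignores the index, so it removes duplicate rows rather than duplicate index values. It is useful when the duplicated index comes from duplicated data, or when the index values have first been moved into a column.

Consider a dataset containing ramen ratings.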

>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
    brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, it removes duplicate rows based on all columns.

>>> df.drop_duplicates()
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

To remove duplicates on specific column(s), use subset.

>>> df.drop_duplicates(subset=['brand'])
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5

To remove duplicates and keep last occurrences, use keep.

>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
    brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0
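Because drop_duplicates ignores the index, it will not by itself fix a duplicated index. One possible workaround, sketched below with a made-up DataFrame indexed by brand, is to move the index into a column, deduplicate on that column, and then restore it.

```python
import pandas as pd

# Hypothetical data: "Indomie" appears twice in the index.
df = pd.DataFrame(
    {"rating": [4.0, 3.5, 5.0]},
    index=pd.Index(["Yum Yum", "Indomie", "Indomie"], name="brand"),
)

# All ratings differ, so drop_duplicates() keeps every row
# and the duplicated "Indomie" label is still there.
still_duplicated = df.drop_duplicates()

# Move the index into a column, deduplicate on it, then restore it.
deduped = (
    df.reset_index()
      .drop_duplicates(subset="brand", keep="first")
      .set_index("brand")
)
```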

Prevent duplicate values in a DataFrame index

If you want to ensure that a pandas DataFrame never has duplicate values in its index, you can set a flag (available in pandas 1.2 and later). Setting the allows_duplicate_labels flag to False makes pandas reject duplicate index values.

df.flags.allows_duplicate_labels = False

Applying this flag to a DataFrame with duplicate values or assigning duplicate values will result in DuplicateLabelError: Index has duplicates.
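A short sketch of the first case, applying the flag to a DataFrame that already has duplicates. The data is made up, and the example assumes pandas 1.2 or later, where pd.errors.DuplicateLabelError is available.

```python
import pandas as pd

# Hypothetical data with a repeated label "a".
df = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "a", "b"])

try:
    df.flags.allows_duplicate_labels = False
    error_raised = False
except pd.errors.DuplicateLabelError as exc:
    error_raised = True
    print(exc)

# On a frame with unique labels the flag is accepted.
# set_flags returns a new DataFrame instead of modifying in place.
clean = pd.DataFrame({"value": [1, 2]}, index=["a", "b"]).set_flags(
    allows_duplicate_labels=False
)
print(clean.flags.allows_duplicate_labels)  # False
```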

Overwrite DataFrame index with a new one

Alternatively, to overwrite your current DataFrame index with a new one (new_index must have the same length as the DataFrame):

df.index = new_index

or, use .reset_index:

df.reset_index(level=0, inplace=True)

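With inplace=True, reset_index modifies df and returns None. Remove inplace=True if you want it to return a new DataFrame instead. In both cases the old index is kept as a column unless you also pass drop=True.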
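The difference between keeping and discarding the old index can be seen in a small sketch. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical data with a repeated label "a".
df = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "a", "b"])

# Default: the old labels are kept in a new column named "index".
with_old_labels = df.reset_index()

# drop=True discards the old labels entirely.
fresh = df.reset_index(drop=True)

print(with_old_labels.columns.tolist())
print(fresh.index.tolist())
```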

The post Python ValueError: cannot reindex from a duplicate axis appeared first on ItsMyCode.



