ItsMyCode |
In Python, you will usually get ValueError: cannot reindex from a duplicate axis when you set an index to a specific value, or when you reindex or resample a DataFrame, for example with the reindex method.
The error message "cannot reindex from a duplicate axis" means that the pandas DataFrame has duplicate index values. Operations such as concatenating, reindexing, or resampling a DataFrame whose index has duplicate values therefore fail, and pandas raises a ValueError.
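A minimal sketch that reproduces the error, assuming a small hypothetical DataFrame in which the index label 'a' appears twice. The exact wording of the message differs between pandas versions, so the example only relies on the exception type.

```python
import pandas as pd

# Hypothetical data: the index label 'a' appears twice.
df = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "b", "a"])

try:
    # Reindexing has to map each requested label to exactly one existing row,
    # which is ambiguous for the duplicated label 'a'.
    df.reindex(["a", "b", "c"])
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__, error)
```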
Verify if your DataFrame Index contains Duplicate values
When you get this error, the first thing you need to do is to check the DataFrame index for duplicate values using the below code.
df.index.is_unique
index.is_unique is a property, not a method, and it returns a boolean value: True if every value in the index is unique, otherwise False.
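As a quick illustration, here is the check on two hypothetical DataFrames, one with a unique index and one with a repeated label.

```python
import pandas as pd

# Hypothetical frames: same data, different index labels.
unique_df = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "b", "c"])
dup_df = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "b", "a"])

print(unique_df.index.is_unique)  # True
print(dup_df.index.is_unique)     # False
```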
Test which values in an index are duplicated
If you want to check which values in an index have duplicates, you can use the index.duplicated method as shown below.
df.index.duplicated()
The method returns an array of boolean values. By default, the first occurrence of each value is marked False and every later repeat is marked True.
import pandas as pd
idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama'])
idx.duplicated()
Output
array([False, False, True, False, True])
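The boolean array can also be used as a mask to list the offending labels themselves. The sketch below reuses the same index. The keep=False argument is an addition here, not part of the original example; it flags every occurrence of a repeated label rather than only the later ones.

```python
import pandas as pd

idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama'])

# Default keep='first': only the repeats after the first occurrence are flagged.
repeats = idx[idx.duplicated()]
# keep=False flags every occurrence of a label that appears more than once.
all_dupes = idx[idx.duplicated(keep=False)]

print(repeats.unique().tolist())  # ['lama']
print(all_dupes.tolist())         # ['lama', 'lama', 'lama']
```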
Drop rows with duplicate index values
Using the same index.duplicated method, we can remove the rows with duplicate index values using the following code.
It keeps the first row for each index value and drops the later repeats, so only unique index values remain. The expression returns a new DataFrame, so assign the result back to df if you want to keep it.
df.loc[~df.index.duplicated(), :]
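A small worked example of this filter, assuming a hypothetical DataFrame with the label 'a' repeated. The keep="last" variant is shown only for comparison and keeps the final row for each label instead of the first.

```python
import pandas as pd

# Hypothetical data: the index label 'a' appears twice.
df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "a"])

# Keep the first row for each index label (the default behaviour).
first = df.loc[~df.index.duplicated(), :]
# Keep the last row for each index label instead.
last = df.loc[~df.index.duplicated(keep="last"), :]

print(first["value"].tolist())  # [10, 20]
print(last["value"].tolist())   # [20, 30]
print(first.index.is_unique)    # True
```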
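Alternatively, you can use the df.drop_duplicates() method as shown below. Note that it compares column values rather than index values, so it only fixes the index when the rows sharing an index value are also identical in their columns.
Consider a dataset containing ramen ratings.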
>>> df = pd.DataFrame({
... 'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
... 'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
brand style rating
0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
By default, it removes duplicate rows based on all columns.
>>> df.drop_duplicates()
brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
To remove duplicates on specific column(s), use subset.
>>> df.drop_duplicates(subset=['brand'])
brand style rating
0 Yum Yum cup 4.0
2 Indomie cup 3.5
To remove duplicates and keep last occurrences, use keep.
>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
brand style rating
1 Yum Yum cup 4.0
2 Indomie cup 3.5
4 Indomie pack 5.0
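One caveat is worth checking before relying on drop_duplicates for this error: it compares column values and does not look at the index. The sketch below uses a hypothetical two-row frame whose rows share the index label 0 but differ in content, and shows that the duplicate label survives.

```python
import pandas as pd

# Hypothetical frame: index label 0 is repeated, but the two rows differ.
df = pd.DataFrame(
    {"brand": ["Yum Yum", "Indomie"], "rating": [4.0, 3.5]},
    index=[0, 0],
)

deduped = df.drop_duplicates()

print(len(deduped))             # 2, no row was dropped
print(deduped.index.is_unique)  # False, the index still has duplicates
```

In that situation, filtering on df.index.duplicated() is the more direct fix.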
Prevent duplicate values in a DataFrame index
If you want to ensure that a pandas DataFrame has no duplicate values in its index, you can set a flag (available in pandas 1.2 and later). Setting the allows_duplicate_labels flag to False will prevent the assignment of duplicate values.
df.flags.allows_duplicate_labels = False
Applying this flag to a DataFrame with duplicate values or assigning duplicate values will result in DuplicateLabelError: Index has duplicates.
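A short sketch of the first case, applying the flag to a hypothetical DataFrame that already has a repeated label, next to one with a unique index that accepts it. The exception class is looked up as pd.errors.DuplicateLabelError.

```python
import pandas as pd

# Hypothetical data: the index label 'a' appears twice.
df = pd.DataFrame({"value": [1, 2, 3]}, index=["a", "b", "a"])

try:
    # The index already contains duplicates, so pandas rejects the flag.
    df.flags.allows_duplicate_labels = False
    raised = None
except pd.errors.DuplicateLabelError as exc:
    raised = exc

print(type(raised).__name__)  # DuplicateLabelError

# A DataFrame with a unique index accepts the flag.
clean = pd.DataFrame({"value": [1, 2]}, index=["a", "b"])
clean.flags.allows_duplicate_labels = False
print(clean.flags.allows_duplicate_labels)  # False
```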
Overwrite DataFrame index with a new one
Alternatively, to overwrite your current DataFrame index with a new one (new_index must contain one label per row):
df.index = new_index
Or use .reset_index, which moves the current index into a column and replaces it with a default integer index:
df.reset_index(level=0, inplace=True)
Remove inplace=True if you want it to return a new DataFrame instead of modifying df in place.
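To make the effect of reset_index concrete, here is a sketch on a hypothetical DataFrame with a repeated label. The drop=True argument is an addition not used in the original snippet: it discards the old labels instead of keeping them as a column.

```python
import pandas as pd

# Hypothetical data: the index label 'a' appears twice.
df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "a"])

# Without drop=True the old labels are kept as a new column named 'index'.
kept = df.reset_index()
# With drop=True the old labels are discarded entirely.
fresh = df.reset_index(drop=True)

print(kept.columns.tolist())  # ['index', 'value']
print(fresh.index.tolist())   # [0, 1, 2]
print(fresh.index.is_unique)  # True
```

Whether to keep the old labels depends on whether they still carry information you need after the index has been replaced.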
The post Python ValueError: cannot reindex from a duplicate axis appeared first on ItsMyCode.
from Planet Python