How to Remove Duplicates in a pandas DataFrame
In this tutorial, you will learn how to remove rows from a DataFrame when (1) entire rows are duplicated, and when (2) a column contains duplicate values.
TLDR solution
# remove row when entire row is duplicate
df.drop_duplicates()
# remove row when column has a duplicate value
df.drop_duplicates(subset=["column"])
Remove Duplicate Rows in a DataFrame
Suppose you have the following DataFrame of fish population counts:
import pandas as pd
data = {
'fish': ['salmon', 'pufferfish', 'pufferfish', 'shark', 'pufferfish',],
'date_counted': ['2019-06-28', '2019-06-29', '2019-06-29', '2019-06-30', '2019-08-15'],
'count': [100, 10, 10, 1, 5],
}
df = pd.DataFrame(data)
print(df)
fish date_counted count
0 salmon 2019-06-28 100
1 pufferfish 2019-06-29 10
2 pufferfish 2019-06-29 10
3 shark 2019-06-30 1
4 pufferfish 2019-08-15 5
Note that the third row (index 2) is a duplicate of the second row (index 1).
To remove the duplicate, use the drop_duplicates method:
df = df.drop_duplicates()
print(df)
fish date_counted count
0 salmon 2019-06-28 100
1 pufferfish 2019-06-29 10
3 shark 2019-06-30 1
4 pufferfish 2019-08-15 5
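If you want to see which rows would be removed before dropping them, the duplicated method returns a boolean mask. The sketch below rebuilds the same DataFrame so that it runs on its own; the variable names is_duplicate and deduped_last are only illustrative. It also shows the keep parameter, which controls which occurrence survives (by default, the first one).

```python
import pandas as pd

df = pd.DataFrame({
    "fish": ["salmon", "pufferfish", "pufferfish", "shark", "pufferfish"],
    "date_counted": ["2019-06-28", "2019-06-29", "2019-06-29", "2019-06-30", "2019-08-15"],
    "count": [100, 10, 10, 1, 5],
})

# True for every row that repeats an earlier row; the first occurrence stays False
is_duplicate = df.duplicated()
print(df[is_duplicate])  # shows only the row at index 2

# keep="last" retains the last occurrence of each duplicated row instead of the first
deduped_last = df.drop_duplicates(keep="last")
print(deduped_last.index.tolist())  # [0, 2, 3, 4]
```

With keep="last", the row at index 1 is dropped and the row at index 2 is kept, which is the reverse of the default behavior.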
Remove Row When the Column Value is a Duplicate
Let's say you only want to keep the most recent count per species.
Step 1: Sort DataFrame by Column
First, sort the DataFrame by date_counted in descending order, so that more recent dates appear first:
import pandas as pd
data = {
'fish': ['salmon', 'pufferfish', 'pufferfish', 'shark', 'pufferfish',],
'date_counted': ['2019-06-28', '2019-06-29', '2019-06-29', '2019-06-30', '2019-08-15'],
'count': [100, 10, 10, 1, 5],
}
df = pd.DataFrame(data)
df = df.sort_values(by='date_counted', ascending=False)
print(df)
fish date_counted count
4 pufferfish 2019-08-15 5
3 shark 2019-06-30 1
1 pufferfish 2019-06-29 10
2 pufferfish 2019-06-29 10
0 salmon 2019-06-28 100
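Sorting the date_counted column as text works here because the dates are written as YYYY-MM-DD, where alphabetical order matches chronological order. For dates stored in other formats that would not hold, so a safer variant is to convert the column first. The following is a minimal sketch, assuming the strings can be parsed by pd.to_datetime:

```python
import pandas as pd

df = pd.DataFrame({
    "fish": ["salmon", "pufferfish", "pufferfish", "shark", "pufferfish"],
    "date_counted": ["2019-06-28", "2019-06-29", "2019-06-29", "2019-06-30", "2019-08-15"],
    "count": [100, 10, 10, 1, 5],
})

# Parse the strings into datetime values so the sort is chronological
df["date_counted"] = pd.to_datetime(df["date_counted"])

# Most recent dates first
df = df.sort_values(by="date_counted", ascending=False)
print(df)
```

The row order matches the output above, except that the two identical pufferfish rows from 2019-06-29 may appear in either order.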
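Step 2: Drop Duplicate Values in the Column
Next, use the drop_duplicates method, but this time set the subset parameter. Because drop_duplicates keeps the first occurrence by default, the most recent row for each fish is the one that remains: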
df = df.drop_duplicates(subset=['fish'])
print(df)
fish date_counted count
4 pufferfish 2019-08-15 5
3 shark 2019-06-30 1
0 salmon 2019-06-28 100
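An alternative, sketched here as one possible variation rather than the only way to do it, is to sort in ascending order and pass keep="last", so that the final (most recent) row for each fish is retained. The variable name latest is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "fish": ["salmon", "pufferfish", "pufferfish", "shark", "pufferfish"],
    "date_counted": ["2019-06-28", "2019-06-29", "2019-06-29", "2019-06-30", "2019-08-15"],
    "count": [100, 10, 10, 1, 5],
})

# Oldest dates first (the default sort order is ascending)
df = df.sort_values(by="date_counted")

# For each fish, keep only the last row, which is the most recent count
latest = df.drop_duplicates(subset=["fish"], keep="last")
print(latest)
```

This keeps the same three rows (index 0, 3, and 4) as the descending sort, only listed from oldest to newest.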
That's it! You just learned how to remove duplicates in a DataFrame.