How to Remove Duplicates in a pandas DataFrame

In this tutorial, you will learn how to remove rows in a DataFrame when (1) there are duplicate rows, and when (2) there are duplicate value in a column.

TLDR solution

# remove row when entire row is duplicate
df.drop_duplicates()

# remove row when column has a duplicate value
df.drop_duplicates(subset=["column"])

Remove Duplicate Rows in a DataFrame

Suppose, you have the following DataFrame on fish population counts:

import pandas as pd

data = {
  'fish': ['salmon', 'pufferfish', 'pufferfish', 'shark', 'pufferfish',],
  'date': ['2019-06-28', '2019-06-29', '2019-06-29', '2019-06-30', '2019-08-15'],
  'count': [100, 10, 10, 1, 5],
}   

df = pd.DataFrame(data)

print(df)

         fish date_counted  count
0      salmon   2019-06-28    100
1  pufferfish   2019-06-29     10
2  pufferfish   2019-06-29     10
3       shark   2019-06-30      1
4  pufferfish   2019-08-15      5

Note that the third row (index 2) is a duplicate of the second row (index 1). To remove the duplicate, use the drop_duplicates method:

df = df.drop_duplicates

print(df)

         fish date_counted  count
0      salmon   2019-06-28    100
1  pufferfish   2019-06-29     10
3       shark   2019-06-30      1
4  pufferfish   2019-08-15      5

Remove Row When the Column Value is a Duplicate

Let's say, you only want to keep the most recent count per species.

Step 1: Sort DataFrame by Column

You first have sort the DataFrame by date_counted in descending order, so that more recent dates appear first:

import pandas as pd

data = {'fish': ['salmon', 'pufferfish', 'pufferfish', 'shark', 'pufferfish',],
        'date_counted': ['2019-06-28', '2019-06-29', '2019-06-29', '2019-06-30', '2019-08-15'],
        'count': [100, 10, 10, 1, 5],
        }   

df = pd.DataFrame(data)

df = df.sort_values(by='date_counted', ascending=False)

print(df)

         fish date_counted  count
4  pufferfish   2019-08-15      5
3       shark   2019-06-30      1
1  pufferfish   2019-06-29     10
2  pufferfish   2019-06-29     10
0      salmon   2019-06-28    100

Next, use the drop_duplicate method, but, this time, set the subset option:

df = df.drop_duplicates(subset=['fish'])
print(df)

         fish date_counted  count
4  pufferfish   2019-08-15      5
3       shark   2019-06-30      1
0      salmon   2019-06-28    100

That's it! You just learned how to remove duplicates in a DataFrame.