How to Remove Duplicates from Pandas DataFrame

Looking to remove duplicates from a pandas DataFrame?

If so, you can call the drop_duplicates method on your DataFrame (here named df). It returns a new DataFrame without the duplicate rows:

df.drop_duplicates()

In the next section, I’ll show you the steps to apply this syntax in practice.

Steps to Remove Duplicates from Pandas DataFrame

Step 1: Gather the data

Firstly, you’ll need to gather the data that contains the duplicates.

For example, let’s say that you have the following data about boxes, where each box may have a different color or shape:

 

Color   Shape
Green   Rectangle
Green   Rectangle
Green   Square
Blue    Rectangle
Blue    Square
Red     Square
Red     Square
Red     Rectangle

 

Looking at the above data, you can see that some rows are repeated: the Green/Rectangle and Red/Square combinations each appear twice.

Before you remove those duplicates, you’ll need to create a pandas DataFrame to capture that data in Python.

Step 2: Create Pandas DataFrame

Next, create the pandas DataFrame by using this code:

from pandas import DataFrame

# Box data from Step 1, one list per column
Boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle']
        }
df = DataFrame(Boxes, columns=['Color', 'Shape'])

print(df)

 

Once you run the code in Python, you’ll get the same values as in Step 1:

   Color      Shape
0  Green  Rectangle
1  Green  Rectangle
2  Green     Square
3   Blue  Rectangle
4   Blue     Square
5    Red     Square
6    Red     Square
7    Red  Rectangle
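
Before removing anything, it can help to check which rows pandas actually treats as duplicates. The following is a minimal sketch, assuming the same Boxes data as above; it uses the duplicated method, and the names duplicate_mask and duplicate_count are illustrative only:

```python
from pandas import DataFrame

# Same box data as in Step 2
Boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle']
        }
df = DataFrame(Boxes, columns=['Color', 'Shape'])

# True for each row that repeats an earlier row
# (the first occurrence itself is not flagged)
duplicate_mask = df.duplicated()

# Number of rows that are repeats of an earlier row
duplicate_count = int(duplicate_mask.sum())

print(df[duplicate_mask])   # the repeated rows only
print(duplicate_count)
```

With this data, the rows at index 1 and 6 are flagged, since they repeat the rows at index 0 and 5.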

Step 3: Remove duplicates from Pandas DataFrame

To remove duplicates from a pandas DataFrame, you can use the drop_duplicates method introduced at the beginning of this tutorial:

df.drop_duplicates()

 

Let’s say that you want to remove the duplicate values across the two columns of Color and Shape.

If that’s the case, then you may use this code in Python to remove the duplicates:

 

from pandas import DataFrame

Boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle']
        }
df = DataFrame(Boxes, columns=['Color', 'Shape'])

# Returns a new DataFrame; df itself is left unchanged
df_duplicates_removed = df.drop_duplicates()
print(df_duplicates_removed)

 

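As you can see, only the distinct combinations of Color and Shape remain. The repeated rows at index 1 and 6 were dropped, and the remaining rows keep their original index labels:

   Color      Shape
0  Green  Rectangle
2  Green     Square
3   Blue  Rectangle
4   Blue     Square
5    Red     Square
7    Red  Rectangle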
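
By default, drop_duplicates keeps the first occurrence of each repeated row. The sketch below, which assumes the same Boxes data, illustrates the keep parameter and one way to renumber the rows afterwards; the variable names are illustrative only:

```python
from pandas import DataFrame

Boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle']
        }
df = DataFrame(Boxes, columns=['Color', 'Shape'])

# keep='first' is the default: the first occurrence of each row survives
first_kept = df.drop_duplicates(keep='first')

# keep='last' keeps the last occurrence instead
last_kept = df.drop_duplicates(keep='last')

# keep=False drops every row that has a duplicate anywhere in the data
no_repeats = df.drop_duplicates(keep=False)

# Renumber the remaining rows from 0 after dropping
renumbered = first_kept.reset_index(drop=True)

print(first_kept)
print(last_kept)
print(no_repeats)
print(renumbered)
```

Which option fits depends on whether the repeated rows are redundant copies (keep one) or a sign of bad data (keep=False removes them entirely).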

 

But what if you want to remove the duplicates under a single column?

For example, what if you want to remove the duplicates under the Color column only?

In that case, one option is to keep only the Color column when creating the DataFrame:

df = DataFrame(Boxes, columns=['Color'])

Note that the resulting DataFrame no longer contains the Shape column. The full Python code to remove duplicates under the Color column would then look like this:

 

from pandas import DataFrame

Boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle']
        }
# Only the Color column is loaded; Shape is ignored
df = DataFrame(Boxes, columns=['Color'])

df_duplicates_removed = df.drop_duplicates()
print(df_duplicates_removed)

 

As you can see, only the distinct values under the Color column remain:

   Color
0  Green
3   Blue
5    Red
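
If you would rather keep the Shape column while still comparing rows on Color alone, the subset parameter of drop_duplicates is an alternative. This is a minimal sketch assuming the same Boxes data; the name one_per_color is illustrative only:

```python
from pandas import DataFrame

Boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle']
        }
df = DataFrame(Boxes, columns=['Color', 'Shape'])

# Compare rows on Color only, but keep every column in the result.
# The first row seen for each color is the one that survives.
one_per_color = df.drop_duplicates(subset=['Color'])

print(one_per_color)
```

Here the result keeps one row per color (index 0, 3 and 5) together with the Shape value from that row.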

 

That’s it! You may want to check the pandas documentation to learn more about removing duplicates from a DataFrame.