Need to remove duplicates from a Pandas DataFrame?
If so, you can apply the following syntax to remove duplicates from your DataFrame:
df.drop_duplicates()
In the next section, you’ll see the steps to apply this syntax in practice.
Steps to Remove Duplicates from Pandas DataFrame
Step 1: Gather the data that contains the duplicates
First, gather the data that contains the duplicates.
For example, let’s say that you have the following data about boxes, where each box may have a different color or shape:
Color | Shape |
Green | Rectangle |
Green | Rectangle |
Green | Square |
Blue | Rectangle |
Blue | Square |
Red | Square |
Red | Square |
Red | Rectangle |
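As you can see, values repeat under both columns, and two rows are exact duplicates of an earlier row: the second Green/Rectangle row and the second Red/Square row.
Before you remove those duplicates, you’ll need to create a Pandas DataFrame to capture that data in Python.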
Step 2: Create Pandas DataFrame
Next, create the DataFrame using this code:
import pandas as pd

boxes = {
    'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
    'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle', 'Square', 'Square', 'Square', 'Rectangle'],
}

df = pd.DataFrame(boxes, columns=['Color', 'Shape'])
print(df)
Once you run the code in Python, you’ll get the same values as in step 1:
   Color      Shape
0  Green  Rectangle
1  Green  Rectangle
2  Green     Square
3   Blue  Rectangle
4   Blue     Square
5    Red     Square
6    Red     Square
7    Red  Rectangle
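Before removing anything, it can help to confirm which rows pandas treats as duplicates. The following is a minimal sketch using DataFrame.duplicated() on the same boxes data; the variable name duplicate_mask is only illustrative.

```python
import pandas as pd

boxes = {
    'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
    'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle', 'Square', 'Square', 'Square', 'Rectangle'],
}
df = pd.DataFrame(boxes)

# True marks a row that repeats an earlier row across all columns
duplicate_mask = df.duplicated()
print(duplicate_mask)

# Count of rows that a plain drop_duplicates() call would remove
print(duplicate_mask.sum())
```

With this data, the mask is True for index 1 and index 6 only, so two rows would be dropped.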
Step 3: Remove duplicates from Pandas DataFrame
To remove duplicates from the DataFrame, use the syntax that you saw at the beginning of this guide. By default, it returns a new DataFrame and leaves the original unchanged:
df.drop_duplicates()
Let’s say that you want to remove the duplicates across the two columns of Color and Shape.
In that case, apply the code below in order to remove those duplicates:
import pandas as pd

boxes = {
    'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
    'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle', 'Square', 'Square', 'Square', 'Rectangle'],
}

df = pd.DataFrame(boxes, columns=['Color', 'Shape'])

# Keep the first occurrence of each Color/Shape combination
df_duplicates_removed = df.drop_duplicates()
print(df_duplicates_removed)
As you can see, only the distinct combinations of Color and Shape remain. The rows at index 1 and 6 were dropped, and the remaining rows keep their original index labels:
   Color      Shape
0  Green  Rectangle
2  Green     Square
3   Blue  Rectangle
4   Blue     Square
5    Red     Square
7    Red  Rectangle
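The output above keeps the original index labels (0, 2, 3, 4, 5, 7). If a clean 0-based index is preferred, one option is to chain reset_index(drop=True), as in this minimal sketch; the name df_unique is only illustrative.

```python
import pandas as pd

boxes = {
    'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
    'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle', 'Square', 'Square', 'Square', 'Rectangle'],
}
df = pd.DataFrame(boxes)

# Drop repeated rows, then renumber the remaining rows from 0
df_unique = df.drop_duplicates().reset_index(drop=True)
print(df_unique)

# The original DataFrame is untouched: 8 rows before, 6 rows after
print(len(df), len(df_unique))
```

On pandas 1.0 or later, df.drop_duplicates(ignore_index=True) should give the same result in a single call.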
But what if you want to remove the duplicates based on a specific column, such as the Color column?
In that case, you can specify the column name using the subset parameter:
df.drop_duplicates(subset=['Color'])
So the full Python code to remove the duplicates for the Color column would look like this:
import pandas as pd

boxes = {
    'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
    'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle', 'Square', 'Square', 'Square', 'Rectangle'],
}

df = pd.DataFrame(boxes, columns=['Color', 'Shape'])

# Compare rows on the Color column only
df_duplicates_removed = df.drop_duplicates(subset=['Color'])
print(df_duplicates_removed)
Here is the result. For each color, only the first row is kept, whatever its shape:
   Color      Shape
0  Green  Rectangle
3   Blue  Rectangle
5    Red     Square
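Which row survives is controlled by the keep parameter, which defaults to 'first'. The sketch below, using the same boxes data, shows two other settings; the names df_last and df_no_repeats are only illustrative.

```python
import pandas as pd

boxes = {
    'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
    'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle', 'Square', 'Square', 'Square', 'Rectangle'],
}
df = pd.DataFrame(boxes)

# keep='last' retains the last row seen for each color instead of the first
df_last = df.drop_duplicates(subset=['Color'], keep='last')
print(df_last)

# keep=False drops every row that has a duplicate, keeping no copy at all
df_no_repeats = df.drop_duplicates(keep=False)
print(df_no_repeats)
```

Here df_last holds the rows at index 2, 4, and 7, while df_no_repeats holds only the rows whose Color/Shape combination appears exactly once (index 2, 3, 4, and 7).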
You may want to check the Pandas documentation for drop_duplicates to learn more about removing duplicates from a DataFrame.