How to Remove Duplicates in Pandas DataFrame

To remove duplicates across the entire DataFrame:

df.drop_duplicates()

To remove duplicates based on the values in a single DataFrame column:

df.drop_duplicates(subset=["column_name"])
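Note that drop_duplicates() returns a new DataFrame and, by default, leaves the original untouched, so the result needs to be assigned to a variable. Here is a minimal sketch of that behavior; the tiny DataFrame and the placeholder column "column_name" are made up for illustration only:

```python
import pandas as pd

# Hypothetical three-row example; "column_name" is a placeholder column
df = pd.DataFrame({"column_name": ["a", "a", "b"]})

# drop_duplicates() returns a new DataFrame rather than changing df
deduped = df.drop_duplicates()

print(len(df))       # the original still has all of its rows
print(len(deduped))  # the returned copy has the repeated row removed
```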

Steps to Remove Duplicates in Pandas DataFrame

Step 1: Gather the data that contains the duplicates

First, gather the data that contains the duplicates.

Here is an example of data that contains duplicates:

Color   Shape
Green   Rectangle
Green   Rectangle
Green   Square
Blue    Rectangle
Blue    Square
Red     Square
Red     Square
Red     Rectangle

Step 2: Create the Pandas DataFrame

Next, create the Pandas DataFrame using this code:

import pandas as pd

data = {"Color": ["Green", "Green", "Green", "Blue", "Blue", "Red", "Red", "Red"],
        "Shape": ["Rectangle", "Rectangle", "Square", "Rectangle", "Square", "Square", "Square", "Rectangle"]
        }

df = pd.DataFrame(data)

print(df)

The resulting DataFrame:

   Color      Shape
0  Green  Rectangle
1  Green  Rectangle
2  Green     Square
3   Blue  Rectangle
4   Blue     Square
5    Red     Square
6    Red     Square
7    Red  Rectangle
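Before removing anything, it can help to see which rows Pandas treats as duplicates. One way to do that, sketched here with the same data, is df.duplicated(), which flags every row that fully repeats an earlier row (the variable name duplicate_mask is just an illustrative choice):

```python
import pandas as pd

data = {"Color": ["Green", "Green", "Green", "Blue", "Blue", "Red", "Red", "Red"],
        "Shape": ["Rectangle", "Rectangle", "Square", "Rectangle", "Square", "Square", "Square", "Rectangle"]
        }

df = pd.DataFrame(data)

# True marks a row that repeats an earlier row across all columns
duplicate_mask = df.duplicated()

# Show only the rows that would be dropped by df.drop_duplicates()
print(df[duplicate_mask])
```

With this data, the flagged rows are the ones at index 1 and index 6.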

Step 3: Remove duplicates in Pandas DataFrame

To remove the duplicates across the entire DataFrame, use df.drop_duplicates():

import pandas as pd

data = {"Color": ["Green", "Green", "Green", "Blue", "Blue", "Red", "Red", "Red"],
        "Shape": ["Rectangle", "Rectangle", "Square", "Rectangle", "Square", "Square", "Square", "Rectangle"]
        }

df = pd.DataFrame(data)

df_no_duplicates = df.drop_duplicates()

print(df_no_duplicates)

The result after removing the duplicates (rows 1 and 6 are dropped because they repeat rows 0 and 5):

   Color      Shape
0  Green  Rectangle
2  Green     Square
3   Blue  Rectangle
4   Blue     Square
5    Red     Square
7    Red  Rectangle
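Notice that the remaining rows keep their original index labels (0, 2, 3, 4, 5, 7). If you would rather have a fresh index starting from 0, one option is the ignore_index parameter of drop_duplicates(), available in Pandas 1.0 and later. A short sketch with the same data (the name df_renumbered is only illustrative):

```python
import pandas as pd

data = {"Color": ["Green", "Green", "Green", "Blue", "Blue", "Red", "Red", "Red"],
        "Shape": ["Rectangle", "Rectangle", "Square", "Rectangle", "Square", "Square", "Square", "Rectangle"]
        }

df = pd.DataFrame(data)

# ignore_index=True relabels the remaining rows 0, 1, 2, ...
df_renumbered = df.drop_duplicates(ignore_index=True)

print(df_renumbered)
```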

Remove Duplicates Based on a Specific Column

To remove the duplicates based on the Color column, use df.drop_duplicates(subset=["Color"]):

import pandas as pd

data = {"Color": ["Green", "Green", "Green", "Blue", "Blue", "Red", "Red", "Red"],
        "Shape": ["Rectangle", "Rectangle", "Square", "Rectangle", "Square", "Square", "Square", "Rectangle"]
        }

df = pd.DataFrame(data)

df_no_duplicates = df.drop_duplicates(subset=["Color"])

print(df_no_duplicates)

The result keeps only the first row found for each color:

   Color      Shape
0  Green  Rectangle
3   Blue  Rectangle
5    Red     Square
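By default, the first occurrence of each value is the one that is kept. If the last occurrence is the one you need, the keep parameter of drop_duplicates() can be set to "last". Here is a brief sketch using the same data (the name df_last is only illustrative):

```python
import pandas as pd

data = {"Color": ["Green", "Green", "Green", "Blue", "Blue", "Red", "Red", "Red"],
        "Shape": ["Rectangle", "Rectangle", "Square", "Rectangle", "Square", "Square", "Square", "Rectangle"]
        }

df = pd.DataFrame(data)

# keep="last" retains the final row for each color instead of the first
df_last = df.drop_duplicates(subset=["Color"], keep="last")

print(df_last)
```

In this case, the rows at index 2, 4 and 7 remain, one per color.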

You may want to check the Pandas Documentation to learn more about removing duplicates from a DataFrame.
