Count Duplicates in Pandas DataFrame

You can count duplicates in Pandas DataFrame using this approach:

df.pivot_table(columns=['DataFrame Column'], aggfunc='size')

In this short guide, you’ll see 3 cases of counting duplicates in Pandas DataFrame:

  1. Under a single column
  2. Across multiple columns
  3. When having NaN values in the DataFrame

3 Cases of Counting Duplicates in Pandas DataFrame

Case 1: count duplicates under a single DataFrame column

Let’s start with a simple case, where you have the following data:

ColorShape
GreenRectangle
GreenRectangle
GreenSquare
BlueRectangle
BlueSquare
RedSquare
RedSquare
RedRectangle

You can capture the above data in Python using Pandas DataFrame:

import pandas as pd

data = {'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
        'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle', 'Square', 'Square', 'Square', 'Rectangle']
        }

df = pd.DataFrame(data)

print(df)

You’ll then get this DataFrame:

   Color      Shape
0  Green  Rectangle
1  Green  Rectangle
2  Green     Square
3   Blue  Rectangle
4   Blue     Square
5    Red     Square
6    Red     Square
7    Red  Rectangle

You may observe the duplicates under both the Color and Shape columns.

You can then count the duplicates under each column using the method introduced at the beginning of this guide:

df.pivot_table(columns=['DataFrame Column'], aggfunc='size')

So this is the complete Python code to get the count of duplicates for the Color column:

import pandas as pd

data = {'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
        'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle', 'Square', 'Square', 'Square', 'Rectangle']
        }

df = pd.DataFrame(data)

duplicates_color = df.pivot_table(columns=['Color'], aggfunc='size')

print(duplicates_color)

And this is the result:

Blue     2
Green    3
Red      3

Alternatively, you can get the count of duplicates for the Shape column using this code:

import pandas as pd

data = {'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
        'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle', 'Square', 'Square', 'Square', 'Rectangle']
        }

df = pd.DataFrame(data)

duplicates_shape = df.pivot_table(columns=['Shape'], aggfunc='size')

print(duplicates_shape)

You’ll then get 4 duplicates for each shape:

Rectangle    4
Square       4

Case 2: count duplicates across multiple columns

What if you want to count duplicates across multiple columns?

For example, what if you want to count the duplicates across both the Color and Shape columns?

If that’s the case, you can simply add all the columns needed like this:

columns=['Color', 'Shape']

So your full Python code would look as follows:

import pandas as pd

data = {'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
        'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle', 'Square', 'Square', 'Square', 'Rectangle']
        }

df = pd.DataFrame(data)

duplicates_color_and_shape = df.pivot_table(columns=['Color', 'Shape'], aggfunc='size')

print(duplicates_color_and_shape)

Run the code, and you’ll get the count of duplicates across both the Color and Shape columns:

Blue   Rectangle    1
       Square       1
Green  Rectangle    2
       Square       1
Red    Rectangle    1
       Square       2

Case 3: count duplicates when having NaN values in the DataFrame

For the third case, let’s use this dataset that contains NaN values:

values
700
NaN
700
NaN
800
700
800

The DataFrame with the NaN values would look like this:

import pandas as pd
import numpy as np

df = pd.DataFrame({'values': [700, np.nan, 700, np.nan, 800, 700, 800]})

print(df)

Run the code, and you’ll see these NaN values:

   values
0   700.0
1     NaN
2   700.0
3     NaN
4   800.0
5   700.0
6   800.0

You can then apply the same approach to count the duplicates:

import pandas as pd
import numpy as np

df = pd.DataFrame({'values': [700, np.nan, 700, np.nan, 800, 700, 800]})

duplicates_values = df.pivot_table(columns=['values'], aggfunc='size')

print(duplicates_values)

This is the result:

700.0    3
800.0    2

To get a count of the NaN values as well, you can replace those NaN values with any value of your choosing and then perform the count.

For example, let’s replace the NaN values with the expression of ‘NULL’ using fillna:

import pandas as pd
import numpy as np

df = pd.DataFrame({'values': [700, np.nan, 700, np.nan, 800, 700, 800]})

df['values'] = df['values'].fillna('NULL')

duplicates_values = df.pivot_table(columns=['values'], aggfunc='size')

print(duplicates_values)

You’ll then get a count of 2 for the NULL entries, which really represent the 2 NaN values:

700.0    3
800.0    2
NULL     2

Leave a Comment