How to Count Duplicates in Pandas DataFrame

You can count duplicates in pandas DataFrame by using this method:

df.pivot_table(index=['DataFrame Column'], aggfunc='size')

Next, I’ll review the following 3 cases to demonstrate how to count duplicates in pandas DataFrame:

(1) under a single column
(2) across multiple columns
(3) when having NaN values in the DataFrame

3 Cases of Counting Duplicates in Pandas DataFrame

Case 1: count duplicates under a single DataFrame column

Let’s start with a simple case, where you have the following data about boxes:

ColorShape
GreenRectangle
GreenRectangle
GreenSquare
BlueRectangle
BlueSquare
RedSquare
RedSquare
RedRectangle

You can capture this data in Python using pandas DataFrame:

from pandas import DataFrame

Boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle']
        }

df = DataFrame(Boxes, columns= ['Color','Shape'])
print (df)

You’ll then get this DataFrame:

How to Count Duplicates in Pandas DataFrame

You may observe the duplicate values under both the Color and Shape columns.

You can then count the duplicates under each column using the method introduced at the beginning of this guide:

df.pivot_table(index=['DataFrame Column'], aggfunc='size')

So this is the complete Python code to get the count of duplicates for the Color column:

from pandas import DataFrame

Boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle']
        }

df = DataFrame(Boxes, columns= ['Color','Shape'])

dups_color = df.pivot_table(index=['Color'], aggfunc='size')
print (dups_color)

And this is the result:

Count Duplicates in Pandas DataFrame

Alternatively, you can get the count of duplicates for the Shape column using this code:

from pandas import DataFrame

Boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle']
        }

df = DataFrame(Boxes, columns= ['Color','Shape'])

dups_shape = df.pivot_table(index=['Shape'], aggfunc='size')
print (dups_shape)

You’ll then get 4 duplicates for each shape:

How to Count Duplicates in Pandas DataFrame

Case 2: count duplicates across multiple columns

What if you want to count duplicates across multiple columns?

For example, what if you want to count the duplicates values across both the Color and Shape columns?

If that’s the case, you can simply add all the columns needed like this:

index=[‘Color’,’Shape’]

So your full Python code, would look as follows:

from pandas import DataFrame

Boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'],
         'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle']
        }

df = DataFrame(Boxes, columns= ['Color','Shape'])

dups_color_and_shape = df.pivot_table(index=['Color','Shape'], aggfunc='size')
print (dups_color_and_shape)

Run the code, and you’ll get the count of duplicates across both the Color and Shape columns:

Count Duplicates in Pandas DataFrame

Case 3: count duplicates when having NaN values in the DataFrame

For the third case, let’s use this dataset:

values
700
ABC300
700
900XYZ
800
700
800

The DataFrame in Python would then look like this:

import pandas as pd

df = pd.DataFrame({'values': ['700','ABC300','700','900XYZ','800','700','800']})
print (df)

Dataframe in Python

You may have noticed that currently there are no NaN values within the DataFrame.

Well, not for long…

If you look closely, you’ll see that some of the values in the DataFrame contain text (i.e., ABC300 and 900XYZ), while other values are purely numeric (i.e., 700 and 800).

You can then use to_numeric in order to convert the values in our data-set into a float format. But since two of those values contain text, you’ll get ‘NaN’ for those two values using this code:

import pandas as pd

df = pd.DataFrame({'values': ['700','ABC300','700','900XYZ','800','700','800']})

df['values'] = pd.to_numeric(df['values'], errors='coerce')
print(df)

Run the code and you’ll now see those NaN values:

NaN values

You can then apply the same method to count the duplicates

import pandas as pd

df = pd.DataFrame({'values': ['700','ABC300','700','900XYZ','800','700','800']})

df['values'] = pd.to_numeric(df['values'], errors='coerce')

dups_values = df.pivot_table(index=['values'], aggfunc='size')
print (dups_values)

This is the result:

How to Count Duplicates in Pandas

To get a count for the NaN values as well, you can replace those NaN values with any value of your choosing and then perform the count.

For example, I chose to replace the NaN values with the expression of ‘NULL’ using fillna:

import pandas as pd

df = pd.DataFrame({'values': ['700','ABC300','700','900XYZ','800','700','800']})

df['values'] = pd.to_numeric(df['values'], errors='coerce')
df['values'] = df['values'].fillna('NULL')

dups_values = df.pivot_table(index=['values'], aggfunc='size')
print (dups_values)

You’ll then get a count of 2 for the NULL entries, which really represent the 2 NaN values:

How to Count Duplicates in DataFrame