You can count duplicates in Pandas DataFrame using this approach:
df.pivot_table(columns=['DataFrame Column'], aggfunc='size')
In this short guide, you’ll see 3 cases of counting duplicates in Pandas DataFrame:
- Under a single column
- Across multiple columns
- When having NaN values in the DataFrame
3 Cases of Counting Duplicates in Pandas DataFrame
Case 1: count duplicates under a single DataFrame column
Let’s start with a simple case, where you have the following data about boxes:
Color | Shape |
Green | Rectangle |
Green | Rectangle |
Green | Square |
Blue | Rectangle |
Blue | Square |
Red | Square |
Red | Square |
Red | Rectangle |
You can capture the above data in Python using Pandas DataFrame:
import pandas as pd boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'], 'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle'] } df = pd.DataFrame(boxes, columns= ['Color','Shape']) print (df)
You’ll then get this DataFrame:
Color Shape
0 Green Rectangle
1 Green Rectangle
2 Green Square
3 Blue Rectangle
4 Blue Square
5 Red Square
6 Red Square
7 Red Rectangle
You may observe the duplicates under both the Color and Shape columns.
You can then count the duplicates under each column using the method introduced at the beginning of this guide:
df.pivot_table(columns=['DataFrame Column'], aggfunc='size')
So this is the complete Python code to get the count of duplicates for the Color column:
import pandas as pd boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'], 'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle'] } df = pd.DataFrame(boxes, columns= ['Color','Shape']) dups_color = df.pivot_table(columns=['Color'], aggfunc='size') print (dups_color)
And this is the result:
Blue 2
Green 3
Red 3
Alternatively, you can get the count of duplicates for the Shape column using this code:
import pandas as pd boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'], 'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle'] } df = pd.DataFrame(boxes, columns= ['Color','Shape']) dups_shape = df.pivot_table(columns=['Shape'], aggfunc='size') print (dups_shape)
You’ll then get 4 duplicates for each shape:
Rectangle 4
Square 4
Case 2: count duplicates across multiple columns
What if you want to count duplicates across multiple columns?
For example, what if you want to count the duplicates across both the Color and Shape columns?
If that’s the case, you can simply add all the columns needed like this:
columns=['Color','Shape']
So your full Python code, would look as follows:
import pandas as pd boxes = {'Color': ['Green','Green','Green','Blue','Blue','Red','Red','Red'], 'Shape': ['Rectangle','Rectangle','Square','Rectangle','Square','Square','Square','Rectangle'] } df = pd.DataFrame(boxes, columns= ['Color','Shape']) dups_color_and_shape = df.pivot_table(columns=['Color','Shape'], aggfunc='size') print (dups_color_and_shape)
Run the code, and you’ll get the count of duplicates across both the Color and Shape columns:
Blue Rectangle 1
Square 1
Green Rectangle 2
Square 1
Red Rectangle 1
Square 2
Case 3: count duplicates when having NaN values in the DataFrame
For the third case, let’s use this dataset that contains NaN values:
values |
700 |
NaN |
700 |
NaN |
800 |
700 |
800 |
The DataFrame with the NaN values would look like this:
import pandas as pd import numpy as np df = pd.DataFrame({'values': [700,np.nan,700,np.nan,800,700,800]}) print(df)
Run the code and you’ll now see those NaN values:
values
0 700.0
1 NaN
2 700.0
3 NaN
4 800.0
5 700.0
6 800.0
You can then apply the same approach to count the duplicates:
import pandas as pd import numpy as np df = pd.DataFrame({'values': [700,np.nan,700,np.nan,800,700,800]}) dups_values = df.pivot_table(columns=['values'], aggfunc='size') print (dups_values)
This is the result:
700.0 3
800.0 2
To get a count of the NaN values as well, you can replace those NaN values with any value of your choosing and then perform the count.
For example, let’s replace the NaN values with the expression of ‘NULL’ using fillna:
import pandas as pd import numpy as np df = pd.DataFrame({'values': [700,np.nan,700,np.nan,800,700,800]}) df['values'] = df['values'].fillna('NULL') dups_values = df.pivot_table(columns=['values'], aggfunc='size') print (dups_values)
You’ll then get a count of 2 for the NULL entries, which really represent the 2 NaN values:
700.0 3
800.0 2
NULL 2