Cleaning messy datasets is an essential aspect of data analysis and manipulation. Python offers powerful libraries such as Pandas and NumPy, which enable efficient cleaning and processing of data.
In this guide, you’ll see the steps to clean messy datasets utilizing Pandas and NumPy.
Steps
Step 1: Installation
If you haven’t already done so, install the Pandas package:
pip install pandas
And the NumPy package:
pip install numpy
Step 2: Creating a Messy Dataset
Now create the following DataFrame that contains messy data, such as missing values, duplicates and so on:
import pandas as pd
import numpy as np
data = {
"product": ["computer", "monitor", "desk", "chair", "chair", "keyboard"],
"price": [1000, 300, np.nan, 150, 150, 50],
"quantity": [10, 5, 8, 3, 3, np.nan],
"manufacturer": ["A", "B", "C", "D", "D", "A"],
"category": ["electronics", "electronics", "furniture", "furniture", "furniture", "electronics"],
}
df = pd.DataFrame(data)
print(df)
The result:
product price quantity manufacturer category
0 computer 1000.0 10.0 A electronics
1 monitor 300.0 5.0 B electronics
2 desk NaN 8.0 C furniture
3 chair 150.0 3.0 D furniture
4 chair 150.0 3.0 D furniture
5 keyboard 50.0 NaN A electronics
Step 3: Handling Missing Values
Next, fill the missing values in the “price” and “quantity” columns with their respective means:
import pandas as pd
import numpy as np
data = {
"product": ["computer", "monitor", "desk", "chair", "chair", "keyboard"],
"price": [1000, 300, np.nan, 150, 150, 50],
"quantity": [10, 5, 8, 3, 3, np.nan],
"manufacturer": ["A", "B", "C", "D", "D", "A"],
"category": ["electronics", "electronics", "furniture", "furniture", "furniture", "electronics"],
}
df = pd.DataFrame(data)
# Calculate the Means
mean_price = df["price"].mean()
mean_quantity = round(df["quantity"].mean())
# Fill the missing values
df["price"].fillna(mean_price, inplace=True)
df["quantity"].fillna(mean_quantity, inplace=True)
print(df)
The result:
product price quantity manufacturer category
0 computer 1000.0 10.0 A electronics
1 monitor 300.0 5.0 B electronics
2 desk 330.0 8.0 C furniture
3 chair 150.0 3.0 D furniture
4 chair 150.0 3.0 D furniture
5 keyboard 50.0 6.0 A electronics
Step 3: Remove Duplicates
To remove duplicate rows from the DataFrame (such as the duplicated “chair” row):
import pandas as pd
import numpy as np
data = {
"product": ["computer", "monitor", "desk", "chair", "chair", "keyboard"],
"price": [1000, 300, np.nan, 150, 150, 50],
"quantity": [10, 5, 8, 3, 3, np.nan],
"manufacturer": ["A", "B", "C", "D", "D", "A"],
"category": ["electronics", "electronics", "furniture", "furniture", "furniture", "electronics"],
}
df = pd.DataFrame(data)
# Calculate the Means
mean_price = df["price"].mean()
mean_quantity = round(df["quantity"].mean())
# Fill the missing values
df["price"].fillna(mean_price, inplace=True)
df["quantity"].fillna(mean_quantity, inplace=True)
# Remove duplicate rows
df.drop_duplicates(inplace=True)
print(df)
The result:
product price quantity manufacturer category
0 computer 1000.0 10.0 A electronics
1 monitor 300.0 5.0 B electronics
2 desk 330.0 8.0 C furniture
3 chair 150.0 3.0 D furniture
5 keyboard 50.0 6.0 A electronics
Step 6: Data Transformation
To convert the “quantity” column to integers, and add a new column “total_price” that calculates the total price for each product:
import pandas as pd
import numpy as np
data = {
"product": ["computer", "monitor", "desk", "chair", "chair", "keyboard"],
"price": [1000, 300, np.nan, 150, 150, 50],
"quantity": [10, 5, 8, 3, 3, np.nan],
"manufacturer": ["A", "B", "C", "D", "D", "A"],
"category": ["electronics", "electronics", "furniture", "furniture", "furniture", "electronics"],
}
df = pd.DataFrame(data)
# Calculate the Means
mean_price = df["price"].mean()
mean_quantity = round(df["quantity"].mean())
# Fill the missing values
df["price"].fillna(mean_price, inplace=True)
df["quantity"].fillna(mean_quantity, inplace=True)
# Remove duplicate rows
df.drop_duplicates(inplace=True)
# Convert the quantity column to integers and add the total_price column
df["quantity"] = df["quantity"].astype(int)
df["total_price"] = df["price"] * df["quantity"]
print(df)
The result:
product price quantity manufacturer category total_price
0 computer 1000.0 10 A electronics 10000.0
1 monitor 300.0 5 B electronics 1500.0
2 desk 330.0 8 C furniture 2640.0
3 chair 150.0 3 D furniture 450.0
5 keyboard 50.0 6 A electronics 300.0
Step 7: Data Aggregation
To calculate the total quantity and average price for each manufacturer:
import pandas as pd
import numpy as np
data = {
"product": ["computer", "monitor", "desk", "chair", "chair", "keyboard"],
"price": [1000, 300, np.nan, 150, 150, 50],
"quantity": [10, 5, 8, 3, 3, np.nan],
"manufacturer": ["A", "B", "C", "D", "D", "A"],
"category": ["electronics", "electronics", "furniture", "furniture", "furniture", "electronics"],
}
df = pd.DataFrame(data)
# Calculate the Means
mean_price = df["price"].mean()
mean_quantity = round(df["quantity"].mean())
# Fill the missing values
df["price"].fillna(mean_price, inplace=True)
df["quantity"].fillna(mean_quantity, inplace=True)
# Remove duplicate rows
df.drop_duplicates(inplace=True)
# Convert the quantity column to integers and add the total_price column
df["quantity"] = df["quantity"].astype(int)
df["total_price"] = df["price"] * df["quantity"]
# Calculate the total quantity and average price
manufacturer_stats = (
df.groupby("manufacturer").agg({"quantity": "sum", "price": "mean"}).reset_index()
)
print(manufacturer_stats)
The result:
manufacturer quantity price
0 A 16 525.0
1 B 5 300.0
2 C 8 330.0
3 D 3 150.0