Mastering Python Libraries for Messy Datasets

Cleaning messy datasets is an essential aspect of data analysis and manipulation. Python offers powerful libraries such as Pandas and NumPy, which enable efficient cleaning and processing of data.

In this guide, you’ll see the steps to clean messy datasets utilizing Pandas and NumPy.

Steps

Step 1: Installation

If you haven’t already done so, install the Pandas package:

pip install pandas

And the NumPy package:

pip install numpy

Step 2: Creating a Messy Dataset

Now create the following DataFrame that contains messy data, such as missing values, duplicates and so on:

import pandas as pd
import numpy as np

data = {
"product": ["computer", "monitor", "desk", "chair", "chair", "keyboard"],
"price": [1000, 300, np.nan, 150, 150, 50],
"quantity": [10, 5, 8, 3, 3, np.nan],
"manufacturer": ["A", "B", "C", "D", "D", "A"],
"category": ["electronics", "electronics", "furniture", "furniture", "furniture", "electronics"],
}

df = pd.DataFrame(data)

print(df)

The result:

    product   price  quantity  manufacturer     category
0  computer  1000.0      10.0             A  electronics
1   monitor   300.0       5.0             B  electronics
2      desk     NaN       8.0             C    furniture
3     chair   150.0       3.0             D    furniture
4     chair   150.0       3.0             D    furniture
5  keyboard    50.0       NaN             A  electronics

Step 3: Handling Missing Values

Next, fill the missing values in the “price” and “quantity” columns with their respective means:

import pandas as pd
import numpy as np

data = {
"product": ["computer", "monitor", "desk", "chair", "chair", "keyboard"],
"price": [1000, 300, np.nan, 150, 150, 50],
"quantity": [10, 5, 8, 3, 3, np.nan],
"manufacturer": ["A", "B", "C", "D", "D", "A"],
"category": ["electronics", "electronics", "furniture", "furniture", "furniture", "electronics"],
}

df = pd.DataFrame(data)

# Calculate the Means
mean_price = df["price"].mean()
mean_quantity = round(df["quantity"].mean())

# Fill the missing values
df["price"].fillna(mean_price, inplace=True)
df["quantity"].fillna(mean_quantity, inplace=True)

print(df)

The result:

    product   price  quantity  manufacturer     category
0  computer  1000.0      10.0             A  electronics
1   monitor   300.0       5.0             B  electronics
2      desk   330.0       8.0             C    furniture
3     chair   150.0       3.0             D    furniture
4     chair   150.0       3.0             D    furniture
5  keyboard    50.0       6.0             A  electronics

Step 3: Remove Duplicates

To remove duplicate rows from the DataFrame (such as the duplicated “chair” row):

import pandas as pd
import numpy as np

data = {
"product": ["computer", "monitor", "desk", "chair", "chair", "keyboard"],
"price": [1000, 300, np.nan, 150, 150, 50],
"quantity": [10, 5, 8, 3, 3, np.nan],
"manufacturer": ["A", "B", "C", "D", "D", "A"],
"category": ["electronics", "electronics", "furniture", "furniture", "furniture", "electronics"],
}

df = pd.DataFrame(data)

# Calculate the Means
mean_price = df["price"].mean()
mean_quantity = round(df["quantity"].mean())

# Fill the missing values
df["price"].fillna(mean_price, inplace=True)
df["quantity"].fillna(mean_quantity, inplace=True)

# Remove duplicate rows
df.drop_duplicates(inplace=True)

print(df)

The result:

    product   price  quantity  manufacturer     category
0  computer  1000.0      10.0             A  electronics
1   monitor   300.0       5.0             B  electronics
2      desk   330.0       8.0             C    furniture
3     chair   150.0       3.0             D    furniture
5  keyboard    50.0       6.0             A  electronics

Step 6: Data Transformation

To convert the “quantity” column to integers, and add a new column “total_price” that calculates the total price for each product:

import pandas as pd
import numpy as np

data = {
"product": ["computer", "monitor", "desk", "chair", "chair", "keyboard"],
"price": [1000, 300, np.nan, 150, 150, 50],
"quantity": [10, 5, 8, 3, 3, np.nan],
"manufacturer": ["A", "B", "C", "D", "D", "A"],
"category": ["electronics", "electronics", "furniture", "furniture", "furniture", "electronics"],
}

df = pd.DataFrame(data)

# Calculate the Means
mean_price = df["price"].mean()
mean_quantity = round(df["quantity"].mean())

# Fill the missing values
df["price"].fillna(mean_price, inplace=True)
df["quantity"].fillna(mean_quantity, inplace=True)

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Convert the quantity column to integers and add the total_price column
df["quantity"] = df["quantity"].astype(int)
df["total_price"] = df["price"] * df["quantity"]

print(df)

The result:

    product   price  quantity  manufacturer     category  total_price
0  computer  1000.0        10             A  electronics      10000.0
1   monitor   300.0         5             B  electronics       1500.0
2      desk   330.0         8             C    furniture       2640.0
3     chair   150.0         3             D    furniture        450.0
5  keyboard    50.0         6             A  electronics        300.0

Step 7: Data Aggregation

To calculate the total quantity and average price for each manufacturer:

import pandas as pd
import numpy as np

data = {
"product": ["computer", "monitor", "desk", "chair", "chair", "keyboard"],
"price": [1000, 300, np.nan, 150, 150, 50],
"quantity": [10, 5, 8, 3, 3, np.nan],
"manufacturer": ["A", "B", "C", "D", "D", "A"],
"category": ["electronics", "electronics", "furniture", "furniture", "furniture", "electronics"],
}

df = pd.DataFrame(data)

# Calculate the Means
mean_price = df["price"].mean()
mean_quantity = round(df["quantity"].mean())

# Fill the missing values
df["price"].fillna(mean_price, inplace=True)
df["quantity"].fillna(mean_quantity, inplace=True)

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Convert the quantity column to integers and add the total_price column
df["quantity"] = df["quantity"].astype(int)
df["total_price"] = df["price"] * df["quantity"]

# Calculate the total quantity and average price
manufacturer_stats = (
df.groupby("manufacturer").agg({"quantity": "sum", "price": "mean"}).reset_index()
)

print(manufacturer_stats)

The result:

  manufacturer  quantity  price
0            A        16  525.0
1            B         5  300.0
2            C         8  330.0
3            D         3  150.0

Additional Resources

Leave a Comment