To get the descriptive statistics for a specific column in your DataFrame:
df["dataframe_column"].describe()
To get the descriptive statistics for an entire DataFrame:
df.describe(include="all")
Steps
Step 1: Collect the Data
To start, collect the data for your DataFrame.
Here is an example of a dataset:
product | price | year |
A | 22000 | 2014 |
B | 27000 | 2015 |
C | 25000 | 2016 |
C | 29000 | 2017 |
D | 35000 | 2018 |
Step 2: Create the DataFrame
Next, create the DataFrame based on the data collected:
import pandas as pd
data = {
    "product": ["A", "B", "C", "C", "D"],
    "price": [22000, 27000, 25000, 29000, 35000],
    "year": [2014, 2015, 2016, 2017, 2018],
}
df = pd.DataFrame(data)
print(df)
Run the code in Python, and you’ll get the following DataFrame:
product price year
0 A 22000 2014
1 B 27000 2015
2 C 25000 2016
3 C 29000 2017
4 D 35000 2018
Step 3: Get the Descriptive Statistics
To get the descriptive statistics for the “price” column, which contains numerical data:
df["price"].describe()
The full code:
import pandas as pd
data = {
    "product": ["A", "B", "C", "C", "D"],
    "price": [22000, 27000, 25000, 29000, 35000],
    "year": [2014, 2015, 2016, 2017, 2018],
}
df = pd.DataFrame(data)
stats_numeric = df["price"].describe()
print(stats_numeric)
The resulting descriptive statistics for the “price” column:
count 5.000000
mean 27600.000000
std 4878.524367
min 22000.000000
25% 25000.000000
50% 27000.000000
75% 29000.000000
max 35000.000000
Name: price, dtype: float64
Notice that the values are displayed with 6 decimal places. You can convert them to integers using astype(int):
import pandas as pd
data = {
    "product": ["A", "B", "C", "C", "D"],
    "price": [22000, 27000, 25000, 29000, 35000],
    "year": [2014, 2015, 2016, 2017, 2018],
}
df = pd.DataFrame(data)
stats_numeric = df["price"].describe().astype(int)
print(stats_numeric)
Run the code, and you’ll get only integers (depending on your platform, the dtype may be reported as int64 instead of int32):
count 5
mean 27600
std 4878
min 22000
25% 25000
50% 27000
75% 29000
max 35000
Name: price, dtype: int32
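Keep in mind that astype(int) drops the fractional part rather than rounding, so the standard deviation of 4878.52 becomes 4878. If some precision matters, round() may be a better fit. Here is a minimal sketch using the same data (the variable name stats_rounded is only illustrative):

```python
import pandas as pd

data = {
    "product": ["A", "B", "C", "C", "D"],
    "price": [22000, 27000, 25000, 29000, 35000],
    "year": [2014, 2015, 2016, 2017, 2018],
}
df = pd.DataFrame(data)

# Round each statistic to 2 decimal places instead of truncating to integers
stats_rounded = df["price"].describe().round(2)
print(stats_rounded)
```

The result keeps the float64 dtype, so the count is still shown as 5.0 rather than 5.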
Descriptive Statistics for Categorical Data
To get the descriptive statistics for the “product” column, which contains categorical data:
import pandas as pd
data = {
    "product": ["A", "B", "C", "C", "D"],
    "price": [22000, 27000, 25000, 29000, 35000],
    "year": [2014, 2015, 2016, 2017, 2018],
}
df = pd.DataFrame(data)
stats_categorical = df["product"].describe()
print(stats_categorical)
Here are the results:
count 5
unique 4
top C
freq 2
Name: product, dtype: object
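The “top” and “freq” rows report only the most frequent value and how often it occurs. If you need the frequency of every value, value_counts() is one option. A short sketch, where product_counts is an illustrative name:

```python
import pandas as pd

data = {
    "product": ["A", "B", "C", "C", "D"],
    "price": [22000, 27000, 25000, 29000, 35000],
    "year": [2014, 2015, 2016, 2017, 2018],
}
df = pd.DataFrame(data)

# Count how many times each product appears, most frequent first
product_counts = df["product"].value_counts()
print(product_counts)
```

Here “C” appears first with a count of 2, which matches the “top” and “freq” values, and the 4 entries in the result match “unique”.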
Get the Descriptive Statistics for the Entire DataFrame
To get the descriptive statistics for the entire DataFrame:
import pandas as pd
data = {
    "product": ["A", "B", "C", "C", "D"],
    "price": [22000, 27000, 25000, 29000, 35000],
    "year": [2014, 2015, 2016, 2017, 2018],
}
df = pd.DataFrame(data)
stats = df.describe(include="all")
print(stats)
The result:
product price year
count 5 5.000000 5.000000
unique 4 NaN NaN
top C NaN NaN
freq 2 NaN NaN
mean NaN 27600.000000 2016.000000
std NaN 4878.524367 1.581139
min NaN 22000.000000 2014.000000
25% NaN 25000.000000 2015.000000
50% NaN 27000.000000 2016.000000
75% NaN 29000.000000 2017.000000
max NaN 35000.000000 2018.000000
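The NaN entries appear because each statistic applies only to some column types: unique, top, and freq apply to the non-numeric “product” column, while mean, std, min, max, and the quantiles apply to the numeric columns. If the mixed table is hard to read, one approach is to describe the two groups separately. A minimal sketch, assuming NumPy is importable (it is installed with pandas); the variable names are illustrative:

```python
import numpy as np
import pandas as pd

data = {
    "product": ["A", "B", "C", "C", "D"],
    "price": [22000, 27000, 25000, 29000, 35000],
    "year": [2014, 2015, 2016, 2017, 2018],
}
df = pd.DataFrame(data)

# Without arguments, describe() summarizes only the numeric columns
numeric_stats = df.describe()
print(numeric_stats)

# Excluding numeric dtypes leaves only the non-numeric "product" column
non_numeric_stats = df.describe(exclude=[np.number])
print(non_numeric_stats)
```

Each table then contains only the rows that are meaningful for its columns, so no NaN padding is needed.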
Breaking Down the Descriptive Statistics
You can further break down the descriptive statistics into the following individual calculations:
Count:
df["dataframe_column"].count()
Mean:
df["dataframe_column"].mean()
Standard deviation:
df["dataframe_column"].std()
Minimum:
df["dataframe_column"].min()
0.25 Quantile:
df["dataframe_column"].quantile(q=0.25)
0.50 Quantile (Median):
df["dataframe_column"].quantile(q=0.50)
0.75 Quantile:
df["dataframe_column"].quantile(q=0.75)
Maximum:
df["dataframe_column"].max()
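The three quantile calls can also be combined, since quantile() accepts a list of values and returns a Series indexed by them. A brief sketch using the “price” data from this guide (quartiles is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({"price": [22000, 27000, 25000, 29000, 35000]})

# Passing a list returns one value per requested quantile
quartiles = df["price"].quantile([0.25, 0.50, 0.75])
print(quartiles)
```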
Putting everything together:
import pandas as pd
data = {
    "product": ["A", "B", "C", "C", "D"],
    "price": [22000, 27000, 25000, 29000, 35000],
    "year": [2014, 2015, 2016, 2017, 2018],
}
df = pd.DataFrame(data)
statistics = {
    "count": df["price"].count(),
    "mean": df["price"].mean(),
    "std": df["price"].std(),
    "min": df["price"].min(),
    "quantile_25": df["price"].quantile(q=0.25),
    "quantile_50": df["price"].quantile(q=0.50),
    "quantile_75": df["price"].quantile(q=0.75),
    "max": df["price"].max(),
}
for stat, value in statistics.items():
    print(f"{stat}: {value}")
Once you run the code in Python, you’ll get the following stats:
count: 5
mean: 27600.0
std: 4878.524367060188
min: 22000
quantile_25: 25000.0
quantile_50: 27000.0
quantile_75: 29000.0
max: 35000
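As an alternative to building the dictionary by hand, agg() can compute several named statistics in a single call. This is only a sketch of one possible approach, and selected_stats is an illustrative name:

```python
import pandas as pd

data = {
    "product": ["A", "B", "C", "C", "D"],
    "price": [22000, 27000, 25000, 29000, 35000],
    "year": [2014, 2015, 2016, 2017, 2018],
}
df = pd.DataFrame(data)

# Each string names a built-in aggregation; the result is a Series
# indexed by those names
selected_stats = df["price"].agg(["count", "mean", "std", "min", "median", "max"])
print(selected_stats)
```

This is handy when you need only a subset of what describe() returns, or a statistic it does not label by name, such as the median.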