To get the descriptive statistics for a specific column in your DataFrame:
df['dataframe_column'].describe()
To get the descriptive statistics for an entire DataFrame:
df.describe(include='all')
Steps to Get the Descriptive Statistics for Pandas DataFrame
Step 1: Collect the Data
To start, you’ll need to collect the data for your DataFrame.
For example, here is a simple dataset that can be used for our DataFrame:
product | price | year |
A | 22000 | 2014 |
B | 27000 | 2015 |
C | 25000 | 2016 |
C | 29000 | 2017 |
D | 35000 | 2018 |
Step 2: Create the DataFrame
Next, create the DataFrame based on the data collected.
Here is the code to create the DataFrame for our example:
import pandas as pd

data = {'product': ['A', 'B', 'C', 'C', 'D'],
        'price': [22000, 27000, 25000, 29000, 35000],
        'year': [2014, 2015, 2016, 2017, 2018]
        }

df = pd.DataFrame(data)

print(df)
Run the code in Python, and you’ll get the following DataFrame:
  product  price  year
0       A  22000  2014
1       B  27000  2015
2       C  25000  2016
3       C  29000  2017
4       D  35000  2018
Step 3: Get the Descriptive Statistics for Pandas DataFrame
Once you have your DataFrame ready, you’ll be able to get the descriptive statistics using the template that you saw at the beginning of this guide:
df['dataframe_column'].describe()
Let’s say that you want to get the descriptive statistics for the ‘price’ field, which contains numerical data. In that case, the syntax to apply is:
df['price'].describe()
So the complete Python code would look like this:
import pandas as pd

data = {'product': ['A', 'B', 'C', 'C', 'D'],
        'price': [22000, 27000, 25000, 29000, 35000],
        'year': [2014, 2015, 2016, 2017, 2018]
        }

df = pd.DataFrame(data)

stats_numeric = df['price'].describe()

print(stats_numeric)
Once you run the code, you’ll get the descriptive statistics for the ‘price’ field:
count 5.000000
mean 27600.000000
std 4878.524367
min 22000.000000
25% 25000.000000
50% 27000.000000
75% 29000.000000
max 35000.000000
Name: price, dtype: float64
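The 25%, 50% and 75% rows are the default percentiles. As a minimal sketch, assuming you would like to see the 10th and 90th percentiles as well (these two values are chosen here only for illustration), describe() accepts a percentiles argument with values between 0 and 1:

```python
import pandas as pd

data = {'product': ['A', 'B', 'C', 'C', 'D'],
        'price': [22000, 27000, 25000, 29000, 35000],
        'year': [2014, 2015, 2016, 2017, 2018]
        }

df = pd.DataFrame(data)

# Ask for the 10th and 90th percentiles (the defaults are 25%, 50% and 75%)
stats_custom = df['price'].describe(percentiles=[0.10, 0.90])

print(stats_custom)
```

The requested percentiles appear as rows labeled ‘10%’ and ‘90%’ in the result.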
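You’ll notice that the output is displayed with 6 decimal places. To get whole numbers instead, you may add astype(int) to the code. Note that astype(int) truncates the decimal part rather than rounding it, so a std of 4878.524367 becomes 4878.
This is what the code would look like: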
import pandas as pd

data = {'product': ['A', 'B', 'C', 'C', 'D'],
        'price': [22000, 27000, 25000, 29000, 35000],
        'year': [2014, 2015, 2016, 2017, 2018]
        }

df = pd.DataFrame(data)

stats_numeric = df['price'].describe().astype(int)

print(stats_numeric)
Run the code, and you’ll get only integers (the dtype may appear as int32 or int64, depending on your platform and NumPy version):
count 5
mean 27600
std 4878
min 22000
25% 25000
50% 27000
75% 29000
max 35000
Name: price, dtype: int32
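Since astype(int) drops the decimals instead of rounding them, you may prefer to round first. The following is a small sketch of that alternative; the variable names stats_truncated and stats_rounded are used here only for illustration:

```python
import pandas as pd

data = {'product': ['A', 'B', 'C', 'C', 'D'],
        'price': [22000, 27000, 25000, 29000, 35000],
        'year': [2014, 2015, 2016, 2017, 2018]
        }

df = pd.DataFrame(data)

# Truncation: the decimal part is simply dropped
stats_truncated = df['price'].describe().astype(int)

# Rounding: round to the nearest whole number first, then convert to integers
stats_rounded = df['price'].describe().round().astype(int)

print(stats_truncated['std'])
print(stats_rounded['std'])
```

If you only want fewer decimals rather than integers, round(2) on its own keeps the values as floats with two decimal places.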
Descriptive Statistics for Categorical Data
So far, you have seen how to get the descriptive statistics for numerical data, using the ‘price’ field.
However, you can also get descriptive statistics for categorical data. In that case, describe() reports the count, the number of unique values, the most frequent value (top) and how often it occurs (freq).
For instance, you can get the descriptive statistics for the ‘product’ field using this code:
import pandas as pd

data = {'product': ['A', 'B', 'C', 'C', 'D'],
        'price': [22000, 27000, 25000, 29000, 35000],
        'year': [2014, 2015, 2016, 2017, 2018]
        }

df = pd.DataFrame(data)

stats_categorical = df['product'].describe()

print(stats_categorical)
Here are the results:
count 5
unique 4
top C
freq 2
Name: product, dtype: object
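To see where the ‘unique’, ‘top’ and ‘freq’ values come from, you can count the occurrences of each product directly. This is a minimal sketch using value_counts() and nunique(); the variable names are illustrative:

```python
import pandas as pd

data = {'product': ['A', 'B', 'C', 'C', 'D'],
        'price': [22000, 27000, 25000, 29000, 35000],
        'year': [2014, 2015, 2016, 2017, 2018]
        }

df = pd.DataFrame(data)

# Number of times each product appears, most frequent first
counts = df['product'].value_counts()

# Number of distinct products
unique_products = df['product'].nunique()

print(counts)
print(unique_products)
```

The first entry of counts corresponds to ‘top’ and ‘freq’ in the describe() output, while unique_products corresponds to ‘unique’.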
Get the Descriptive Statistics for the Entire DataFrame
Finally, you may apply the following template to get the descriptive statistics for the entire DataFrame:
df.describe(include='all')
So the complete Python code would look like this:
import pandas as pd

data = {'product': ['A', 'B', 'C', 'C', 'D'],
        'price': [22000, 27000, 25000, 29000, 35000],
        'year': [2014, 2015, 2016, 2017, 2018]
        }

df = pd.DataFrame(data)

stats = df.describe(include='all')

print(stats)
Run the code, and you’ll get the following result:
       product         price         year
count        5      5.000000     5.000000
unique       4           NaN          NaN
top          C           NaN          NaN
freq         2           NaN          NaN
mean       NaN  27600.000000  2016.000000
std        NaN   4878.524367     1.581139
min        NaN  22000.000000  2014.000000
25%        NaN  25000.000000  2015.000000
50%        NaN  27000.000000  2016.000000
75%        NaN  29000.000000  2017.000000
max        NaN  35000.000000  2018.000000
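The NaN values appear because the numeric and categorical statistics are combined into a single table. As a sketch of one way to avoid them, assuming you are fine with two separate summaries, you can call describe() without arguments (which, for a DataFrame with mixed column types like this one, summarizes only the numeric columns) and then use the exclude argument for the remaining columns:

```python
import numpy as np
import pandas as pd

data = {'product': ['A', 'B', 'C', 'C', 'D'],
        'price': [22000, 27000, 25000, 29000, 35000],
        'year': [2014, 2015, 2016, 2017, 2018]
        }

df = pd.DataFrame(data)

# Default behavior: only the numeric columns ('price' and 'year') are summarized
stats_numeric_only = df.describe()

# Excluding numeric dtypes leaves only the 'product' column
stats_non_numeric = df.describe(exclude=[np.number])

print(stats_numeric_only)
print(stats_non_numeric)
```

Each table then contains only the rows that apply to its columns, so no NaN placeholders are needed.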
Breaking Down the Descriptive Statistics
You can further break down the descriptive statistics into the following individual measures:
Count:
df['dataframe_column'].count()
Mean:
df['dataframe_column'].mean()
Standard deviation:
df['dataframe_column'].std()
Minimum:
df['dataframe_column'].min()
0.25 Quantile:
df['dataframe_column'].quantile(q=0.25)
0.50 Quantile (Median):
df['dataframe_column'].quantile(q=0.50)
0.75 Quantile:
df['dataframe_column'].quantile(q=0.75)
Maximum:
df['dataframe_column'].max()
For our example, df['dataframe_column'] is df['price'].
Therefore, the full Python code would look as follows:
import pandas as pd

data = {'product': ['A', 'B', 'C', 'C', 'D'],
        'price': [22000, 27000, 25000, 29000, 35000],
        'year': [2014, 2015, 2016, 2017, 2018]
        }

df = pd.DataFrame(data)

count1 = df['price'].count()
print('count: ' + str(count1))

mean1 = df['price'].mean()
print('mean: ' + str(mean1))

std1 = df['price'].std()
print('std: ' + str(std1))

min1 = df['price'].min()
print('min: ' + str(min1))

quantile1 = df['price'].quantile(q=0.25)
print('25%: ' + str(quantile1))

quantile2 = df['price'].quantile(q=0.50)
print('50%: ' + str(quantile2))

quantile3 = df['price'].quantile(q=0.75)
print('75%: ' + str(quantile3))

max1 = df['price'].max()
print('max: ' + str(max1))
Once you run the code in Python, you’ll get the following stats:
count: 5
mean: 27600.0
std: 4878.524367060188
min: 22000
25%: 25000.0
50%: 27000.0
75%: 29000.0
max: 35000
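If you only need a handful of these measures, a more compact option is to request them in a single call. The sketch below uses agg() with a list of statistic names; the selection of statistics and the variable names are illustrative. It also shows that std() computes the sample standard deviation by default (ddof=1), and that passing ddof=0 gives the population standard deviation instead:

```python
import pandas as pd

data = {'product': ['A', 'B', 'C', 'C', 'D'],
        'price': [22000, 27000, 25000, 29000, 35000],
        'year': [2014, 2015, 2016, 2017, 2018]
        }

df = pd.DataFrame(data)

# Compute several statistics for the 'price' column in one call
summary = df['price'].agg(['count', 'mean', 'std', 'min', 'median', 'max'])
print(summary)

# std() divides by n - 1 by default; ddof=0 divides by n instead
population_std = df['price'].std(ddof=0)
print('population std: ' + str(population_std))
```

The result of agg() is a Series indexed by the statistic names, so an individual value can be retrieved with, for example, summary['mean'].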