Use Pandas to Calculate Stats from an Imported CSV file

Pandas is a powerful Python package that can be used to perform statistical analysis. In this short post, I’ll show you how to use pandas to calculate stats from an imported CSV file.

The Example

To demonstrate how to calculate stats from an imported CSV file, I’ll use a simple example with the following data-set:

 

NameSalaryCountry
Dan40000USA
Elizabeth32000Brazil
Jon45000Italy
Maria54000USA
Mark72000USA
Bill62000Brazil
Jess92000Italy
Julia55000USA
Jeff35000Italy
Ben48000Brazil

Steps to Calculate Stats from an Imported CSV file

Step 1: Copy the data-set into a CSV file

To start, you’ll need to copy the above data-set into a CSV file. Then rename the CSV file as stats:

pandas stats

Step 2: Import the CSV file into Python

Next, you’ll need to import the CSV file into Python using this structure:

import pandas as pd
df = pd.read_csv (r'Path where the CSV file is stored\File name.csv')

For example, this is the path where I stored the CSV file:

 C:\Users\doron\Desktop\stats.csv

So the code to import the stats CSV file in my case is:

import pandas as pd
df = pd.read_csv (r'C:\Users\doron\Desktop\stats.csv') 

In the Python code below, you’ll need to change the path to the location where the CSV file is stored on your computer.

Step 3: Use Pandas to Calculate Stats from an Imported CSV file

For the final step, the goal is to calculate the following statistics using the pandas package:

  • Mean salary
  • Total sum of salaries
  • Maximum salary
  • Minimum salary
  • Count of salaries
  • Median salary
  • Standard deviation of salaries
  • Variance of of salaries

In addition, we will also do some grouping calculations:

  • Sum of salaries, grouped by the Country column
  • Count of salaries, grouped by the Country column

Run the following Python code (after adjusting the path name) in order to calculate the stats from the imported CSV file using pandas:

 

import pandas as pd
df = pd.read_csv (r'C:\Users\doron\Desktop\stats.csv') 

# block 1 - simple stats
mean1 = df['Salary'].mean()
sum1 = df['Salary'].sum()
max1 = df['Salary'].max()
min1 = df['Salary'].min()
count1 = df['Salary'].count()
median1 = df['Salary'].median() 
std1 = df['Salary'].std() 
var1 = df['Salary'].var()  

# block 2 - group by
groupby_sum1 = df.groupby(['Country']).sum() 
groupby_count1 = df.groupby(['Country']).count()

# print block 1
print ('Mean salary: ' + str(mean1))
print ('Sum of salaries: ' + str(sum1))
print ('Max salary: ' + str(max1))
print ('Min salary: ' + str(min1))
print ('Count of salaries: ' + str(count1))
print ('Median salary: ' + str(median1))
print ('Std of salaries: ' + str(std1))
print ('Var of salaries: ' + str(var1))

# print block 2
print ('Sum of values, grouped by the Country: ' + str(groupby_sum1))
print ('Count of values, grouped by the Country: ' + str(groupby_count1))

 

Once you run the code in Python, you’ll get the following results:

 

Use Pandas to Calculate Stats from an Imported CSV file

 

Here is a table that summarizes the operations performed in the code:

 

VariableSyntax usedDescription
mean1
df['Salary'].mean()
Average of all values under the Salary column
sum1
df['Salary'].sum()
Sum of all values under the Salary column
max1
df['Salary'].max()
Max of all values under the Salary column
min1
df['Salary'].min()
Min of all values under the Salary column
count1
df['Salary'].count()
Count of all values under the Salary column
median1
df['Salary'].median()
Median of all values under the Salary column
std1
df['Salary'].std()
Standard deviation of all values under the Salary column
var1
df['Salary'].var()
Variance of all values under the Salary column
groupby_sum1
df.groupby(['Country']).sum()
Sum of salaries, grouped by the Country column
groupby_count1
df.groupby(['Country']).count()
Count of salaries, grouped by the Country column

 

You just saw how to calculate simple stats using pandas. You may also want to check the pandas documentation to learn more about the power of this great library!