Use Pandas to Calculate Stats from an Imported CSV file

Pandas is a powerful Python package to perform statistical analysis for a given set of data. In this short post, I’m going to show you how to use pandas, in order to calculate stats from an imported CSV file.

To illustrate, I’ll use a simple example, where the following data will be saved in a CSV file called ‘stats’ (and where the salary column is saved as a number format):

NameSalaryCountry
Dan40000USA
Elizabeth32000Brazil
Jon45000Italy
Maria54000USA
Mark72000USA
Bill62000Brazil
Jess92000Italy
Julia55000USA
Jeff35000Italy
Ben48000Brazil

 

The above data-set is saved on my computer under the following path (you’ll need to change the path, in the Python code below, to the location where your CSV file is stored on your computer):

C:\Users\Doron E\Desktop\pandas\stats.csv

Now that we have the data-set ready, and it is saved at our desired location within a CSV file, we will calculate the following statistics using the pandas package:

  • Mean (i.e., average) salary
  • Total sum of salaries
  • Maximum salary
  • Minimum salary
  • Count of salaries
  • Median salary
  • Standard deviation of salaries
  • Variance of of salaries

In addition, we will also do some grouping calculations:

  • Sum of salaries, grouped by the Country column
  • Count of salaries, grouped by the Country column

Use Pandas to Calculate Stats from an Imported CSV file

Now let’s look at the Python code used in order to calculate stats from an import CSV file using pandas. Please note that I included additional comments within the code to clarify the syntax used.

 

import pandas as pd
#read the CSV file
df = pd.read_csv (r'C:\Users\Doron E\Desktop\pandas\stats.csv',header='infer') # put 'r' before the path name to address any special characters such as '\'. Also do not forget to put the file name in the path (here it is 'stats') as well as the suffix '.csv'

# block 1 - simple stats
mean1 = df['Salary'].mean() # calculates the average of all values under the Salary column
sum1 = df['Salary'].sum() # calculates the sum of all values under the Salary column
max1 = df['Salary'].max() # calculates the max of all values under the Salary column
min1 = df['Salary'].min() # calculates the min of all values under the Salary column
count1 = df['Salary'].count() # calculates the count of all values under the Salary column
median1 = df['Salary'].median () # calculates the median of all values under the Salary column
std1 = df['Salary'].std () # calculates the std of all values under the Salary column
var1 = df['Salary'].var () # calculates the var of all values under the Salary column

# block 2 - group by
groupby_sum1 = df.groupby(['Country']).sum() # calculates the sum of salaries, grouped by the Country column
groupby_count1 = df.groupby(['Country']).count() # calculates the count of salaries, grouped by the Country column

# print block 1
print ('The average of values under the Salary column is: ' + str(mean1))
print ('The sum of values under the Salary column is: ' + str(sum1))
print ('The max of values under the Salary column is: ' + str(max1))
print ('The min of values under the Salary column is: ' + str(min1))
print ('The count of values under the Salary column is: ' + str(count1))
print ('The median of values under the Salary column is: ' + str(median1))
print ('The std of values under the Salary column is: ' + str(std1))
print ('The var of values under the Salary column is: ' + str(var1))

# print block 2
print ('The sum of values, grouped by the Country column, is: ' + str(groupby_sum1))
print ('The count of values, grouped by the Country column, is: ' + str(groupby_count1))

 

You should get the following results when you run the above code in Python:

 

Use Pandas to Calculate Stats from an Imported CSV file

 

We just saw how you can calculate simple stats using pandas. You may also want to check the pandas documentation to learn more about the power of this great library!