Example of Random Forest in Python

In this guide, I’ll show you an example of Random Forest in Python.

In general, Random Forest is a supervised machine learning algorithm that can be used for both Classification and Regression.

By the end of this guide, you’ll be able to create the following Graphical User Interface (GUI) to perform predictions based on the Random Forest model:

[Screenshot: the tkinter GUI]

The Example

Let’s say that your goal is to predict whether a candidate will get admitted to a prestigious university. There are 3 possible outcomes:

  • Candidate is admitted – represented by the value of 2
  • Candidate is on the waiting list – represented by the value of 1
  • Candidate is not admitted – represented by the value of 0

Below is the full dataset that will be used for our example:

[Image: the full dataset of 40 candidates]

In our example:

  • The gmat, gpa, work_experience and age are the feature variables
  • The admitted column represents the label/target

Note that the above dataset contains 40 observations. In practice, you may need a larger sample size to get more accurate results.

Steps to Apply Random Forest in Python

Step 1: Install the Relevant Python Packages

If you haven’t already done so, install the following Python Packages:

  • pandas – used to create the DataFrame to capture the dataset in Python
  • sklearn – used to build the Random Forest model
  • seaborn – used to visualize the Confusion Matrix
  • matplotlib – used to display charts

You may use the pip install method to install those packages.
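
For example, you can run the following command in your terminal (note that sklearn is distributed under the package name scikit-learn):

pip install pandas scikit-learn seaborn matplotlib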

You’ll then need to import the Python packages as follows:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt

Step 2: Create the DataFrame

Next, create the DataFrame to capture the dataset for our example:

import pandas as pd

candidates = {'gmat': [780,750,690,710,780,730,690,720,740,690,610,690,710,680,770,610,580,650,540,590,620,600,550,550,570,670,660,580,650,760,640,620,660,660,680,650,670,580,590,790],
              'gpa': [4,3.9,3.3,3.7,3.9,3.7,2.3,3.3,3.3,1.7,2.7,3.7,3.7,3.3,3.3,3,2.7,3.7,2.7,2.3,3.3,2,2.3,2.7,3,3.3,3.7,2.3,3.7,3.3,3,2.7,4,3.3,3.3,2.3,2.7,3.3,1.7,3.7],
              'work_experience': [3,4,3,5,4,6,1,4,5,1,3,5,6,4,3,1,4,6,2,3,2,1,4,1,2,6,4,2,6,5,1,2,4,6,5,1,2,1,4,5],
              'age': [25,28,24,27,26,31,24,25,28,23,25,27,30,28,26,23,29,31,26,26,25,24,28,23,25,29,28,26,30,30,23,24,27,29,28,22,23,24,28,31],
              'admitted': [2,2,1,2,2,2,0,2,2,0,0,2,2,1,2,0,0,1,0,0,1,0,0,0,0,1,1,0,1,2,0,0,1,1,1,0,0,0,0,2]
              }

df = pd.DataFrame(candidates,columns= ['gmat', 'gpa','work_experience','age','admitted'])
print (df)

Alternatively, you can import the data into Python from an external file.
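
For instance, if your data is stored in a CSV file, you could load it with pandas (here, ‘candidates.csv’ is just a hypothetical file name; the file would need to contain the same columns):

df = pd.read_csv('candidates.csv')  # hypothetical file with the same columns as above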

Step 3: Apply the Random Forest in Python

Now, set the features (represented as X) and the label (represented as y):

X = df[['gmat', 'gpa','work_experience','age']]
y = df['admitted']

Then, apply train_test_split. For example, you can set the test size to 0.25, so that 25% of the dataset is used for model testing, while the remaining 75% is used for model training:

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
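
You can quickly confirm the split: with 40 observations and a test size of 0.25, you should get 30 training records and 10 test records:

print(X_train.shape, X_test.shape)  # expected: (30, 4) (10, 4)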

Apply the Random Forest as follows:

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
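
Note that the Random Forest itself is randomized, so your results may differ slightly from run to run. If you’d like fully reproducible results, you can optionally pass a random_state to the classifier as well:

clf = RandomForestClassifier(n_estimators=100, random_state=0)  # optional: fixes the model's randomness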

Next, add this code to get the Confusion Matrix:

confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)

Finally, print the Accuracy:

print('Accuracy: ',metrics.accuracy_score(y_test, y_pred))

Putting all the above components together:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt

candidates = {'gmat': [780,750,690,710,780,730,690,720,740,690,610,690,710,680,770,610,580,650,540,590,620,600,550,550,570,670,660,580,650,760,640,620,660,660,680,650,670,580,590,790],
              'gpa': [4,3.9,3.3,3.7,3.9,3.7,2.3,3.3,3.3,1.7,2.7,3.7,3.7,3.3,3.3,3,2.7,3.7,2.7,2.3,3.3,2,2.3,2.7,3,3.3,3.7,2.3,3.7,3.3,3,2.7,4,3.3,3.3,2.3,2.7,3.3,1.7,3.7],
              'work_experience': [3,4,3,5,4,6,1,4,5,1,3,5,6,4,3,1,4,6,2,3,2,1,4,1,2,6,4,2,6,5,1,2,4,6,5,1,2,1,4,5],
              'age': [25,28,24,27,26,31,24,25,28,23,25,27,30,28,26,23,29,31,26,26,25,24,28,23,25,29,28,26,30,30,23,24,27,29,28,22,23,24,28,31],
              'admitted': [2,2,1,2,2,2,0,2,2,0,0,2,2,1,2,0,0,1,0,0,1,0,0,0,0,1,1,0,1,2,0,0,1,1,1,0,0,0,0,2]
              }

df = pd.DataFrame(candidates,columns= ['gmat', 'gpa','work_experience','age','admitted'])
#print (df)

X = df[['gmat', 'gpa','work_experience','age']]
y = df['admitted']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)

confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)

print('Accuracy: ',metrics.accuracy_score(y_test, y_pred))
plt.show()  # display the Confusion Matrix heatmap

Run the code in Python, and you’ll get an Accuracy of 0.8, followed by the Confusion Matrix (since the Random Forest is randomized, your exact results may vary slightly from run to run):

[Image: the Confusion Matrix heatmap]

You can also derive the Accuracy from the Confusion Matrix:

Accuracy = (Sum of the values on the main diagonal) / (Sum of all the values in the matrix)

And for our example:

Accuracy = (4+2+2)/(4+2+2+1+1) = 0.8
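
If you’d like to verify this arithmetic in code, here is a minimal sketch that derives the Accuracy directly from the crosstab created earlier (numpy is installed together with pandas):

import numpy as np

diagonal_sum = np.diag(confusion_matrix).sum()  # correct predictions
total_sum = confusion_matrix.to_numpy().sum()   # all predictions
print('Accuracy: ', diagonal_sum/total_sum)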

Let’s now dive deeper into the results by printing the following two components in the Python code:

  • print (X_test)
  • print (y_pred)

Here is the code used:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

candidates = {'gmat': [780,750,690,710,780,730,690,720,740,690,610,690,710,680,770,610,580,650,540,590,620,600,550,550,570,670,660,580,650,760,640,620,660,660,680,650,670,580,590,790],
              'gpa': [4,3.9,3.3,3.7,3.9,3.7,2.3,3.3,3.3,1.7,2.7,3.7,3.7,3.3,3.3,3,2.7,3.7,2.7,2.3,3.3,2,2.3,2.7,3,3.3,3.7,2.3,3.7,3.3,3,2.7,4,3.3,3.3,2.3,2.7,3.3,1.7,3.7],
              'work_experience': [3,4,3,5,4,6,1,4,5,1,3,5,6,4,3,1,4,6,2,3,2,1,4,1,2,6,4,2,6,5,1,2,4,6,5,1,2,1,4,5],
              'age': [25,28,24,27,26,31,24,25,28,23,25,27,30,28,26,23,29,31,26,26,25,24,28,23,25,29,28,26,30,30,23,24,27,29,28,22,23,24,28,31],
              'admitted': [2,2,1,2,2,2,0,2,2,0,0,2,2,1,2,0,0,1,0,0,1,0,0,0,0,1,1,0,1,2,0,0,1,1,1,0,0,0,0,2]
              }

df = pd.DataFrame(candidates,columns= ['gmat', 'gpa','work_experience','age','admitted'])
#print (df)

X = df[['gmat', 'gpa','work_experience','age']]
y = df['admitted']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

print (X_test) #test dataset (without the actual outcome)
print (y_pred) #predicted values

Recall that our original dataset had 40 observations. Since we set the test size to 0.25, the Confusion Matrix displays the results for a total of 10 records (= 40 × 0.25). These are the 10 test records:

[Output: the 10 test records (X_test)]

The prediction was also made for those 10 records (where 2 = admitted, 1 = waiting list, and 0 = not admitted):

[Output: the predicted values (y_pred)]

In the original dataset, you’ll see that for the test data, we got the correct results 8 out of 10 times:

[Image: the original dataset with the 10 test records highlighted]

This is consistent with the accuracy level of 80%.
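
To see this comparison directly in Python, you can also place the actual and predicted values side by side (a small sketch that reuses the variables defined above):

comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison)  # rows where the two columns differ are the misclassified records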

Step 4: Perform a Prediction

Let’s now perform a prediction to determine whether a new candidate will get admitted based on the following information:

  • gmat = 730
  • gpa = 3.7
  • work_experience = 4
  • age = 27

You’ll then need to add this syntax to make the prediction:

prediction = clf.predict([[730,3.7,4,27]]) 
print ('Predicted Result: ', prediction)
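
Note that recent versions of sklearn may print a warning at this point, because the model was fitted on a DataFrame with column names, while the prediction input is a plain list. If you’d like to avoid the warning, you can wrap the new observation in a DataFrame with matching columns:

prediction = clf.predict(pd.DataFrame([[730,3.7,4,27]], columns=X.columns))
print ('Predicted Result: ', prediction)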

This is how the full code would look:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

candidates = {'gmat': [780,750,690,710,780,730,690,720,740,690,610,690,710,680,770,610,580,650,540,590,620,600,550,550,570,670,660,580,650,760,640,620,660,660,680,650,670,580,590,790],
              'gpa': [4,3.9,3.3,3.7,3.9,3.7,2.3,3.3,3.3,1.7,2.7,3.7,3.7,3.3,3.3,3,2.7,3.7,2.7,2.3,3.3,2,2.3,2.7,3,3.3,3.7,2.3,3.7,3.3,3,2.7,4,3.3,3.3,2.3,2.7,3.3,1.7,3.7],
              'work_experience': [3,4,3,5,4,6,1,4,5,1,3,5,6,4,3,1,4,6,2,3,2,1,4,1,2,6,4,2,6,5,1,2,4,6,5,1,2,1,4,5],
              'age': [25,28,24,27,26,31,24,25,28,23,25,27,30,28,26,23,29,31,26,26,25,24,28,23,25,29,28,26,30,30,23,24,27,29,28,22,23,24,28,31],
              'admitted': [2,2,1,2,2,2,0,2,2,0,0,2,2,1,2,0,0,1,0,0,1,0,0,0,0,1,1,0,1,2,0,0,1,1,1,0,0,0,0,2]
              }

df = pd.DataFrame(candidates,columns= ['gmat', 'gpa','work_experience','age','admitted'])
#print (df)

X = df[['gmat', 'gpa','work_experience','age']]
y = df['admitted']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

prediction = clf.predict([[730,3.7,4,27]])
print ('Predicted Result: ', prediction)

Once you run the code, you’ll get the value of 2, which means that the candidate is expected to be admitted:

[Output: Predicted Result: [2]]

You can take things further by creating a simple Graphical User Interface (GUI) where you’ll be able to input the feature variables in order to get the prediction.

Here is the full code that you can apply to create the GUI (based on the tkinter package):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import tkinter as tk 

candidates = {'gmat': [780,750,690,710,780,730,690,720,740,690,610,690,710,680,770,610,580,650,540,590,620,600,550,550,570,670,660,580,650,760,640,620,660,660,680,650,670,580,590,790],
              'gpa': [4,3.9,3.3,3.7,3.9,3.7,2.3,3.3,3.3,1.7,2.7,3.7,3.7,3.3,3.3,3,2.7,3.7,2.7,2.3,3.3,2,2.3,2.7,3,3.3,3.7,2.3,3.7,3.3,3,2.7,4,3.3,3.3,2.3,2.7,3.3,1.7,3.7],
              'work_experience': [3,4,3,5,4,6,1,4,5,1,3,5,6,4,3,1,4,6,2,3,2,1,4,1,2,6,4,2,6,5,1,2,4,6,5,1,2,1,4,5],
              'age': [25,28,24,27,26,31,24,25,28,23,25,27,30,28,26,23,29,31,26,26,25,24,28,23,25,29,28,26,30,30,23,24,27,29,28,22,23,24,28,31],
              'admitted': [2,2,1,2,2,2,0,2,2,0,0,2,2,1,2,0,0,1,0,0,1,0,0,0,0,1,1,0,1,2,0,0,1,1,1,0,0,0,0,2]
              }

df = pd.DataFrame(candidates,columns= ['gmat', 'gpa','work_experience','age','admitted'])
#print (df)

X = df[['gmat', 'gpa','work_experience','age']]
y = df['admitted']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

# tkinter GUI
root= tk.Tk()

canvas1 = tk.Canvas(root, width = 500, height = 350)
canvas1.pack()

# GMAT
label1 = tk.Label(root, text='            GMAT:')
canvas1.create_window(100, 100, window=label1)

entry1 = tk.Entry (root)
canvas1.create_window(270, 100, window=entry1)

# GPA
label2 = tk.Label(root, text='GPA:     ')
canvas1.create_window(120, 120, window=label2)

entry2 = tk.Entry (root)
canvas1.create_window(270, 120, window=entry2)

# work_experience
label3 = tk.Label(root, text='     Work Experience: ')
canvas1.create_window(140, 140, window=label3)

entry3 = tk.Entry (root)
canvas1.create_window(270, 140, window=entry3)

# Age input
label4 = tk.Label(root, text='Age:                               ')
canvas1.create_window(160, 160, window=label4)

entry4 = tk.Entry (root)
canvas1.create_window(270, 160, window=entry4)

def values(): 
    global gmat
    gmat = float(entry1.get()) 
    
    global gpa
    gpa = float(entry2.get()) 
    
    global work_experience
    work_experience = float(entry3.get()) 
    
    global age
    age = float(entry4.get()) 
    
    Prediction_result  = ('  Predicted Result: ', clf.predict([[gmat,gpa,work_experience,age]]))
    label_Prediction = tk.Label(root, text= Prediction_result, bg='sky blue')
    canvas1.create_window(270, 280, window=label_Prediction)
    
button1 = tk.Button (root, text='      Predict      ',command=values, bg='green', fg='white', font=11)
canvas1.create_window(270, 220, window=button1)
 
root.mainloop()

Run the code, and you’ll get this display:

[Screenshot: the GUI display]

Type the following values for the new candidate:

[Screenshot: the GUI with the values entered]

Once you are done entering the values in the entry boxes, click on the ‘Predict’ button and you’ll get the prediction of 2 (i.e., the candidate is expected to get admitted):

[Screenshot: the prediction displayed in the GUI]

You may try different combinations of values to see the predicted results.
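
One caveat: the values function above will raise a ValueError if any entry box is left empty or contains non-numeric text. Here is a minimal, optional variation of that function which guards against invalid input (the error label is just illustrative):

def values():
    try:
        gmat = float(entry1.get())
        gpa = float(entry2.get())
        work_experience = float(entry3.get())
        age = float(entry4.get())
    except ValueError:
        # show a short message instead of crashing on bad input
        label_error = tk.Label(root, text='Please enter numeric values in all boxes', bg='salmon')
        canvas1.create_window(270, 280, window=label_error)
        return

    Prediction_result = ('  Predicted Result: ', clf.predict([[gmat,gpa,work_experience,age]]))
    label_Prediction = tk.Label(root, text=Prediction_result, bg='sky blue')
    canvas1.create_window(270, 280, window=label_Prediction)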

How to Determine the Importance of Features

In the last section of this guide, you’ll see how to obtain the importance scores for the features. Generally speaking, you may consider excluding features that have a low score.

Here is the syntax that you’ll need to add in order to get the feature importance scores:

featureImportances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(featureImportances)

sn.barplot(x=featureImportances, y=featureImportances.index)
plt.xlabel('Feature Importance')
plt.show()

And here is the complete Python code (make sure that the matplotlib package is also imported):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import seaborn as sn
import matplotlib.pyplot as plt

candidates = {'gmat': [780,750,690,710,780,730,690,720,740,690,610,690,710,680,770,610,580,650,540,590,620,600,550,550,570,670,660,580,650,760,640,620,660,660,680,650,670,580,590,790],
              'gpa': [4,3.9,3.3,3.7,3.9,3.7,2.3,3.3,3.3,1.7,2.7,3.7,3.7,3.3,3.3,3,2.7,3.7,2.7,2.3,3.3,2,2.3,2.7,3,3.3,3.7,2.3,3.7,3.3,3,2.7,4,3.3,3.3,2.3,2.7,3.3,1.7,3.7],
              'work_experience': [3,4,3,5,4,6,1,4,5,1,3,5,6,4,3,1,4,6,2,3,2,1,4,1,2,6,4,2,6,5,1,2,4,6,5,1,2,1,4,5],
              'age': [25,28,24,27,26,31,24,25,28,23,25,27,30,28,26,23,29,31,26,26,25,24,28,23,25,29,28,26,30,30,23,24,27,29,28,22,23,24,28,31],
              'admitted': [2,2,1,2,2,2,0,2,2,0,0,2,2,1,2,0,0,1,0,0,1,0,0,0,0,1,1,0,1,2,0,0,1,1,1,0,0,0,0,2]
              }

df = pd.DataFrame(candidates,columns= ['gmat', 'gpa','work_experience','age','admitted'])


X = df[['gmat', 'gpa','work_experience','age']]
y = df['admitted']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

featureImportances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(featureImportances)

sn.barplot(x=featureImportances, y=featureImportances.index)
plt.xlabel('Feature Importance')
plt.show()

As you may observe, age has a low importance score (i.e., 0.046941; the exact values will vary slightly between runs), and may therefore be excluded from the model:

[Output: the feature importance scores and bar plot]
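
If you do decide to exclude age, refitting the model without it only requires changing the feature list (a sketch, assuming the same df and imports as above, plus metrics for the accuracy check):

from sklearn import metrics

X = df[['gmat', 'gpa','work_experience']]  # 'age' excluded
y = df['admitted']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
print('Accuracy: ', metrics.accuracy_score(y_test, clf.predict(X_test)))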