In this tutorial, I’m going to show you how to perform multiple linear regression in Python using both *sklearn* and *statsmodels*.

Here are the topics to be covered:

- Reviewing an example and the data-set to be used in this tutorial
- Checking for Linearity
- Performing the multiple linear regression in Python
- Adding a *tkinter* Graphical User Interface (GUI) to gather input from users, and then display the prediction results

By the end of this tutorial, you will be able to create a simple *tkinter* interface in Python that displays the regression results and returns a prediction for the values a user types in.

While this tutorial focuses on executing the below example in Python, you may wish to check the following source, which provides additional background about linear regression and statsmodels.

## Example of Multiple Linear Regression in Python

In the following example, we will use multiple linear regression to predict the stock index price (i.e., the dependent variable) of a fictitious economy by using 2 independent/input variables:

- Interest Rate
- Unemployment Rate

Please note that you will have to validate that several assumptions are met before you apply linear regression models. Most notably, you have to make sure that a linear relationship exists between the dependent variable and the independent variable/s (more on that under the *checking for linearity* section).

Let’s now jump into the data-set that we’ll be using:

| Year | Month | Interest_Rate | Unemployment_Rate | Stock_Index_Price |
|------|-------|---------------|-------------------|-------------------|
| 2017 | 12 | 2.75 | 5.3 | 1464 |
| 2017 | 11 | 2.5 | 5.3 | 1394 |
| 2017 | 10 | 2.5 | 5.3 | 1357 |
| 2017 | 9 | 2.5 | 5.3 | 1293 |
| 2017 | 8 | 2.5 | 5.4 | 1256 |
| 2017 | 7 | 2.5 | 5.6 | 1254 |
| 2017 | 6 | 2.5 | 5.5 | 1234 |
| 2017 | 5 | 2.25 | 5.5 | 1195 |
| 2017 | 4 | 2.25 | 5.5 | 1159 |
| 2017 | 3 | 2.25 | 5.6 | 1167 |
| 2017 | 2 | 2 | 5.7 | 1130 |
| 2017 | 1 | 2 | 5.9 | 1075 |
| 2016 | 12 | 2 | 6 | 1047 |
| 2016 | 11 | 1.75 | 5.9 | 965 |
| 2016 | 10 | 1.75 | 5.8 | 943 |
| 2016 | 9 | 1.75 | 6.1 | 958 |
| 2016 | 8 | 1.75 | 6.2 | 971 |
| 2016 | 7 | 1.75 | 6.1 | 949 |
| 2016 | 6 | 1.75 | 6.1 | 884 |
| 2016 | 5 | 1.75 | 6.1 | 866 |
| 2016 | 4 | 1.75 | 5.9 | 876 |
| 2016 | 3 | 1.75 | 6.2 | 822 |
| 2016 | 2 | 1.75 | 6.2 | 704 |
| 2016 | 1 | 1.75 | 6.1 | 719 |

To start, you’ll need to import that data into Python. Alternatively, you can create that data-set directly in Python. I’ll show you how to apply both approaches using the *pandas* library.

### Approach#1: Import the data into Python

For the first approach, you may copy the above table into a CSV file and name it ‘Economy’, for example. You will then load that data into a DataFrame.

Here is the Python code that you may use. Note that you’ll need to change the path name to the location where your CSV file is stored on your computer.

    import pandas as pd
    from pandas import DataFrame

    # Change this path to the location of the CSV file on your computer
    Stock_Market = pd.read_csv(r'C:\Users\Doron E\Desktop\Economy.csv')
    df = DataFrame(Stock_Market, columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])

    print(df)

You also need to make sure that the column names specified in the code exactly match the column names in the CSV file. Otherwise, the mismatched columns will be filled with NaN values.
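As a small illustration of that last point, the sketch below reads a two-row stand-in for the CSV from memory and requests one misspelled column. The `csv_text` string and the `'Interest Rate'` typo are made up for the example; with your own file you would pass its path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# A two-row stand-in for Economy.csv, held in memory for illustration only.
csv_text = (
    "Year,Month,Interest_Rate,Unemployment_Rate,Stock_Index_Price\n"
    "2017,12,2.75,5.3,1464\n"
    "2017,11,2.5,5.3,1394\n"
)
raw = pd.read_csv(io.StringIO(csv_text))

# 'Interest Rate' (with a space) does not match the CSV header 'Interest_Rate'.
requested = ['Year', 'Month', 'Interest Rate', 'Unemployment_Rate', 'Stock_Index_Price']
df_bad = pd.DataFrame(raw, columns=requested)

# The mismatched column exists in the result, but it holds only NaN values.
print(df_bad['Interest Rate'].isna().all())

# A quick way to spot such typos before building the DataFrame.
missing = [name for name in requested if name not in raw.columns]
print(missing)
```

Checking for missing names up front, as in the last two lines, tends to be easier than noticing NaN values later in the regression output.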

### Approach#2: Create the data-set in Python directly

Alternatively, you may create the data-set in Python without the need to import the data. This is how the code should look:

    from pandas import DataFrame

    Stock_Market = {'Year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
                    'Month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
                    'Interest_Rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
                    'Unemployment_Rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
                    'Stock_Index_Price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]
                    }

    df = DataFrame(Stock_Market, columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])

    print(df)

Each of these two approaches would work. Feel free to choose the one that you’re most comfortable with.
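Whichever approach you pick, a quick sanity check on the resulting DataFrame can catch copy-and-paste slips before any modelling. The snippet below is a minimal sketch that rebuilds the data-set with the second approach and confirms its shape and that no values are missing.

```python
from pandas import DataFrame

Stock_Market = {
    'Year': [2017] * 12 + [2016] * 12,
    'Month': [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1] * 2,
    'Interest_Rate': [2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2,
                      2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75],
    'Unemployment_Rate': [5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9,
                          6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1],
    'Stock_Index_Price': [1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075,
                          1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719],
}
df = DataFrame(Stock_Market)

n_rows, n_cols = df.shape                     # expect 24 monthly rows and 5 columns
n_missing = int(df.isna().sum().sum())        # total count of NaN cells

print(n_rows, n_cols)
print(n_missing)
print(df.dtypes)
```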

## Checking for Linearity

Before you execute a linear regression model, it is advisable to validate that certain assumptions are met.

As noted earlier, you may want to check that a linear relationship exists between the dependent variable and the independent variable/s.

In our example, you may want to check that a linear relationship exists between:

- The Stock_Index_Price (dependent variable) and the Interest_Rate (independent variable); and
- The Stock_Index_Price (dependent variable) and the Unemployment_Rate (independent variable)

To perform a quick linearity check, you can use scatter diagrams (utilizing the *matplotlib* library):

    from pandas import DataFrame
    import matplotlib.pyplot as plt

    Stock_Market = {'Year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
                    'Month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
                    'Interest_Rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
                    'Unemployment_Rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
                    'Stock_Index_Price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]
                    }

    df = DataFrame(Stock_Market, columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])

    # Stock index price against interest rate
    plt.scatter(df['Interest_Rate'], df['Stock_Index_Price'], color='red')
    plt.title('Stock Index Price Vs Interest Rate', fontsize=14)
    plt.xlabel('Interest Rate', fontsize=14)
    plt.ylabel('Stock Index Price', fontsize=14)
    plt.grid(True)
    plt.show()

    # Stock index price against unemployment rate
    plt.scatter(df['Unemployment_Rate'], df['Stock_Index_Price'], color='green')
    plt.title('Stock Index Price Vs Unemployment Rate', fontsize=14)
    plt.xlabel('Unemployment Rate', fontsize=14)
    plt.ylabel('Stock Index Price', fontsize=14)
    plt.grid(True)
    plt.show()

Once you run the code, you’ll get the following two diagrams:

As you can see, a linear relationship exists in both cases:

- In the first case, when interest rates go up, the stock index price also goes up
- In the second case, when unemployment rates go up, the stock index price goes down (here we still have a linear relationship, but with a negative slope)
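The scatter diagrams are a visual check. As a rough numeric complement, and only as a sketch on the same data, you could also look at the Pearson correlation between each input variable and the stock index price using the pandas `corr` method. A correlation close to +1 or -1 is consistent with a linear relationship, although it does not prove one, so it is best read alongside the plots.

```python
import pandas as pd

df = pd.DataFrame({
    'Interest_Rate': [2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2,
                      2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75],
    'Unemployment_Rate': [5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9,
                          6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1],
    'Stock_Index_Price': [1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075,
                          1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719],
})

# Pearson correlation of every column with the dependent variable.
corr_with_price = df.corr()['Stock_Index_Price']
print(corr_with_price.round(3))
```

The sign of each value should agree with the slopes described above: positive for the interest rate and negative for the unemployment rate.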

Next, we are going to perform the actual multiple linear regression in Python.

## Performing the Multiple Linear Regression

Once you have imported or created the data in Python, you can use both sklearn and statsmodels to get the regression results.

Either method would work, but I’ll show you both methods for illustration purposes.

You may then copy the below code into Python, before we dive into the results:

    from pandas import DataFrame
    from sklearn import linear_model
    import statsmodels.api as sm

    Stock_Market = {'Year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
                    'Month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
                    'Interest_Rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
                    'Unemployment_Rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
                    'Stock_Index_Price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]
                    }

    df = DataFrame(Stock_Market, columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])

    # Here we have 2 variables for multiple regression. If you just want to use one
    # variable for simple linear regression, then use X = df[['Interest_Rate']] for
    # example (the double brackets keep X two-dimensional, as sklearn expects).
    # Alternatively, you may add additional variables within the brackets.
    X = df[['Interest_Rate','Unemployment_Rate']]
    Y = df['Stock_Index_Price']

    # with sklearn
    regr = linear_model.LinearRegression()
    regr.fit(X, Y)

    print('Intercept: \n', regr.intercept_)
    print('Coefficients: \n', regr.coef_)

    # prediction with sklearn
    New_Interest_Rate = 2.75
    New_Unemployment_Rate = 5.3
    print('Predicted Stock Index Price: \n', regr.predict([[New_Interest_Rate, New_Unemployment_Rate]]))

    # with statsmodels
    X = sm.add_constant(X)  # adding a constant

    model = sm.OLS(Y, X).fit()
    predictions = model.predict(X)

    print_model = model.summary()
    print(print_model)

Once you run the code in Python, you’ll observe three parts:

**(1) The first part shows the output generated by** *sklearn*:

This output includes the intercept and coefficients. You can use this information to build the multiple linear regression equation as follows:

Stock_Index_Price = (Intercept) + (Interest_Rate coef) × X₁ + (Unemployment_Rate coef) × X₂

And once you plug the numbers:

Stock_Index_Price = (1798.4040) + (345.5401) × X₁ + (-250.1466) × X₂

**(2) The second part displays the predicted output by using** *sklearn*:

Imagine that you want to predict the stock index price after you collected the following data:

- Interest Rate = 2.75 (i.e., X₁ = 2.75)
- Unemployment Rate = 5.3 (i.e., X₂ = 5.3)

If you plug that data into the regression equation, you’ll get the exact same predicted results as displayed in the second part:

Stock_Index_Price = (1798.4040) + (345.5401) × (2.75) + (-250.1466) × (5.3) = 1422.86
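The same arithmetic can be checked in a few lines of plain Python. This is only a sketch: the rounded intercept and coefficients are copied from the equation above, and `predict_price` is a hypothetical helper name used for illustration.

```python
# Rounded values taken from the regression equation in this tutorial.
intercept = 1798.4040
coef_interest_rate = 345.5401
coef_unemployment_rate = -250.1466

def predict_price(interest_rate, unemployment_rate):
    """Evaluate the fitted regression equation for one pair of inputs."""
    return (intercept
            + coef_interest_rate * interest_rate
            + coef_unemployment_rate * unemployment_rate)

prediction = predict_price(2.75, 5.3)
print(round(prediction, 2))  # 1422.86
```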

**(3) The third part displays a comprehensive table with statistical info generated by** *statsmodels*.

This information can provide you with additional insights about the model used (such as the fit of the model, standard errors, etc.).

Notice that the coefficients captured in this table match the coefficients generated by sklearn.

That’s a good sign! We got consistent results by applying both sklearn and statsmodels.
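The agreement is expected, since both libraries fit the model by ordinary least squares. As an independent cross-check, and not a description of how either library is implemented internally, the sketch below solves the same least-squares problem directly with NumPy's `np.linalg.lstsq` and should land on the same intercept and coefficients.

```python
import numpy as np

interest_rate = np.array([2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2,
                          2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75])
unemployment_rate = np.array([5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9,
                              6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1])
stock_index_price = np.array([1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075,
                              1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719], dtype=float)

# Design matrix: a column of ones for the intercept, then the two input variables.
A = np.column_stack([np.ones_like(interest_rate), interest_rate, unemployment_rate])

# Least-squares solution of A @ beta = stock_index_price.
beta, _, _, _ = np.linalg.lstsq(A, stock_index_price, rcond=None)

# beta holds [intercept, Interest_Rate coef, Unemployment_Rate coef]
print(beta.round(4))
```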

Next, I’ll show you how to create a GUI in Python to gather input from users, and then display the prediction results.

## GUI used for the Multiple Linear Regression in Python

This is where the real fun begins!

Why not create a GUI that will allow users to input the independent variables in order to get the predicted result?

Some users may not be comfortable editing the data in the Python code itself, so it makes sense to give them a simple interface where they can enter the values instead.

You can even create a batch file to launch the Python program, and so the users will just need to double-click the batch file in order to display the GUI.

Here is the full Python code for your ultimate Regression GUI:

    from pandas import DataFrame
    from sklearn import linear_model
    import tkinter as tk
    import statsmodels.api as sm

    Stock_Market = {'Year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
                    'Month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
                    'Interest_Rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
                    'Unemployment_Rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
                    'Stock_Index_Price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]
                    }

    df = DataFrame(Stock_Market, columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])

    # Here we have 2 input variables for multiple regression. If you just want to use
    # one variable for simple linear regression, then use X = df[['Interest_Rate']] for
    # example (the double brackets keep X two-dimensional, as sklearn expects).
    # Alternatively, you may add additional variables within the brackets.
    X = df[['Interest_Rate','Unemployment_Rate']]
    Y = df['Stock_Index_Price']  # output variable (what we are trying to predict)

    # with sklearn
    regr = linear_model.LinearRegression()
    regr.fit(X, Y)

    print('Intercept: \n', regr.intercept_)
    print('Coefficients: \n', regr.coef_)

    # with statsmodels
    X = sm.add_constant(X)  # adding a constant

    model = sm.OLS(Y, X).fit()
    predictions = model.predict(X)

    # tkinter GUI
    root = tk.Tk()

    canvas1 = tk.Canvas(root, width=1200, height=450)
    canvas1.pack()

    # with sklearn
    Intercept_result = ('Intercept: ', regr.intercept_)
    label_Intercept = tk.Label(root, text=Intercept_result, justify='center')
    canvas1.create_window(260, 220, window=label_Intercept)

    # with sklearn
    Coefficients_result = ('Coefficients: ', regr.coef_)
    label_Coefficients = tk.Label(root, text=Coefficients_result, justify='center')
    canvas1.create_window(260, 240, window=label_Coefficients)

    # with statsmodels
    print_model = model.summary()
    label_model = tk.Label(root, text=print_model, justify='center', relief='solid', bg='LightSkyBlue1')
    canvas1.create_window(800, 220, window=label_model)

    # New_Interest_Rate label and input box
    label1 = tk.Label(root, text='Type Interest Rate: ')
    canvas1.create_window(100, 100, window=label1)

    entry1 = tk.Entry(root)  # create 1st entry box
    canvas1.create_window(270, 100, window=entry1)

    # New_Unemployment_Rate label and input box
    label2 = tk.Label(root, text=' Type Unemployment Rate: ')
    canvas1.create_window(120, 120, window=label2)

    entry2 = tk.Entry(root)  # create 2nd entry box
    canvas1.create_window(270, 120, window=entry2)

    def values():
        global New_Interest_Rate  # our 1st input variable
        New_Interest_Rate = float(entry1.get())

        global New_Unemployment_Rate  # our 2nd input variable
        New_Unemployment_Rate = float(entry2.get())

        Prediction_result = ('Predicted Stock Index Price: ', regr.predict([[New_Interest_Rate, New_Unemployment_Rate]]))
        label_Prediction = tk.Label(root, text=Prediction_result, bg='orange')
        canvas1.create_window(260, 280, window=label_Prediction)

    button1 = tk.Button(root, text='Predict Stock Index Price', command=values, bg='orange')  # button to call the 'values' command above
    canvas1.create_window(270, 150, window=button1)

    root.mainloop()

And when you run the code, you’ll see this GUI:

The *left-hand side* of the GUI displays the output generated by sklearn:

- It includes 2 input boxes, so that the user may type values for the interest and unemployment rates to get the predicted result
- It also includes the intercept and coefficients generated by sklearn

The *right-hand side* of the GUI, in turn, displays the output generated by statsmodels.

Recall that earlier we made a prediction by using the following input:

- Interest Rate = 2.75
- Unemployment Rate = 5.3

Type those values in the input boxes, and then click on the ‘Predict Stock Index Price’ button.

You’ll now see the predicted result of 1422.86, which matches the value we saw before.
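One thing to keep in mind: the `values()` callback in the GUI code converts the text of each entry box with `float()`, which raises a `ValueError` if a box is empty or contains non-numeric text. A minimal sketch of a guard is shown below. The `parse_rate` helper is a hypothetical name introduced here for illustration, not part of the GUI code above; inside `values()` you could call it on `entry1.get()` and `entry2.get()` and skip the prediction when it returns `None`.

```python
def parse_rate(text):
    """Return the entry text as a float, or None if it is empty or not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return None

# Stand-ins for what entry1.get() / entry2.get() might return.
print(parse_rate('2.75'))
print(parse_rate(' 5.3 '))
print(parse_rate(''))
print(parse_rate('abc'))
```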

## Yet another GUI

To put it gently, the GUI that we just saw is not the most appealing one.

And so, in this section, I’ll share the code that will contain:

- the output generated by sklearn; and
- the scatter diagrams that we used earlier to check for linearity

And here is the full Python code:

    from pandas import DataFrame
    from sklearn import linear_model
    import tkinter as tk
    import matplotlib.pyplot as plt
    from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg

    Stock_Market = {'Year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
                    'Month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
                    'Interest_Rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
                    'Unemployment_Rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
                    'Stock_Index_Price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]
                    }

    df = DataFrame(Stock_Market, columns=['Year','Month','Interest_Rate','Unemployment_Rate','Stock_Index_Price'])

    # Here we have 2 input variables for multiple regression. If you just want to use
    # one variable for simple linear regression, then use X = df[['Interest_Rate']] for
    # example (the double brackets keep X two-dimensional, as sklearn expects).
    # Alternatively, you may add additional variables within the brackets.
    X = df[['Interest_Rate','Unemployment_Rate']].astype(float)
    Y = df['Stock_Index_Price'].astype(float)  # output variable (what we are trying to predict)

    # with sklearn
    regr = linear_model.LinearRegression()
    regr.fit(X, Y)

    print('Intercept: \n', regr.intercept_)
    print('Coefficients: \n', regr.coef_)

    # tkinter GUI
    root = tk.Tk()

    canvas1 = tk.Canvas(root, width=500, height=300)
    canvas1.pack()

    # with sklearn
    Intercept_result = ('Intercept: ', regr.intercept_)
    label_Intercept = tk.Label(root, text=Intercept_result, justify='center')
    canvas1.create_window(260, 220, window=label_Intercept)

    # with sklearn
    Coefficients_result = ('Coefficients: ', regr.coef_)
    label_Coefficients = tk.Label(root, text=Coefficients_result, justify='center')
    canvas1.create_window(260, 240, window=label_Coefficients)

    # New_Interest_Rate label and input box
    label1 = tk.Label(root, text='Type Interest Rate: ')
    canvas1.create_window(100, 100, window=label1)

    entry1 = tk.Entry(root)  # create 1st entry box
    canvas1.create_window(270, 100, window=entry1)

    # New_Unemployment_Rate label and input box
    label2 = tk.Label(root, text=' Type Unemployment Rate: ')
    canvas1.create_window(120, 120, window=label2)

    entry2 = tk.Entry(root)  # create 2nd entry box
    canvas1.create_window(270, 120, window=entry2)

    def values():
        global New_Interest_Rate  # our 1st input variable
        New_Interest_Rate = float(entry1.get())

        global New_Unemployment_Rate  # our 2nd input variable
        New_Unemployment_Rate = float(entry2.get())

        Prediction_result = ('Predicted Stock Index Price: ', regr.predict([[New_Interest_Rate, New_Unemployment_Rate]]))
        label_Prediction = tk.Label(root, text=Prediction_result, bg='orange')
        canvas1.create_window(260, 280, window=label_Prediction)

    button1 = tk.Button(root, text='Predict Stock Index Price', command=values, bg='orange')  # button to call the 'values' command above
    canvas1.create_window(270, 150, window=button1)

    # plot 1st scatter
    figure3 = plt.Figure(figsize=(5,4), dpi=100)
    ax3 = figure3.add_subplot(111)
    ax3.scatter(df['Interest_Rate'].astype(float), df['Stock_Index_Price'].astype(float), color='r')
    scatter3 = FigureCanvasTkAgg(figure3, root)
    scatter3.get_tk_widget().pack(side=tk.RIGHT, fill=tk.BOTH)
    ax3.legend()
    ax3.set_xlabel('Interest Rate')
    ax3.set_title('Interest Rate Vs. Stock Index Price')

    # plot 2nd scatter
    figure4 = plt.Figure(figsize=(5,4), dpi=100)
    ax4 = figure4.add_subplot(111)
    ax4.scatter(df['Unemployment_Rate'].astype(float), df['Stock_Index_Price'].astype(float), color='g')
    scatter4 = FigureCanvasTkAgg(figure4, root)
    scatter4.get_tk_widget().pack(side=tk.RIGHT, fill=tk.BOTH)
    ax4.legend()
    ax4.set_xlabel('Unemployment_Rate')
    ax4.set_title('Unemployment_Rate Vs. Stock Index Price')

    root.mainloop()

Once you run the code, you’ll get the GUI below:

You may want to check the following source to learn more about embedding charts on a tkinter GUI.

## Conclusion

Linear regression is often used in Machine Learning. We have seen some examples of how to perform multiple linear regression in Python using both sklearn and statsmodels.

Before applying linear regression models, make sure to check that a *linear* relationship exists between the dependent variable (i.e., what you are trying to predict) and the independent variable/s (i.e., the input variable/s).