In this guide, you’ll see how to perform multiple linear regression in Python using both sklearn and statsmodels.
The Example
In the following example, you’ll see how to perform multiple linear regression for a fictitious economy, where the index_price is the dependent variable, and the 2 independent/input variables are:
- interest_rate
- unemployment_rate
Please note that you’ll have to validate that a linear relationship exists between the dependent variable and the independent variable/s (more on that under the checking for linearity section).
To start, here is the full dataset that includes the dependent and independent variables:
import pandas as pd
data = {"year": [2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016],
"month": [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
"interest_rate": [2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2, 2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75],
"unemployment_rate": [5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9, 6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1],
"index_price": [1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075, 1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719]
}
df = pd.DataFrame(data)
print(df)
The result:
year month interest_rate unemployment_rate index_price
0 2017 12 2.75 5.3 1464
1 2017 11 2.50 5.3 1394
2 2017 10 2.50 5.3 1357
3 2017 9 2.50 5.3 1293
4 2017 8 2.50 5.4 1256
5 2017 7 2.50 5.6 1254
6 2017 6 2.50 5.5 1234
7 2017 5 2.25 5.5 1195
8 2017 4 2.25 5.5 1159
9 2017 3 2.25 5.6 1167
10 2017 2 2.00 5.7 1130
11 2017 1 2.00 5.9 1075
12 2016 12 2.00 6.0 1047
13 2016 11 1.75 5.9 965
14 2016 10 1.75 5.8 943
15 2016 9 1.75 6.1 958
16 2016 8 1.75 6.2 971
17 2016 7 1.75 6.1 949
18 2016 6 1.75 6.1 884
19 2016 5 1.75 6.1 866
20 2016 4 1.75 5.9 876
21 2016 3 1.75 6.2 822
22 2016 2 1.75 6.2 704
23 2016 1 1.75 6.1 719
Checking for Linearity
Next, check that a linear relationship exists between the:
- index_price (dependent variable) and interest_rate (independent variable)
- index_price (dependent variable) and unemployment_rate (independent variable)
You can use scatter diagrams to perform a quick linearity check (utilizing the Matplotlib library).
To plot the relationship between the index_price and the interest_rate:
import pandas as pd
import matplotlib.pyplot as plt
data = {"year": [2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016],
"month": [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
"interest_rate": [2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2, 2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75],
"unemployment_rate": [5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9, 6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1],
"index_price": [1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075, 1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719]
}
df = pd.DataFrame(data)
plt.scatter(df["interest_rate"], df["index_price"], color="red")
plt.title("Index Price Vs Interest Rate", fontsize=14)
plt.xlabel("Interest Rate", fontsize=14)
plt.ylabel("Index Price", fontsize=14)
plt.grid(True)
plt.show()
Notice that indeed a linear relationship exists between the index_price and the interest_rate. Specifically, when interest rates go up, the index price also goes up.
To plot the relationship between the index_price and the unemployment_rate:
import pandas as pd
import matplotlib.pyplot as plt
data = {"year": [2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016],
"month": [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
"interest_rate": [2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2, 2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75],
"unemployment_rate": [5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9, 6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1],
"index_price": [1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075, 1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719]
}
df = pd.DataFrame(data)
plt.scatter(df["unemployment_rate"], df["index_price"], color="green")
plt.title("Index Price Vs Unemployment Rate", fontsize=14)
plt.xlabel("Unemployment Rate", fontsize=14)
plt.ylabel("Index Price", fontsize=14)
plt.grid(True)
plt.show()
You’ll notice that a linear relationship also exists between the index_price and the unemployment_rate – when the unemployment rates go up, the index price goes down (here you still have a linear relationship, but with a negative slope).
Performing the Multiple Linear Regression in Python
Once you added the data into Python, you may use either sklearn or statsmodels to get the regression results. Either method would work, but you’ll see both methods for illustration purposes.
First, install the sklearn package:
pip install scikit-learn
Then, install the statsmodels package:
pip install statsmodels
Here is the complete code to perform the multiple linear regression in Python:
import pandas as pd
from sklearn import linear_model
import statsmodels.api as sm
data = {"year": [2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016],
"month": [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
"interest_rate": [2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2, 2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75],
"unemployment_rate": [5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9, 6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1],
"index_price": [1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075, 1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719]
}
df = pd.DataFrame(data)
x = df[["interest_rate", "unemployment_rate"]]
y = df["index_price"]
# Using sklearn
regression = linear_model.LinearRegression()
regression.fit(x, y)
predictions_sklearn = regression.predict(x)
print("Intercept: \n", regression.intercept_)
print("Coefficients: \n", regression.coef_)
# Using statsmodels
x = sm.add_constant(x)
model = sm.OLS(y, x).fit()
predictions_statsmodels = model.predict(x)
summary = model.summary()
print(summary)
Once you run the code in Python, you’ll observe two parts:
(1) The first part shows the output generated by sklearn:
Intercept:
1798.4039776258564
Coefficients:
[ 345.54008701 -250.14657137]
This output includes the intercept and coefficients. You can use this information to build the multiple linear regression equation as follows:
index_price = (intercept) + (interest_rate coef)*X1 + (unemployment_rate coef)*X2
And once you plug the numbers:
index_price = (1798.4040) + (345.5401)*X1 + (-250.1466)*X2
(2) The second part displays a comprehensive table with statistical info generated by statsmodels.
This information can provide additional insights about the model used (such as the fit of the model, standard errors, etc):
OLS Regression Results
==============================================================================
Dep. Variable: index_price R-squared: 0.898
Model: OLS Adj. R-squared: 0.888
Method: Least Squares F-statistic: 92.07
Date: Sat, 30 Jul 2022 Prob (F-statistic): 4.04e-11
Time: 13:47:01 Log-Likelihood: -134.61
No. Observations: 24 AIC: 275.2
Df Residuals: 21 BIC: 278.8
Df Model: 2
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
const 1798.4040 899.248 2.000 0.059 -71.685 3668.493
interest_rate 345.5401 111.367 3.103 0.005 113.940 577.140
unemployment_rate -250.1466 117.950 -2.121 0.046 -495.437 -4.856
==============================================================================
Omnibus: 2.691 Durbin-Watson: 0.530
Prob(Omnibus): 0.260 Jarque-Bera (JB): 1.551
Skew: -0.612 Prob(JB): 0.461
Kurtosis: 3.226 Cond. No. 394.
==============================================================================
Notice that the coefficients captured in this table (highlighted in yellow) match with the coefficients generated by sklearn.
You got consistent results by applying both sklearn and statsmodels.
Conclusion
Linear regression is often used in Machine Learning. You have seen some examples of how to perform multiple linear regression in Python using both sklearn and statsmodels.
Before applying linear regression models, make sure to check that a linear relationship exists between the dependent variable (i.e., what you are trying to predict) and the independent variable/s (i.e., the input variable/s).