In this guide, you’ll see how to perform a linear regression in Python using statsmodels.
Here are the topics to be reviewed:
- Background about linear regression
- Review of an example with the full dataset
- Review of the Python code
- Interpretation of the regression results
About Linear Regression
Linear regression is used as a predictive model that assumes a linear relationship between the dependent variable (which is the variable we are trying to predict/estimate) and the independent variable/s (input variable/s used in the prediction).
Under Simple Linear Regression, only one independent/input variable is used to predict the dependent variable. It has the following structure:
Y = C + M*X
- Y = Dependent variable (output/outcome/prediction/estimation)
- C = Constant (Y-Intercept)
- M = Slope of the regression line (the effect that X has on Y)
- X = Independent variable (input variable used in the prediction of Y)
In reality, a relationship may exist between the dependent variable and multiple independent variables. For these types of models (assuming linearity), we can use Multiple Linear Regression with the following structure:
Y = C + M1*X1 + M2*X2 + …
An Example (with the full dataset)
For illustration purposes, let’s suppose that you have a fictitious economy with the following parameters, where the index_price is the dependent variable, and the 2 independent/input variables are:
- interest_rate
- unemployment_rate
We will use Pandas DataFrame to capture the data in Python:
import pandas as pd data = {'year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016], 'month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1], 'interest_rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75], 'unemployment_rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1], 'index_price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719] } df = pd.DataFrame(data) print(df)
Here is the full dataset:
year month interest_rate unemployment_rate index_price
0 2017 12 2.75 5.3 1464
1 2017 11 2.50 5.3 1394
2 2017 10 2.50 5.3 1357
3 2017 9 2.50 5.3 1293
4 2017 8 2.50 5.4 1256
5 2017 7 2.50 5.6 1254
6 2017 6 2.50 5.5 1234
7 2017 5 2.25 5.5 1195
8 2017 4 2.25 5.5 1159
9 2017 3 2.25 5.6 1167
10 2017 2 2.00 5.7 1130
11 2017 1 2.00 5.9 1075
12 2016 12 2.00 6.0 1047
13 2016 11 1.75 5.9 965
14 2016 10 1.75 5.8 943
15 2016 9 1.75 6.1 958
16 2016 8 1.75 6.2 971
17 2016 7 1.75 6.1 949
18 2016 6 1.75 6.1 884
19 2016 5 1.75 6.1 866
20 2016 4 1.75 5.9 876
21 2016 3 1.75 6.2 822
22 2016 2 1.75 6.2 704
23 2016 1 1.75 6.1 719
The Python Code using Statsmodels
Now let’s apply the following syntax to perform the linear regression in Python using statsmodels:
import pandas as pd import statsmodels.api as sm data = {'year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016], 'month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1], 'interest_rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75], 'unemployment_rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1], 'index_price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719] } df = pd.DataFrame(data) x = df[['interest_rate','unemployment_rate']] y = df['index_price'] x = sm.add_constant(x) model = sm.OLS(y, x).fit() predictions = model.predict(x) print_model = model.summary() print(print_model)
This is the result that you’ll get once you run the code in Python:
OLS Regression Results
==============================================================================
Dep. Variable: index_price R-squared: 0.898
Model: OLS Adj. R-squared: 0.888
Method: Least Squares F-statistic: 92.07
Date: Sat, 30 Jul 2022 Prob (F-statistic): 4.04e-11
Time: 13:24:29 Log-Likelihood: -134.61
No. Observations: 24 AIC: 275.2
Df Residuals: 21 BIC: 278.8
Df Model: 2
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
const 1798.4040 899.248 2.000 0.059 -71.685 3668.493
interest_rate 345.5401 111.367 3.103 0.005 113.940 577.140
unemployment_rate -250.1466 117.950 -2.121 0.046 -495.437 -4.856
==============================================================================
Omnibus: 2.691 Durbin-Watson: 0.530
Prob(Omnibus): 0.260 Jarque-Bera (JB): 1.551
Skew: -0.612 Prob(JB): 0.461
Kurtosis: 3.226 Cond. No. 394.
==============================================================================
Interpreting the Regression Results
Highlighted (in yellow above) several important components within the results:
- Adjusted. R-squared reflects the fit of the model. R-squared values range from 0 to 1, where a higher value generally indicates a better fit, assuming certain conditions are met.
- const coefficient is your Y-intercept. It means that if both the interest_rate and unemployment_rate coefficients are zero, then the expected output (i.e., the Y) would be equal to the const coefficient.
- interest_rate coefficient represents the change in the output Y due to a change of one unit in the interest rate (everything else held constant)
- unemployment_rate coefficient represents the change in the output Y due to a change of one unit in the unemployment rate (everything else held constant)
- std err reflects the level of accuracy of the coefficients. The lower it is, the higher is the level of accuracy
- P >|t| is your p-value. A p-value of less than 0.05 is considered to be statistically significant
- Confidence Interval represents the range in which our coefficients are likely to fall (with a likelihood of 95%)
You may want to check the following tutorial that includes an example of multiple linear regression using both sklearn and statsmodels.
For further information about statsmodels, please refer to the statsmodels documentation.