Linear Regression in Python using Statsmodels

In this short guide, you’ll see how to perform a linear regression in Python using statsmodels.

Here are the topics to be reviewed:

  • Background about linear regression
  • Review of an example with the full dataset
  • Review of the Python code
  • Interpretation of the regression results

Background

Linear regression models assume a linear relationship between the dependent variable (the variable you are trying to predict or estimate) and one or more independent variables (the input variables used in the prediction).

Under a Simple Linear Regression, only one independent/input variable is used to predict the dependent variable. It has the following structure:

Y = C + M*X

  • Y = Dependent variable (output/outcome/prediction/estimation)
  • C = Constant (Y-Intercept)
  • M = Slope of the regression line (the effect that X has on Y)
  • X = Independent variable (input variable used in the prediction of Y)
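To make C and M concrete, here is a minimal sketch of how ordinary least squares estimates them for a single input variable: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. The toy data is hypothetical and chosen to lie exactly on a line. In practice statsmodels performs this estimation for you.

```python
from statistics import mean

# Hypothetical toy data that lies exactly on the line Y = 1 + 2*X
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]

x_bar, y_bar = mean(x), mean(y)

# Slope M: covariance of X and Y divided by the variance of X
m = (
    sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    / sum((xi - x_bar) ** 2 for xi in x)
)

# Constant C: the fitted line passes through the point of means
c = y_bar - m * x_bar

print(c, m)  # 1.0 2.0
```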

In reality, a relationship may exist between the dependent variable and multiple independent variables. For these types of models (assuming linearity), you may use Multiple Linear Regression with the following structure:

Y = C + M1*X1 + M2*X2 + …
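The multiple regression structure is simply the constant plus a sum of slope-times-input terms. The sketch below illustrates this with a hypothetical helper named predict and made-up coefficients. It is not part of statsmodels.

```python
def predict(constant, slopes, inputs):
    """Return C + M1*X1 + M2*X2 + ... for one observation."""
    return constant + sum(m * x for m, x in zip(slopes, inputs))

# Hypothetical values: C = 10, M1 = 2, M2 = -3, X1 = 4, X2 = 1
y_hat = predict(10.0, [2.0, -3.0], [4.0, 1.0])
print(y_hat)  # 15.0
```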

The Example

For illustration purposes, suppose that you have a fictitious economy, where the index_price is the dependent variable, and the 2 independent/input variables are:

  • interest_rate
  • unemployment_rate

Use a pandas DataFrame to capture the data in Python:

import pandas as pd

data = {"year": [2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016],
"month": [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
"interest_rate": [2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2, 2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75],
"unemployment_rate": [5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9, 6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1],
"index_price": [1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075, 1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719]
}

df = pd.DataFrame(data)

print(df)

Here is the full dataset:

    year  month  interest_rate  unemployment_rate  index_price
0   2017     12           2.75                5.3         1464
1   2017     11           2.50                5.3         1394
2   2017     10           2.50                5.3         1357
3   2017      9           2.50                5.3         1293
4   2017      8           2.50                5.4         1256
5   2017      7           2.50                5.6         1254
6   2017      6           2.50                5.5         1234
7   2017      5           2.25                5.5         1195
8   2017      4           2.25                5.5         1159
9   2017      3           2.25                5.6         1167
10  2017      2           2.00                5.7         1130
11  2017      1           2.00                5.9         1075
12  2016     12           2.00                6.0         1047
13  2016     11           1.75                5.9          965
14  2016     10           1.75                5.8          943
15  2016      9           1.75                6.1          958
16  2016      8           1.75                6.2          971
17  2016      7           1.75                6.1          949
18  2016      6           1.75                6.1          884
19  2016      5           1.75                6.1          866
20  2016      4           1.75                5.9          876
21  2016      3           1.75                6.2          822
22  2016      2           1.75                6.2          704
23  2016      1           1.75                6.1          719
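Since linear regression assumes a linear relationship, it can be worth a quick look at how each input variable moves with index_price before fitting a model. The sketch below is one possible check (not a required step): it computes the Pearson correlation of each input with the dependent variable, using only the three relevant columns of the dataset above.

```python
import pandas as pd

# Same values as the dataset above (year and month omitted for brevity)
df = pd.DataFrame({
    "interest_rate": [2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2, 2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75],
    "unemployment_rate": [5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9, 6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1],
    "index_price": [1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075, 1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719],
})

# Pearson correlation of each input variable with the dependent variable
corr = df.corr()["index_price"].drop("index_price")
print(corr)
```

In this dataset, index_price rises with interest_rate and falls as unemployment_rate rises, so you would expect a positive and a negative correlation respectively.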

The Python Code using Statsmodels

First, install the statsmodels package:

pip install statsmodels

Here is the complete code to perform the linear regression in Python using statsmodels:

import pandas as pd
import statsmodels.api as sm

data = {"year": [2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016],
"month": [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
"interest_rate": [2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2, 2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75],
"unemployment_rate": [5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9, 6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1],
"index_price": [1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075, 1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719]
}

df = pd.DataFrame(data)

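# Independent (input) variables and the dependent variable
x = df[["interest_rate", "unemployment_rate"]]
y = df["index_price"]

# sm.OLS does not add an intercept by default, so add a "const" column
x = sm.add_constant(x)

# Fit ordinary least squares (note the argument order: y first, then x)
model = sm.OLS(y, x).fit()

# Fitted values for the rows used to fit the model
predictions = model.predict(x)

# Print the full regression summary
summary = model.summary()
print(summary)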

The results:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:            index_price   R-squared:                       0.898
Model:                            OLS   Adj. R-squared:                  0.888
Method:                 Least Squares   F-statistic:                     92.07
Date:                Sat, 30 Jul 2022   Prob (F-statistic):           4.04e-11
Time:                        13:24:29   Log-Likelihood:                -134.61
No. Observations:                  24   AIC:                             275.2
Df Residuals:                      21   BIC:                             278.8
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const              1798.4040    899.248      2.000      0.059     -71.685    3668.493
interest_rate       345.5401    111.367      3.103      0.005     113.940     577.140
unemployment_rate  -250.1466    117.950     -2.121      0.046    -495.437      -4.856
==============================================================================
Omnibus:                        2.691   Durbin-Watson:                   0.530
Prob(Omnibus):                  0.260   Jarque-Bera (JB):                1.551
Skew:                          -0.612   Prob(JB):                        0.461
Kurtosis:                       3.226   Cond. No.                         394.
==============================================================================

Interpreting the Regression Results

Several important components of the results above are worth a closer look:

  1. Adj. R-squared reflects the fit of the model, adjusted for the number of independent variables. R-squared values range from 0 to 1, where a higher value generally indicates a better fit, assuming certain conditions are met. Here it is 0.888.
  2. The const coefficient is your Y-intercept. It means that if both interest_rate and unemployment_rate are zero, then the expected output (i.e., the Y) would be equal to the const coefficient (1798.4040).
  3. The interest_rate coefficient represents the change in the output Y due to a change of one unit in the interest rate (everything else held constant). Here, a one-unit increase in interest_rate is associated with an increase of about 345.54 in index_price.
  4. The unemployment_rate coefficient represents the change in the output Y due to a change of one unit in the unemployment rate (everything else held constant). Here, a one-unit increase in unemployment_rate is associated with a decrease of about 250.15 in index_price.
  5. std err is the standard error of each coefficient estimate. The lower it is relative to the coefficient, the more precisely the coefficient is estimated.
  6. P>|t| is your p-value. A p-value of less than 0.05 is conventionally considered statistically significant.
  7. The confidence interval (the [0.025 and 0.975] columns) represents the range within which the true coefficient is likely to fall, at the 95% confidence level.
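The arithmetic behind these components can be checked by hand. The sketch below copies the coefficients and standard errors from the summary table, computes the predicted index_price for the first row of the dataset (interest_rate of 2.75 and unemployment_rate of 5.3, with an actual index_price of 1464), and then reproduces the t column and the 95% confidence interval for interest_rate. The critical value 2.0796 is an assumption taken from a standard t-table for 21 degrees of freedom (the Df Residuals shown in the summary).

```python
# Coefficients and standard errors copied from the summary table
const = 1798.4040
b_interest, se_interest = 345.5401, 111.367
b_unemployment, se_unemployment = -250.1466, 117.950

# Predicted index_price for the first row of the dataset
interest_rate, unemployment_rate = 2.75, 5.3
predicted = const + b_interest * interest_rate + b_unemployment * unemployment_rate
print(round(predicted, 2))  # 1422.86

# The t column is the coefficient divided by its standard error
t_interest = b_interest / se_interest
t_unemployment = b_unemployment / se_unemployment
print(round(t_interest, 3), round(t_unemployment, 3))  # 3.103 -2.121

# 95% confidence interval: coefficient +/- critical t value * std err
T_CRIT = 2.0796  # assumed two-sided 95% value for 21 degrees of freedom
lower = b_interest - T_CRIT * se_interest
upper = b_interest + T_CRIT * se_interest
print(lower, upper)  # close to the 113.940 and 577.140 shown in the summary
```

The prediction of about 1422.86 differs from the actual value of 1464 by roughly 41, which is the residual for that row.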

Check the following tutorial that includes an example of multiple linear regression using both sklearn and statsmodels.

For further information about statsmodels, please refer to the statsmodels documentation.
