In this short guide, you’ll see how to perform a linear regression in Python using *statsmodels*.

Here are the topics to be reviewed:

- Background about linear regression
- Review of an example with the full dataset
- Review of the Python code
- Interpretation of the regression results

## Background

Linear regression models assume a *linear* relationship between the **dependent variable** (which is the variable you are trying to predict/estimate) and the **independent variable/s** (the input variable/s used in the prediction).

Under a **Simple Linear Regression**, only *one* independent/input variable is used to predict the dependent variable. It has the following structure:

$$Y = C + M \cdot X$$

- **Y** = Dependent variable (output/outcome/prediction/estimation)
- **C** = Constant (Y-intercept)
- **M** = Slope of the regression line (the effect that X has on Y)
- **X** = Independent variable (input variable used in the prediction of Y)
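As a minimal sketch of this structure, the slope and intercept of a simple linear regression can be computed in closed form with NumPy. The toy data below is hypothetical (it is not part of the example used in this guide) and lies exactly on the line `Y = 2 + 3*X`, so the recovered values are easy to check by hand.

```python
import numpy as np

# Hypothetical toy data lying exactly on the line Y = 2 + 3*X
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Least-squares estimates for a single input variable:
#   M = sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))**2)
#   C = mean(Y) - M * mean(X)
x_centered = x - x.mean()
m = (x_centered * (y - y.mean())).sum() / (x_centered ** 2).sum()
c = y.mean() - m * x.mean()

print(f"M (slope) = {m:.2f}, C (intercept) = {c:.2f}")
# prints: M (slope) = 3.00, C (intercept) = 2.00
```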

In reality, a relationship may exist between the dependent variable and *multiple* independent variables. For these types of models (assuming linearity), you may use Multiple Linear Regression with the following structure:

$$Y = C + M_1 \cdot X_1 + M_2 \cdot X_2 + \dots$$
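The same idea extends to several inputs. The sketch below, again on hypothetical toy data, builds a design matrix with a column of ones for the constant C and solves the least-squares problem with `numpy.linalg.lstsq`. It is only meant to illustrate the structure of the model; it does not report any of the diagnostic statistics that a dedicated regression library provides.

```python
import numpy as np

# Hypothetical toy data generated from Y = 1 + 2*X1 - 3*X2 (no noise)
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 1.0 + 2.0 * x1 - 3.0 * x2

# Design matrix: a column of ones (for C) followed by X1 and X2
design = np.column_stack([np.ones_like(x1), x1, x2])

# Solve for [C, M1, M2] in the least-squares sense
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
c, m1, m2 = coefs

print(np.round(coefs, 6))  # close to C = 1, M1 = 2, M2 = -3
```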

## The Example

For illustration purposes, suppose that you have a fictitious economy, where the index_price is the dependent variable, and the 2 independent/input variables are:

- interest_rate
- unemployment_rate

Use a pandas DataFrame to capture the data in Python:

```python
import pandas as pd

data = {
    "year": [2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016],
    "month": [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
    "interest_rate": [2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2, 2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75],
    "unemployment_rate": [5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9, 6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1],
    "index_price": [1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075, 1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719],
}

df = pd.DataFrame(data)

print(df)
```

Here is the full dataset:

```
    year  month  interest_rate  unemployment_rate  index_price
0   2017     12           2.75                5.3         1464
1   2017     11           2.50                5.3         1394
2   2017     10           2.50                5.3         1357
3   2017      9           2.50                5.3         1293
4   2017      8           2.50                5.4         1256
5   2017      7           2.50                5.6         1254
6   2017      6           2.50                5.5         1234
7   2017      5           2.25                5.5         1195
8   2017      4           2.25                5.5         1159
9   2017      3           2.25                5.6         1167
10  2017      2           2.00                5.7         1130
11  2017      1           2.00                5.9         1075
12  2016     12           2.00                6.0         1047
13  2016     11           1.75                5.9          965
14  2016     10           1.75                5.8          943
15  2016      9           1.75                6.1          958
16  2016      8           1.75                6.2          971
17  2016      7           1.75                6.1          949
18  2016      6           1.75                6.1          884
19  2016      5           1.75                6.1          866
20  2016      4           1.75                5.9          876
21  2016      3           1.75                6.2          822
22  2016      2           1.75                6.2          704
23  2016      1           1.75                6.1          719
```
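Because linear regression assumes a roughly linear relationship between each input and the output, a quick sanity check before fitting is to look at the pairwise correlations. The sketch below rebuilds the relevant columns of the dataset and calls `DataFrame.corr`. This check is an optional addition to the walkthrough, not a required step.

```python
import pandas as pd

# Same values as the dataset above (year and month omitted for brevity)
data = {
    "interest_rate": [2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2,
                      2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75],
    "unemployment_rate": [5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9,
                          6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1],
    "index_price": [1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075,
                    1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719],
}
df = pd.DataFrame(data)

# Pearson correlation between every pair of columns
corr = df.corr()
print(corr["index_price"])
```

A strongly positive value for interest_rate and a strongly negative value for unemployment_rate suggest that a linear model is a reasonable starting point for this data.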

## The Python Code using Statsmodels

First, install the statsmodels package:

```
pip install statsmodels
```

Here is the complete code to perform the linear regression in Python using statsmodels:

```python
import pandas as pd
import statsmodels.api as sm

data = {
    "year": [2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016],
    "month": [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
    "interest_rate": [2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2, 2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75],
    "unemployment_rate": [5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9, 6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1],
    "index_price": [1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075, 1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719],
}

df = pd.DataFrame(data)

# Independent (input) variables and the dependent (output) variable
x = df[["interest_rate", "unemployment_rate"]]
y = df["index_price"]

# Add a column of ones so that the model estimates the intercept (const)
x = sm.add_constant(x)

# Fit an ordinary least squares model
model = sm.OLS(y, x).fit()

# Fitted values for the observations used to train the model
predictions = model.predict(x)

summary = model.summary()
print(summary)
```

The results:

```
                            OLS Regression Results
==============================================================================
Dep. Variable:            index_price   R-squared:                       0.898
Model:                            OLS   Adj. R-squared:                  0.888
Method:                 Least Squares   F-statistic:                     92.07
Date:                Sat, 30 Jul 2022   Prob (F-statistic):           4.04e-11
Time:                        13:24:29   Log-Likelihood:                -134.61
No. Observations:                  24   AIC:                             275.2
Df Residuals:                      21   BIC:                             278.8
Df Model:                           2
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const              1798.4040    899.248      2.000      0.059     -71.685    3668.493
interest_rate       345.5401    111.367      3.103      0.005     113.940     577.140
unemployment_rate  -250.1466    117.950     -2.121      0.046    -495.437      -4.856
==============================================================================
Omnibus:                        2.691   Durbin-Watson:                   0.530
Prob(Omnibus):                  0.260   Jarque-Bera (JB):                1.551
Skew:                          -0.612   Prob(JB):                        0.461
Kurtosis:                       3.226   Cond. No.                         394.
==============================================================================
```

## Interpreting the Regression Results

Several components of the results are worth highlighting:

- **Adj. R-squared** reflects the fit of the model, adjusted for the number of input variables. R-squared values range from 0 to 1, where a higher value generally indicates a better fit, assuming certain conditions are met.
- **const coefficient** is your Y-intercept. It means that if both interest_rate and unemployment_rate are zero, then the expected output (i.e., the Y) would be equal to the const coefficient.
- **interest_rate coefficient** represents the change in the output Y due to a change of one unit in the interest rate (everything else held constant).
- **unemployment_rate coefficient** represents the change in the output Y due to a change of one unit in the unemployment rate (everything else held constant).
- **std err** is the standard error of each coefficient estimate. The lower it is, the more precise the estimate.
- **P>|t|** is your *p-value*. A p-value of less than 0.05 is conventionally considered statistically significant.
- **Confidence Interval** (the `[0.025` and `0.975]` columns) represents the range in which the true value of each coefficient is likely to fall, at a 95% confidence level.
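To make the relationship between these columns concrete, the following sketch recomputes the t statistic, the p-value, and the 95% confidence interval for the interest_rate coefficient, as well as the adjusted R-squared, from other numbers in the summary table. It assumes SciPy is available (it is normally installed alongside statsmodels) and uses the standard textbook formulas. The inputs are the rounded values printed in the table, so the results agree with it only up to rounding.

```python
from scipy import stats

# Values copied from the summary table (rounded as printed there)
coef = 345.5401      # interest_rate coefficient
std_err = 111.367    # its standard error
n_obs = 24           # No. Observations
n_inputs = 2         # Df Model
r_squared = 0.898    # R-squared

df_resid = n_obs - n_inputs - 1  # Df Residuals = 21

# t statistic: coefficient divided by its standard error
t_stat = coef / std_err

# Two-sided p-value from the t distribution
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)

# 95% confidence interval: coef +/- critical t value * std err
t_crit = stats.t.ppf(0.975, df_resid)
ci_lower = coef - t_crit * std_err
ci_upper = coef + t_crit * std_err

# Adjusted R-squared penalizes R-squared for the number of inputs
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print(f"95% CI = [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"Adj. R-squared = {adj_r_squared:.3f}")
```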

Check the following tutorial that includes an example of multiple linear regression using both sklearn and statsmodels.

For further information about *statsmodels*, please refer to the statsmodels documentation.