Multiple Linear Regression in Python
In this tutorial, you will learn how to perform a multiple linear regression in Python.
TL;DR Solution
import pandas as pd
import statsmodels.api as sm
df = pd.DataFrame(data)  # 'data' holds your observations with columns 'x1', 'x2' and 'y'
X = df[['x1', 'x2']]
y = df['y']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
predictions_statsmodels = model.predict(X)
summary = model.summary()
print(summary)
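Here, data is just a placeholder for anything pandas can turn into a DataFrame with columns x1, x2 and y, for example a small dictionary like this (the values are made-up example numbers):
data = {
    'x1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'x2': [2.0, 1.0, 4.0, 3.0, 5.0],
    'y':  [3.2, 4.1, 8.3, 7.9, 11.0],  # made-up example values
}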
Step-by-Step Example
Step 1: Install pandas and statsmodels
If you don't have pandas and statsmodels already installed, execute the following command in your terminal:
pip install pandas statsmodels
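To verify that the installation worked, you can print the installed versions from Python (an optional sanity check):
import pandas
import statsmodels

print(pandas.__version__)
print(statsmodels.__version__)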
Step 2: Look at Your Data
For demonstration purposes, let's work with the fish market dataset, which you can download here. Import it and take a first look at the raw data:
import pandas as pd
df = pd.read_csv("fishmarket.csv")
print(df.shape)
print(df.head())
The output should look like this:
(159, 7)
  Species  Weight  Length1  Length2  Length3   Height   Width
0   Bream   242.0     23.2     25.4     30.0  11.5200  4.0200
1   Bream   290.0     24.0     26.3     31.2  12.4800  4.3056
2   Bream   340.0     23.9     26.5     31.1  12.3778  4.6961
3   Bream   363.0     26.3     29.0     33.5  12.7300  4.4555
4   Bream   430.0     26.5     29.0     34.0  12.4440  5.1340
The dataset has 159 entries recording each fish's species (a categorical variable!), its weight, three length measurements (vertical, diagonal and cross length), its height and its width.
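To get a better feel for the data before modelling, it also helps to look at the species counts and the summary statistics of the numeric columns (an optional quick check using the columns shown above):
print(df['Species'].value_counts())
print(df.describe())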
Step 3: Run a Linear Regression
Let's say you want to predict the weight of a fish from the other variables, i.e., your linear regression model is:
Weight = beta_0 + beta_1*Bream + ... + beta_9*Width
Note that you need to dummify/one-hot encode the categorical Species variable, and you need to drop one of the resulting dummies to avoid perfect multicollinearity (the dummy variable trap). The three length measurements are also strongly correlated with each other, so it is advisable to keep only one of them.
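If you want to check this correlation yourself before dropping columns, you can inspect the pairwise correlations of the length columns (a quick optional check):
print(df[['Length1', 'Length2', 'Length3']].corr())
With the species dummies and a single length variable in place, you can then run the regression as follows: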
import statsmodels.api as sm
X_dummies = pd.get_dummies(df['Species'], dtype=float)  # one-hot encode the species as 0/1 columns
X_dummies = X_dummies.iloc[:, :-1]  # drop one dummy (here: Whitefish) to avoid multicollinearity
X_rest = df[['Length1', 'Height', 'Width']]  # keep only one of the three length variables
X = pd.concat([X_dummies, X_rest], axis=1)
y = df['Weight']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
predictions_statsmodels = model.predict(X)
summary = model.summary()
print(summary)
The output of the code should look like this:
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Weight   R-squared:                       0.931
Model:                            OLS   Adj. R-squared:                  0.927
Method:                 Least Squares   F-statistic:                     223.0
Date:                Fri, 15 Nov 2024   Prob (F-statistic):           9.38e-82
Time:                        22:00:11   Log-Likelihood:                -947.68
No. Observations:                 159   AIC:                             1915.
Df Residuals:                     149   BIC:                             1946.
Df Model:                           9
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -704.4445     53.988    -13.048      0.000    -811.126    -597.763
Bream        -39.0066     79.642     -0.490      0.625    -196.381     118.368
Parkki        23.8917     61.643      0.388      0.699     -97.915     145.698
Perch         -2.4141     44.019     -0.055      0.956     -89.395      84.567
Pike        -299.6051     86.700     -3.456      0.001    -470.925    -128.286
Roach        -22.1006     46.939     -0.471      0.638    -114.852      70.651
Smelt        256.8682     57.464      4.470      0.000     143.318     370.418
Length1       37.9353      4.010      9.459      0.000      30.011      45.860
Height        13.3419     13.256      1.006      0.316     -12.852      39.536
Width          1.6677     24.478      0.068      0.946     -46.702      50.037
==============================================================================
Omnibus:                       38.971   Durbin-Watson:                   0.825
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               82.558
Skew:                           1.081   Prob(JB):                     1.18e-18
Kurtosis:                       5.791   Cond. No.                         458.
==============================================================================
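The R-squared of 0.931 tells you that the model explains roughly 93% of the variance in fish weight. If you then want to use the fitted model to predict the weight of a new fish, a minimal sketch looks like this (the measurements below are made-up example values, not rows from the dataset):
new_fish = pd.DataFrame({
    'const': [1.0],  # intercept column added by sm.add_constant above
    'Bream': [1.0], 'Parkki': [0.0], 'Perch': [0.0],
    'Pike': [0.0], 'Roach': [0.0], 'Smelt': [0.0],
    'Length1': [25.0], 'Height': [12.0], 'Width': [4.5],
})
new_fish = new_fish[X.columns]  # statsmodels goes by column order, so match the training design matrix
print(model.predict(new_fish))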
That's it! You just learned how to run a multiple linear regression in Python.