Multiple Linear Regression in Python

In this tutorial, you will learn how to perform a multiple linear regression in Python.

TL;DR Solution

linear_regression.py
import pandas as pd
import statsmodels.api as sm

# toy example data; replace with your own dataset
data = {
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, 1, 4, 3, 5],
    'y': [3, 4, 8, 9, 12],
}
df = pd.DataFrame(data)

X = df[['x1', 'x2']]  # predictor columns
y = df['y']           # response column

X = sm.add_constant(X)  # add the intercept term
model = sm.OLS(y, X).fit()
predictions_statsmodels = model.predict(X)
summary = model.summary()
print(summary)

Step-by-Step Example

Step 1: Install pandas and statsmodels

If you don't have pandas and statsmodels already installed, execute the following command in your terminal:

pip install pandas statsmodels

Step 2: Look at Your Data

For demonstration purposes, let's work with a fish market dataset, stored locally as fishmarket.csv. Import it and take a first look at the raw data:

import pandas as pd

df = pd.read_csv("fishmarket.csv")

print(df.shape)
print(df.head())

The output should look like this:

(159, 7)
  Species  Weight  Length1  Length2  Length3   Height   Width
0   Bream   242.0     23.2     25.4     30.0  11.5200  4.0200
1   Bream   290.0     24.0     26.3     31.2  12.4800  4.3056
2   Bream   340.0     23.9     26.5     31.1  12.3778  4.6961
3   Bream   363.0     26.3     29.0     33.5  12.7300  4.4555
4   Bream   430.0     26.5     29.0     34.0  12.4440  5.1340

The dataset has 159 entries, each recording a fish's species (categorical values!), its weight, three length dimensions (vertical, diagonal, and cross length), its height, and its width.
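To double-check which columns are numeric and which species occur in the data, you can inspect the column dtypes and the distinct species values (a quick, optional check):

print(df.dtypes)                     # Species is the only non-numeric column
print(df['Species'].value_counts())  # distinct species and their frequencies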

Step 3: Run the Linear Regression

Let's say you want to predict the weight of a fish from the other variables, i.e., your linear regression model is:

Weight = beta_0 + beta_1*Bream + ... + beta_9*Width

Note that you need to dummy-code/one-hot encode the categorical Species variable, and you need to drop one of the resulting dummies to avoid perfect multicollinearity (the dummy variable trap). For the same reason, since the three length measurements are highly correlated with one another, you should keep only one of them and drop the other two; the quick check below makes this visible.
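As an optional sanity check, the correlation matrix of the three length columns shows the redundancy (correlations close to 1 indicate near-duplicate predictors):

# pairwise correlations between the three length measurements
print(df[['Length1', 'Length2', 'Length3']].corr())

You can then run the regression as follows: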

import statsmodels.api as sm

X_dummies = pd.get_dummies(df['Species'], dtype=float)  # newer pandas returns bool dummies by default, which statsmodels cannot mix with floats
X_dummies = X_dummies.iloc[:, :-1]  # drop one dummy (here: Whitefish) to avoid multicollinearity
X_rest = df[['Length1', 'Height', 'Width']]

X = pd.concat([X_dummies, X_rest], axis=1)
y = df['Weight']

X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
predictions_statsmodels = model.predict(X)
summary = model.summary()

print(summary)

The output of the code should look like this:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 Weight   R-squared:                       0.931
Model:                            OLS   Adj. R-squared:                  0.927
Method:                 Least Squares   F-statistic:                     223.0
Date:                Fri, 15 Nov 2024   Prob (F-statistic):           9.38e-82
Time:                        22:00:11   Log-Likelihood:                -947.68
No. Observations:                 159   AIC:                             1915.
Df Residuals:                     149   BIC:                             1946.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -704.4445     53.988    -13.048      0.000    -811.126    -597.763
Bream        -39.0066     79.642     -0.490      0.625    -196.381     118.368
Parkki        23.8917     61.643      0.388      0.699     -97.915     145.698
Perch         -2.4141     44.019     -0.055      0.956     -89.395      84.567
Pike        -299.6051     86.700     -3.456      0.001    -470.925    -128.286
Roach        -22.1006     46.939     -0.471      0.638    -114.852      70.651
Smelt        256.8682     57.464      4.470      0.000     143.318     370.418
Length1       37.9353      4.010      9.459      0.000      30.011      45.860
Height        13.3419     13.256      1.006      0.316     -12.852      39.536
Width          1.6677     24.478      0.068      0.946     -46.702      50.037
==============================================================================
Omnibus:                       38.971   Durbin-Watson:                   0.825
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               82.558
Skew:                           1.081   Prob(JB):                     1.18e-18
Kurtosis:                       5.791   Cond. No.                         458.
==============================================================================
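Everything in the summary table is also exposed programmatically on the fitted results object, which is handy for further processing. For example (using the model fitted above; the new fish below is a made-up observation):

# estimated coefficients as a pandas Series, indexed by variable name
print(model.params)

# goodness-of-fit measures
print(model.rsquared)      # R-squared
print(model.rsquared_adj)  # adjusted R-squared

# predict the weight of a hypothetical new fish: a Pike with
# Length1=40, Height=7, Width=5 (all other dummies set to 0)
new_fish = pd.DataFrame([{col: 0.0 for col in X.columns}])
new_fish['const'] = 1.0
new_fish['Pike'] = 1.0
new_fish[['Length1', 'Height', 'Width']] = [40.0, 7.0, 5.0]
print(model.predict(new_fish))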

That's it! You just learned how to run a multiple linear regression in Python.