In this short guide, you’ll see an example of multiple linear regression in R.
Here are the topics to be reviewed:
- Collecting and capturing the data in R
- Checking for linearity
- Applying the multiple linear regression model in R
The Steps
Step 1: Collect and capture the data in R
Imagine that you have a fictitious economy, and your goal is to predict the index_price (the dependent variable) based on two independent/input variables:
- interest_rate
- unemployment_rate
To capture the full dataset in R:
year <- c(2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016)
month <- c(12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1)
interest_rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
unemployment_rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
index_price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719)
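The vectors above can also be combined into a single data frame, which keeps the columns together and makes it easy to confirm that all 24 monthly observations were entered. This is a minimal sketch; the name economy_data is an assumption used for illustration and is not part of the original example:

```r
# Same values as the vectors defined above
year <- c(2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016)
month <- c(12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1)
interest_rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
unemployment_rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
index_price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719)

# economy_data is a hypothetical name chosen for this sketch
economy_data <- data.frame(year, month, interest_rate, unemployment_rate, index_price)

# Inspect the structure and the first few rows
str(economy_data)
head(economy_data)
```

If any vector had a different length, data.frame() would stop with an error, so this doubles as a quick sanity check on the data entry.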
Step 2: Check for linearity
Before you apply a linear regression model, you’ll need to verify that a linear relationship exists between the dependent variable and each independent variable.
Here, the goal is to check that a linear relationship exists between:
- The index_price (dependent variable) and the interest_rate (independent variable); and
- The index_price (dependent variable) and the unemployment_rate (independent variable)
A quick way to check for linearity is by using scatter plots.
To plot the relationship between the index_price and the interest_rate:
year <- c(2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016)
month <- c(12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1)
interest_rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
unemployment_rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
index_price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719)
plot(x = interest_rate, y = index_price)
Notice that a linear relationship exists between the index_price and the interest_rate. Specifically, when interest rates go up, the index price also goes up.
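A scatter plot is a visual check; one possible way to complement it with a number is the Pearson correlation coefficient, computed with cor(). The sketch below assumes the same data as above, and the variable name r_interest is chosen only for illustration:

```r
interest_rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
index_price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719)

# Pearson correlation: a value close to +1 suggests a strong positive linear relationship
r_interest <- cor(interest_rate, index_price)
print(r_interest)
```

A positive value here is consistent with the upward pattern in the scatter plot.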
To plot the relationship between the index_price and the unemployment_rate:
year <- c(2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016)
month <- c(12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1)
interest_rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
unemployment_rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
index_price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719)
plot(x = unemployment_rate, y = index_price)
You’ll now see that a linear relationship also exists between the index_price and the unemployment_rate: when the unemployment rate goes up, the index price goes down (the relationship is still linear, but with a negative slope).
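To make the direction of each relationship easier to see, one option is to draw both scatter plots side by side and overlay a fitted straight line on each with abline(). This is a sketch under the same data; the names fit_interest and fit_unemployment are assumptions for illustration:

```r
interest_rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
unemployment_rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
index_price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719)

# One-variable fits, used here only to draw a reference line on each plot
fit_interest <- lm(index_price ~ interest_rate)
fit_unemployment <- lm(index_price ~ unemployment_rate)

# Two plots in one row
par(mfrow = c(1, 2))
plot(x = interest_rate, y = index_price, main = "index_price vs interest_rate")
abline(fit_interest, col = "blue")
plot(x = unemployment_rate, y = index_price, main = "index_price vs unemployment_rate")
abline(fit_unemployment, col = "red")
par(mfrow = c(1, 1))  # restore the default single-plot layout

# The sign of each slope matches the direction seen in the plots
slope_interest <- coef(fit_interest)[["interest_rate"]]
slope_unemployment <- coef(fit_unemployment)[["unemployment_rate"]]
```

These one-variable lines are only a visual aid for the linearity check; the model in the next step estimates both effects together.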
Step 3: Apply the multiple linear regression in R
Use the following template to perform the multiple linear regression in R:
model <- lm(dependent_variable ~ first_independent_variable + second_independent_variable + ...)
summary(model)
Here is the full code to apply the multiple linear regression in R:
year <- c(2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016)
month <- c(12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1)
interest_rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
unemployment_rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
index_price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719)
model <- lm(index_price ~ interest_rate + unemployment_rate)
summary(model)
Once you run the code, you’ll get the following summary:
Call:
lm(formula = index_price ~ interest_rate + unemployment_rate)

Residuals:
     Min       1Q   Median       3Q      Max
-158.205  -41.667   -6.248   57.741  118.810

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)         1798.4      899.2   2.000  0.05861 .
interest_rate        345.5      111.4   3.103  0.00539 **
unemployment_rate   -250.1      117.9  -2.121  0.04601 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 70.56 on 21 degrees of freedom
Multiple R-squared:  0.8976,	Adjusted R-squared:  0.8879
F-statistic: 92.07 on 2 and 21 DF,  p-value: 4.043e-11
You can use the coefficients in the summary above (the Estimate column) to build the multiple linear regression equation as follows, where X1 is the interest_rate and X2 is the unemployment_rate:
index_price = (Intercept) + (interest_rate coef)*X1 + (unemployment_rate coef)*X2
And once you plug the numbers from the summary:
index_price = (1798.4) + (345.5)*X1 + (-250.1)*X2
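As a worked example, plugging in the December 2017 values from the dataset (an interest rate of 2.75 and an unemployment rate of 5.3) with the rounded coefficients gives 1798.4 + 345.5*2.75 + (-250.1)*5.3 = 1422.995. The sketch below repeats this calculation in R, once by hand from coef(model) and once with predict(). Because R uses the unrounded coefficients, its result may differ slightly from the hand calculation. The names new_data, manual_prediction and predicted are assumptions for illustration:

```r
interest_rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
unemployment_rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
index_price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719)

model <- lm(index_price ~ interest_rate + unemployment_rate)

# Input values for the prediction (column names must match the model's variables)
new_data <- data.frame(interest_rate = 2.75, unemployment_rate = 5.3)

# Manual calculation: intercept + coef1 * X1 + coef2 * X2
b <- coef(model)
manual_prediction <- b[["(Intercept)"]] +
  b[["interest_rate"]] * new_data$interest_rate +
  b[["unemployment_rate"]] * new_data$unemployment_rate

# The same calculation done by predict()
predicted <- predict(model, newdata = new_data)

print(manual_prediction)
print(predicted)
```

Both approaches apply the same equation, so the two printed values should agree.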
Some additional stats to consider in the summary:
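- Adjusted R-squared reflects the fit of the model, adjusted for the number of independent variables; a higher value generally indicates a better fit
- Intercept coefficient is the Y-intercept, i.e. the predicted index_price when both the interest rate and the unemployment rate equal zero
- interest_rate coefficient is the change in the predicted index_price for a one-unit increase in the interest rate (everything else held constant)
- unemployment_rate coefficient is the change in the predicted index_price for a one-unit increase in the unemployment rate (everything else held constant)
- Std. Error is the estimated standard error of each coefficient; a smaller value relative to the estimate indicates a more precise estimate
- Pr(>|t|) is the p-value. A p-value of less than 0.05 is conventionally considered statistically significant; here, interest_rate and unemployment_rate meet that threshold, while the intercept (0.05861) does not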
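The statistics listed above can also be read from the summary object directly rather than from the printed output. The sketch below assumes the same model as before; the names model_summary, coef_table, p_values, adj_r2 and significant are chosen for illustration:

```r
interest_rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
unemployment_rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
index_price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719)

model <- lm(index_price ~ interest_rate + unemployment_rate)
model_summary <- summary(model)

# Coefficient table: Estimate, Std. Error, t value and Pr(>|t|) for each term
coef_table <- model_summary$coefficients
print(coef_table)

# p-values, and which terms fall below the conventional 0.05 threshold
p_values <- coef_table[, "Pr(>|t|)"]
significant <- p_values < 0.05
print(significant)

# Adjusted R-squared as a single number
adj_r2 <- model_summary$adj.r.squared
print(adj_r2)
```

Pulling the values out this way is convenient when they need to be reused in later calculations or reported in a table.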