In this tutorial, I’ll show you an example of multiple linear regression in R.

Here are the topics to be reviewed:

- Collecting the data
- Capturing the data in R
- Checking for linearity
- Applying the multiple linear regression model
- Making a prediction

## Steps to apply the multiple linear regression in R

### Step 1: Collect the data

Let's start with a simple example where the goal is to predict the Stock_Index_Price (the dependent variable) of a fictitious economy based on two independent (input) variables:

- Interest_Rate
- Unemployment_Rate

Here is the data to be used for our example:

| Year | Month | Interest_Rate | Unemployment_Rate | Stock_Index_Price |
|------|-------|---------------|-------------------|-------------------|
| 2017 | 12 | 2.75 | 5.3 | 1464 |
| 2017 | 11 | 2.5 | 5.3 | 1394 |
| 2017 | 10 | 2.5 | 5.3 | 1357 |
| 2017 | 9 | 2.5 | 5.3 | 1293 |
| 2017 | 8 | 2.5 | 5.4 | 1256 |
| 2017 | 7 | 2.5 | 5.6 | 1254 |
| 2017 | 6 | 2.5 | 5.5 | 1234 |
| 2017 | 5 | 2.25 | 5.5 | 1195 |
| 2017 | 4 | 2.25 | 5.5 | 1159 |
| 2017 | 3 | 2.25 | 5.6 | 1167 |
| 2017 | 2 | 2 | 5.7 | 1130 |
| 2017 | 1 | 2 | 5.9 | 1075 |
| 2016 | 12 | 2 | 6 | 1047 |
| 2016 | 11 | 1.75 | 5.9 | 965 |
| 2016 | 10 | 1.75 | 5.8 | 943 |
| 2016 | 9 | 1.75 | 6.1 | 958 |
| 2016 | 8 | 1.75 | 6.2 | 971 |
| 2016 | 7 | 1.75 | 6.1 | 949 |
| 2016 | 6 | 1.75 | 6.1 | 884 |
| 2016 | 5 | 1.75 | 6.1 | 866 |
| 2016 | 4 | 1.75 | 5.9 | 876 |
| 2016 | 3 | 1.75 | 6.2 | 822 |
| 2016 | 2 | 1.75 | 6.2 | 704 |
| 2016 | 1 | 1.75 | 6.1 | 719 |

### Step 2: Capture the data in R

Next, you’ll need to capture the above data in R. The following code can be used to accomplish this task:

    Year <- c(2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016)
    Month <- c(12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1)
    Interest_Rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
    Unemployment_Rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
    Stock_Index_Price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719)
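As an optional extra, the separate vectors can be bundled into a single data frame, which keeps the columns together and makes them easy to inspect. This is only a sketch: the name `economy` is an assumption made for illustration, and `rep()` is used here as a compact way to write the same Year and Month values.

```r
# Same values as the vectors above; Year and Month are written compactly with rep()
Year <- c(rep(2017, 12), rep(2016, 12))
Month <- rep(12:1, times = 2)
Interest_Rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,
                   1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
Unemployment_Rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,
                       5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
Stock_Index_Price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,
                       1047,965,943,958,971,949,884,866,876,822,704,719)

# 'economy' is a hypothetical name chosen for this sketch
economy <- data.frame(Year, Month, Interest_Rate, Unemployment_Rate, Stock_Index_Price)

str(economy)   # column names, types, and number of rows
head(economy)  # first six rows, to compare against the table in Step 1
```

Checking the first rows against the table in Step 1 is a quick way to catch a mistyped value before fitting the model.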

Realistically speaking, when dealing with a large amount of data, it is sometimes more practical to import that data into R. In the last section of this tutorial, I’ll show you how to import the data from a CSV file.

### Step 3: Check for linearity

Before you apply linear regression models, you'll need to verify that several assumptions are met. Most notably, you'll need to make sure that a linear relationship exists between the dependent variable and the independent variable(s).

A quick way to check for linearity is by using scatter plots.

For our example, we’ll check that a linear relationship exists between:

- The Stock_Index_Price (dependent variable) and the Interest_Rate (independent variable); and
- The Stock_Index_Price (dependent variable) and the Unemployment_Rate (independent variable)

Here is the code that can be used in R to plot the relationship between the Stock_Index_Price and the Interest_Rate:

    Year <- c(2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016)
    Month <- c(12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1)
    Interest_Rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
    Unemployment_Rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
    Stock_Index_Price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719)

    plot(x=Interest_Rate, y=Stock_Index_Price)

You'll notice that a linear relationship indeed exists between the Stock_Index_Price and the Interest_Rate. Specifically, when interest rates go up, the stock index price also goes up.

And for the second case, you can use the code below in order to plot the relationship between the Stock_Index_Price and the Unemployment_Rate:

    Year <- c(2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016)
    Month <- c(12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1)
    Interest_Rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
    Unemployment_Rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
    Stock_Index_Price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719)

    plot(x=Unemployment_Rate, y=Stock_Index_Price)

As you can see, a linear relationship also exists between the Stock_Index_Price and the Unemployment_Rate: when unemployment rates go up, the stock index price goes down. The relationship is still linear, but with a negative slope.
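To complement the two scatter plots, the direction and strength of each relationship can also be summarized numerically with the Pearson correlation from base R's `cor()`. This is a rough sketch rather than a formal linearity test: a correlation close to 1 or -1 is consistent with a linear pattern but does not prove one, so the plots are still worth inspecting. The names `r_interest` and `r_unemployment` are assumptions made for illustration.

```r
Interest_Rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,
                   1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
Unemployment_Rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,
                       5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
Stock_Index_Price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,
                       1047,965,943,958,971,949,884,866,876,822,704,719)

# Pearson correlation of each input variable with the dependent variable
# (hypothetical variable names)
r_interest <- cor(Interest_Rate, Stock_Index_Price)
r_unemployment <- cor(Unemployment_Rate, Stock_Index_Price)

print(r_interest)      # positive: price rises with the interest rate
print(r_unemployment)  # negative: price falls as unemployment rises
```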

### Step 4: Apply the multiple linear regression in R

You may now use the following template to perform the multiple linear regression in R:

    model <- lm(Dependent variable ~ First independent Variable + Second independent variable + ...)
    summary(model)

Using that template for our example:

    Year <- c(2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016)
    Month <- c(12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1)
    Interest_Rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
    Unemployment_Rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
    Stock_Index_Price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719)

    model <- lm(Stock_Index_Price ~ Interest_Rate + Unemployment_Rate)
    summary(model)

Once you run the code in R, you'll get a summary of the fitted model, including the estimated coefficients.

You can use the coefficients in the summary in order to build the multiple linear regression equation as follows:

Stock_Index_Price = (Intercept) + (Interest_Rate coef)*X_{1} + (Unemployment_Rate coef)*X_{2}

And once you plug the numbers from the summary:

Stock_Index_Price = (1798.4) + (345.5)*X_{1} + (-250.1)*X_{2}
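Rather than copying the numbers from the printed summary by hand, the coefficients can be read from the fitted model with `coef()`. The sketch below refits the same model and rounds the estimates to one decimal place so they can be compared with the equation above; the name `b` is an assumption made for illustration.

```r
Interest_Rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,
                   1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
Unemployment_Rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,
                       5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
Stock_Index_Price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,
                       1047,965,943,958,971,949,884,866,876,822,704,719)

model <- lm(Stock_Index_Price ~ Interest_Rate + Unemployment_Rate)

# Named numeric vector: "(Intercept)", "Interest_Rate", "Unemployment_Rate"
b <- coef(model)  # 'b' is a hypothetical name

# Rounded for comparison with the equation in the text
print(round(b, 1))
```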

In the next section, we’ll see how to use this equation to make predictions.

### Step 5: Make a prediction

Now let’s make a prediction based on the equation above.

For example, imagine that you want to predict the stock index price after you collected the following data:

- Interest Rate = 1.5 (i.e., X_{1} = 1.5)
- Unemployment Rate = 5.8 (i.e., X_{2} = 5.8)

And if you plug that data into the regression equation you’ll get:

Stock_Index_Price = (1798.4) + (345.5)*(1.5) + (-250.1)*(5.8) = 866.07

The predicted value for the Stock_Index_Price is therefore 866.07.
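The same prediction can also be obtained in R with `predict()`, which applies the fitted coefficients to new input values. This is a minimal sketch: the names `new_data`, `predicted`, and `manual` are assumptions made for illustration. Because R works with the unrounded coefficients, its result may differ slightly from the hand calculation above, which uses coefficients rounded to one decimal place.

```r
Interest_Rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,
                   1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
Unemployment_Rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,
                       5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
Stock_Index_Price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,
                       1047,965,943,958,971,949,884,866,876,822,704,719)

model <- lm(Stock_Index_Price ~ Interest_Rate + Unemployment_Rate)

# Column names must match the variable names used in the model formula
new_data <- data.frame(Interest_Rate = 1.5, Unemployment_Rate = 5.8)
predicted <- unname(predict(model, newdata = new_data))

# Same calculation by hand: intercept + coef1 * X1 + coef2 * X2
manual <- sum(coef(model) * c(1, 1.5, 5.8))

print(predicted)
print(manual)
```

Note that an interest rate of 1.5 lies below the lowest value in the data (1.75), so this prediction is an extrapolation beyond the observed range.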

## Method to import data for the Multiple Linear Regression

Practically speaking, you may collect a large amount of data for your model. In those cases, it would be more efficient to import that data, as opposed to typing it within the code.

For example, you can copy the same dataset that we saw at the beginning of the tutorial (under Step 1), and then paste it into a CSV file.

You can then use the code below to perform the multiple linear regression in R. But before you apply this code, you’ll need to modify the path name to the location where you stored the CSV file on your computer.

    mydata <- read.csv('C:\\Users\\doron\\Desktop\\Economy.csv', header = TRUE)

    model <- lm(Stock_Index_Price ~ Interest_Rate + Unemployment_Rate, data = mydata)
    summary(model)

If you run the code, you would get the same summary that we saw earlier.
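If you want to try the import workflow without creating a file by hand, the sketch below writes the example data to a temporary CSV file and reads it back. The temporary path from `tempfile()` and the name `economy` are assumptions made for illustration; in practice you would point `read.csv()` at your own file.

```r
economy <- data.frame(
  Year = c(rep(2017, 12), rep(2016, 12)),
  Month = rep(12:1, times = 2),
  Interest_Rate = c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,
                    1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75),
  Unemployment_Rate = c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,
                        5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1),
  Stock_Index_Price = c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,
                        1047,965,943,958,971,949,884,866,876,822,704,719)
)

# Write to a temporary CSV file (stand-in for your own Economy.csv)
csv_path <- tempfile(fileext = ".csv")
write.csv(economy, csv_path, row.names = FALSE)

# Import the file and fit the model on the imported data
mydata <- read.csv(csv_path, header = TRUE)
model <- lm(Stock_Index_Price ~ Interest_Rate + Unemployment_Rate, data = mydata)
summary(model)

unlink(csv_path)  # remove the temporary file
```

Passing `row.names = FALSE` to `write.csv()` keeps R from adding an extra index column, so the imported columns match the original five.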

Some additional stats to consider in the summary:

- **Adjusted R-squared** reflects the fit of the model, where a higher value generally indicates a better fit
- **Intercept coefficient** is the Y-intercept
- **Interest_Rate coefficient** is the change in Y due to a change of one unit in the interest rate (everything else held constant)
- **Unemployment_Rate coefficient** is the change in Y due to a change of one unit in the unemployment rate (everything else held constant)
- **Std. Error** reflects the level of accuracy of the coefficients
- **Pr(>|t|)** is the *p-value*. A p-value of less than 0.05 is considered to be statistically significant
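These statistics can also be pulled out of the summary object directly, which is handy when you want to reuse them in later code instead of reading them off the printed output. The sketch below refits the model from the tutorial; the names `s`, `std_errors`, and `p_values` are assumptions made for illustration.

```r
Interest_Rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,
                   1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
Unemployment_Rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,
                       5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
Stock_Index_Price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,
                       1047,965,943,958,971,949,884,866,876,822,704,719)

model <- lm(Stock_Index_Price ~ Interest_Rate + Unemployment_Rate)
s <- summary(model)  # 's' is a hypothetical name

# Adjusted R-squared as a single number
print(s$adj.r.squared)

# Coefficient table: one row per term, with columns
# "Estimate", "Std. Error", "t value", "Pr(>|t|)"
print(s$coefficients)

# Individual columns of that table
std_errors <- s$coefficients[, "Std. Error"]
p_values <- s$coefficients[, "Pr(>|t|)"]

# Which terms fall below the 0.05 threshold mentioned above
print(p_values < 0.05)
```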