# Example of Logistic Regression in Python

In this guide, I’ll show you an example of Logistic Regression in Python.

In general, a binary logistic regression describes the relationship between a binary dependent variable and one or more independent variables.

The binary dependent variable has two possible outcomes:

• ‘1’ for true/success; or
• ‘0’ for false/failure
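To make that relationship concrete, here is a minimal sketch of the logistic (sigmoid) function, which turns a weighted sum of the independent variables into a probability between 0 and 1. The intercept, coefficient and input value below are made up for illustration only; they are not fitted to any data.

```python
import math

def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1 / (1 + math.exp(-z))

# Hypothetical intercept and coefficient, chosen only for illustration
intercept = -3.0
coefficient = 1.0

# Linear score for a hypothetical input value x = 4
x = 4
z = intercept + coefficient * x

probability = sigmoid(z)  # estimated probability of the outcome '1'
predicted_class = 1 if probability >= 0.5 else 0  # 0.5 is the usual cut-off

print(probability, predicted_class)
```

A probability of 0.5 or more is mapped to ‘1’, and anything below it to ‘0’.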

Let’s now see how to apply logistic regression in Python using a practical example.

## Steps to Apply Logistic Regression in Python

### Step 1: Gather your data

To start with a simple example, let’s say that your goal is to build a logistic regression model in Python in order to determine whether candidates would get admitted to a prestigious university.

Here, there are two possible outcomes: Admitted (represented by the value of ‘1’) vs. Rejected (represented by the value of ‘0’).

You can then build a logistic regression in Python, where:

• The dependent variable represents whether a person gets admitted; and
• The 3 independent variables are the GMAT score, GPA and Years of work experience

The full dataset is captured in Step 3 below.

Note that it contains only 40 observations. In practice, you’ll need a larger sample size to get more accurate results.

### Step 2: Import the needed Python packages

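Before you start, make sure that the following packages are installed in Python:

• pandas (to create the DataFrame and the confusion matrix)
• sklearn (to split the data and build the logistic regression)
• seaborn (to draw the heatmap of the confusion matrix)
• matplotlib (to display the plot)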

You’ll then need to import all the packages as follows:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt
```
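Since results can vary between package versions, it may help to check what is installed first. The sketch below uses the standard library’s importlib.metadata (Python 3.8+). It assumes the packages were installed under their usual distribution names, which is why sklearn appears as scikit-learn.

```python
from importlib import metadata

# Distribution names as used by pip (sklearn is distributed as scikit-learn)
packages = ['pandas', 'scikit-learn', 'seaborn', 'matplotlib']

versions = {}
for package in packages:
    try:
        versions[package] = metadata.version(package)
    except metadata.PackageNotFoundError:
        versions[package] = None  # package is not installed

for package, version in versions.items():
    print(package, version if version else 'NOT INSTALLED')
```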

### Step 3: Build a dataframe

For this step, you’ll need to capture the dataset (from step 1) in Python. You can accomplish this task using a pandas DataFrame:

```python
import pandas as pd
candidates = {'gmat': [780,750,690,710,680,730,690,720,740,690,610,690,710,680,770,610,580,650,540,590,620,600,550,550,570,670,660,580,650,660,640,620,660,660,680,650,670,580,590,690],
'gpa': [4,3.9,3.3,3.7,3.9,3.7,2.3,3.3,3.3,1.7,2.7,3.7,3.7,3.3,3.3,3,2.7,3.7,2.7,2.3,3.3,2,2.3,2.7,3,3.3,3.7,2.3,3.7,3.3,3,2.7,4,3.3,3.3,2.3,2.7,3.3,1.7,3.7],
'work_experience': [3,4,3,5,4,6,1,4,5,1,3,5,6,4,3,1,4,6,2,3,2,1,4,1,2,6,4,2,6,5,1,2,4,6,5,1,2,1,4,5],
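# NOTE: the 0/1 outcome column is missing from this listing; add it above as
# 'admitted': [...] (40 values, 1 = admitted, 0 = rejected; the column name is an assumption)
}

df = pd.DataFrame(candidates)

print(df)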
```

Alternatively, you could import the data into Python from an external file.
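As a rough sketch of that alternative, pandas can read a CSV file with read_csv. To keep the example self-contained, an in-memory string stands in for the file and holds only the first three rows of the feature columns; the file name candidates.csv in the comment is hypothetical.

```python
import io
import pandas as pd

# Stand-in for an external file: the first three rows of the feature columns
csv_text = """gmat,gpa,work_experience
780,4,3
750,3.9,4
690,3.3,3
"""

# With a real file you would pass its path instead,
# e.g. pd.read_csv('candidates.csv') (hypothetical file name)
df_from_file = pd.read_csv(io.StringIO(csv_text))

print(df_from_file)
```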

### Step 4: Create the logistic regression in Python

Now, set the independent variables (represented as X) and the dependent variable (represented as y):

```python
X = df[['gmat', 'gpa', 'work_experience']]
y = df['admitted']  # assumes the 0/1 outcome column is named 'admitted'
```

Then, apply train_test_split. For example, you can set the test size to 0.25, and therefore the model testing will be based on 25% of the dataset, while the model training will be based on 75% of the dataset:

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
```
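To see what the 75/25 split means in terms of record counts, here is a small sketch that splits 40 placeholder rows. The row_id column is made up and simply numbers the rows; it is not part of the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 40 placeholder rows standing in for the 40 candidates (values are just row numbers)
rows = pd.DataFrame({'row_id': range(40)})

# Same settings as in the guide: 25% for testing, fixed random_state for repeatability
train_rows, test_rows = train_test_split(rows, test_size=0.25, random_state=0)

print(len(train_rows), len(test_rows))
```

With 40 rows and a test size of 0.25, 30 rows go to training and 10 to testing, and no row appears in both sets.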

Apply the logistic regression as follows:

```python
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train,y_train)
y_pred=logistic_regression.predict(X_test)
```

Then, use the code below to get the Confusion Matrix:

```python
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
```
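If the layout of that table is unclear, the following sketch runs pd.crosstab on two short label lists. The labels are invented for illustration and are not taken from the test set.

```python
import pandas as pd

# Invented labels for six hypothetical records (1 = admitted, 0 = rejected)
actual = pd.Series([1, 0, 1, 1, 0, 0], name='Actual')
predicted = pd.Series([1, 0, 0, 1, 1, 0], name='Predicted')

# Rows are the actual classes, columns are the predicted classes
toy_matrix = pd.crosstab(actual, predicted)

print(toy_matrix)
```

The diagonal cells (Actual 0 / Predicted 0 and Actual 1 / Predicted 1) count the correct predictions, while the off-diagonal cells count the errors.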

For the final part, print the Accuracy and plot the Confusion Matrix:

```python
print('Accuracy: ', metrics.accuracy_score(y_test, y_pred))
plt.show()
```

Putting all the code components together:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt

candidates = {'gmat': [780,750,690,710,680,730,690,720,740,690,610,690,710,680,770,610,580,650,540,590,620,600,550,550,570,670,660,580,650,660,640,620,660,660,680,650,670,580,590,690],
'gpa': [4,3.9,3.3,3.7,3.9,3.7,2.3,3.3,3.3,1.7,2.7,3.7,3.7,3.3,3.3,3,2.7,3.7,2.7,2.3,3.3,2,2.3,2.7,3,3.3,3.7,2.3,3.7,3.3,3,2.7,4,3.3,3.3,2.3,2.7,3.3,1.7,3.7],
'work_experience': [3,4,3,5,4,6,1,4,5,1,3,5,6,4,3,1,4,6,2,3,2,1,4,1,2,6,4,2,6,5,1,2,4,6,5,1,2,1,4,5],
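# NOTE: the 0/1 outcome column is missing from this listing; add it above as
# 'admitted': [...] (40 values, 1 = admitted, 0 = rejected; the column name is an assumption)
}

df = pd.DataFrame(candidates)
# print(df)

X = df[['gmat', 'gpa', 'work_experience']]
y = df['admitted']  # assumes the 0/1 outcome column is named 'admitted'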

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)

logistic_regression= LogisticRegression()
logistic_regression.fit(X_train,y_train)
y_pred=logistic_regression.predict(X_test)

confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)

print('Accuracy: ',metrics.accuracy_score(y_test, y_pred))
plt.show()
```

Run the code in Python, and you’ll get a Confusion Matrix plot along with an Accuracy of 0.8. Note that depending on your sklearn version, you may get a different accuracy result. In my case, the sklearn version is 0.22.2.

As can be observed from the matrix:

• TP = True Positives = 4
• TN = True Negatives = 4
• FP = False Positives = 1
• FN = False Negatives = 1

You can then also get the Accuracy using:

Accuracy = (TP+TN)/Total = (4+4)/10 = 0.8

The accuracy is therefore 80% for the test set.
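The same arithmetic can be sketched in Python. The two label lists below are constructed only to reproduce the four counts above (they are not the actual test records), so that the manual formula can be cross-checked against metrics.accuracy_score.

```python
from sklearn import metrics

# Counts read from the confusion matrix above
tp, tn, fp, fn = 4, 4, 1, 1

total = tp + tn + fp + fn
accuracy = (tp + tn) / total

# Label lists built to reproduce those counts (not the real test records)
y_true = [1] * tp + [0] * tn + [0] * fp + [1] * fn
y_hat = [1] * tp + [0] * tn + [1] * fp + [0] * fn

sk_accuracy = metrics.accuracy_score(y_true, y_hat)

print(accuracy, sk_accuracy)
```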

## Diving Deeper into the Results

Let’s now print two components in the Python code:

• print (X_test)
• print (y_pred)

Here is the code used:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

candidates = {'gmat': [780,750,690,710,680,730,690,720,740,690,610,690,710,680,770,610,580,650,540,590,620,600,550,550,570,670,660,580,650,660,640,620,660,660,680,650,670,580,590,690],
'gpa': [4,3.9,3.3,3.7,3.9,3.7,2.3,3.3,3.3,1.7,2.7,3.7,3.7,3.3,3.3,3,2.7,3.7,2.7,2.3,3.3,2,2.3,2.7,3,3.3,3.7,2.3,3.7,3.3,3,2.7,4,3.3,3.3,2.3,2.7,3.3,1.7,3.7],
'work_experience': [3,4,3,5,4,6,1,4,5,1,3,5,6,4,3,1,4,6,2,3,2,1,4,1,2,6,4,2,6,5,1,2,4,6,5,1,2,1,4,5],
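# NOTE: the 0/1 outcome column is missing from this listing; add it above as
# 'admitted': [...] (40 values, 1 = admitted, 0 = rejected; the column name is an assumption)
}

df = pd.DataFrame(candidates)

X = df[['gmat', 'gpa', 'work_experience']]
y = df['admitted']  # assumes the 0/1 outcome column is named 'admitted'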

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)  #train is based on 75% of the dataset, test is based on 25% of dataset

logistic_regression= LogisticRegression()
logistic_regression.fit(X_train,y_train)
y_pred=logistic_regression.predict(X_test)

print (X_test) #test dataset
print (y_pred) #predicted values
```

Recall that our original dataset (from step 1) had 40 observations. Since we set the test size to 0.25, the confusion matrix displayed the results for 10 records (= 40 × 0.25). The print (X_test) statement displays those 10 test records.

The print (y_pred) statement displays the prediction made for those same 10 records (where 1 = admitted, while 0 = rejected).

Comparing these predictions against the actual outcomes in the dataset (from step 1), you’ll see that for the test data, we got the correct results 8 out of 10 times.

This matches the accuracy level of 80%.

## Checking the Prediction for a New Set of Data

Let’s say that you have a new set of data, with 5 new candidates:

| gmat | gpa | work_experience |
|------|-----|-----------------|
| 590  | 2   | 3               |
| 740  | 3.7 | 4               |
| 680  | 3.3 | 6               |
| 610  | 2.3 | 1               |
| 710  | 3   | 5               |

Your goal is to use the existing logistic regression model to predict whether the new candidates will get admitted.

The new set of data can then be captured in a second DataFrame called df2:

```python
new_candidates = {'gmat': [590,740,680,610,710],
'gpa': [2,3.7,3.3,2.3,3],
'work_experience': [3,4,6,1,5]
}

df2 = pd.DataFrame(new_candidates,columns= ['gmat', 'gpa','work_experience'])
```

And here is the complete code to get the prediction for the 5 new candidates:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

candidates = {'gmat': [780,750,690,710,680,730,690,720,740,690,610,690,710,680,770,610,580,650,540,590,620,600,550,550,570,670,660,580,650,660,640,620,660,660,680,650,670,580,590,690],
'gpa': [4,3.9,3.3,3.7,3.9,3.7,2.3,3.3,3.3,1.7,2.7,3.7,3.7,3.3,3.3,3,2.7,3.7,2.7,2.3,3.3,2,2.3,2.7,3,3.3,3.7,2.3,3.7,3.3,3,2.7,4,3.3,3.3,2.3,2.7,3.3,1.7,3.7],
'work_experience': [3,4,3,5,4,6,1,4,5,1,3,5,6,4,3,1,4,6,2,3,2,1,4,1,2,6,4,2,6,5,1,2,4,6,5,1,2,1,4,5],
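# NOTE: the 0/1 outcome column is missing from this listing; add it above as
# 'admitted': [...] (40 values, 1 = admitted, 0 = rejected; the column name is an assumption)
}

df = pd.DataFrame(candidates)

X = df[['gmat', 'gpa', 'work_experience']]
y = df['admitted']  # assumes the 0/1 outcome column is named 'admitted'

# To train on all 40 records instead, skip the split and call fit(X, y);
# the predictions may then differ (test_size=0 is rejected by train_test_split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)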

logistic_regression= LogisticRegression()
logistic_regression.fit(X_train,y_train)

new_candidates = {'gmat': [590,740,680,610,710],
'gpa': [2,3.7,3.3,2.3,3],
'work_experience': [3,4,6,1,5]
}

df2 = pd.DataFrame(new_candidates,columns= ['gmat', 'gpa','work_experience'])
y_pred=logistic_regression.predict(df2)

print (df2)
print (y_pred)
```

Run the code, and you’ll get the prediction for each of the 5 new candidates (where 1 = admitted, while 0 = rejected).

The first and fourth candidates are not expected to be admitted, while the other candidates are expected to be admitted.
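Beyond the 0/1 prediction, you may also want to see how confident the model is. LogisticRegression offers predict_proba for that, as the sketch below shows. Note that the training data here is a small synthetic set invented for this example (it is NOT the dataset from this guide), so the probabilities will not match those of the model built above.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic toy training data, invented for this sketch only
toy = pd.DataFrame({
    'gmat': [540, 580, 600, 620, 650, 680, 700, 720, 750, 780],
    'gpa': [2.0, 2.3, 2.7, 3.0, 3.0, 3.3, 3.3, 3.7, 3.9, 4.0],
    'work_experience': [1, 2, 1, 2, 3, 4, 3, 5, 4, 6],
    'admitted': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

features = ['gmat', 'gpa', 'work_experience']
model = LogisticRegression(max_iter=1000)  # extra iterations since features are unscaled
model.fit(toy[features], toy['admitted'])

# The 5 new candidates from this guide
df2 = pd.DataFrame({'gmat': [590, 740, 680, 610, 710],
                    'gpa': [2, 3.7, 3.3, 2.3, 3],
                    'work_experience': [3, 4, 6, 1, 5]})

# One row per candidate; columns follow model.classes_ (here: rejected, admitted)
probabilities = model.predict_proba(df2)

print(model.classes_)
print(probabilities)
```

Each row sums to 1, and the second column is the estimated probability of admission under this toy model.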