Logistic Regression in Python

In this tutorial, you will learn how to run a logistic regression in Python.

TLDR solution

logistic_regression.py
import pandas as pd
import statsmodels.api as sm

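# data: your own records with feature columns x1, x2 and a binary outcome y
df = pd.DataFrame(data)

X_train = df[['x1', 'x2']]
y_train = df['y']

logit = sm.Logit(y_train, X_train).fit()

# X_test: held-out rows with the same columns as X_train
yhat = logit.predict(X_test)
# round applies the .5 decision threshold
prediction = list(map(round, yhat))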

Step-by-Step Example

Step 1: Install pandas and statsmodels

If you don't have pandas and statsmodels already installed, execute the following command in your terminal:

pip install pandas statsmodels

Step 2: Prepare Your Data

For demonstration purposes, let's work with fish market data which you can download by clicking here. Import it and have a first look at the raw data:

import pandas as pd

df = pd.read_csv("fishmarket.csv")

print(df.shape)
print(df.head())

The output should look like this:

(159, 7)
  Species  Weight  Length1  Length2  Length3   Height   Width
0   Bream   242.0     23.2     25.4     30.0  11.5200  4.0200
1   Bream   290.0     24.0     26.3     31.2  12.4800  4.3056
2   Bream   340.0     23.9     26.5     31.1  12.3778  4.6961
3   Bream   363.0     26.3     29.0     33.5  12.7300  4.4555
4   Bream   430.0     26.5     29.0     34.0  12.4440  5.1340

The dataset has 159 entries recording each fish's species, its weight, three length dimensions (vertical, diagonal, cross), height and width.

Suppose you want to predict whether a fish weighs at least 1 pound (453.6 grams) using its length, height and width. For simplicity, let's drop all rows where the Species column is equal to Pike, since pike are known to be heavy for their size.

df = df.loc[df['Species'] != 'Pike'].reset_index(drop=True)

Next, you can define the logistic function as follows:

$$p(x)=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3)}}$$

where $x_1$ is Length1, $x_2$ the Height and $x_3$ the Width.
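To make the formula concrete, here is a minimal sketch of the logistic function in plain Python. The coefficient values are made up purely for illustration (they are not fitted values from the fish data), and the helper names `logistic` and `predict_proba` are hypothetical.

```python
import math

def logistic(z):
    """Map any real number z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, betas):
    """p(x) for features x = (x1, x2, x3) and betas = (b0, b1, b2, b3)."""
    z = betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))
    return logistic(z)

# Hypothetical coefficients, chosen only for illustration
betas = (-20.0, 0.3, 0.5, 1.0)
# First row of the dataset: Length1=23.2, Height=11.52, Width=4.02
p = predict_proba((23.2, 11.52, 4.02), betas)
print(round(p, 3))
```

A linear score of 0 maps to a probability of exactly 0.5, which is why 0.5 is the natural decision threshold later on.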

Create the binary dependent variable and split the data into a train and a test set accordingly:

# 1 if the fish weighs at least 1 pound (453.6 grams), else 0
df['ge1lb'] = (df['Weight'] >= 453.6).astype(int)
df = df[['ge1lb', 'Length1', 'Height', 'Width']]

df_train = df.sample(frac=0.8, random_state=42)
# hold out the rows that were not sampled into the training set
df_test = df.drop(df_train.index)
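A quick way to convince yourself that a split is clean is to check that the two parts are disjoint and together cover every row. The sketch below does this on a small made-up frame (the `toy` data is hypothetical, not the fish data), holding out whatever rows were not sampled into the training part.

```python
import pandas as pd

# Made-up stand-in for the prepared DataFrame
toy = pd.DataFrame({"ge1lb": [0, 1] * 5, "Length1": range(10)})

toy_train = toy.sample(frac=0.8, random_state=42)
# Hold out every row whose index label was not sampled for training
toy_test = toy.drop(toy_train.index)

# The two parts must not share rows and must cover the whole frame
assert toy_train.index.intersection(toy_test.index).empty
assert len(toy_train) + len(toy_test) == len(toy)
print(len(toy_train), len(toy_test))
```

If the training and test rows overlap, the accuracy measured on the test set will look better than it really is.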

Step 3: Run a Logistic Regression

Next, fit the logit model using the Logit class from statsmodels:

import statsmodels.api as sm

X_train = df_train[['Length1', 'Height', 'Width']]
y_train = df_train[['ge1lb']]

X_test = df_test[['Length1', 'Height', 'Width']]
y_test = df_test[['ge1lb']]

logit = sm.Logit(y_train, X_train).fit()

Step 4: Predict

Now, let's predict using the test set:

yhat = logit.predict(X_test)
# use round to apply the .5 decision threshold
prediction = yhat.apply(round)
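Note that Python's built-in `round` rounds exact ties to the nearest even integer, so `round(0.5)` is 0. If you want the cutoff to be explicit, or want to try a different one, you can compare against the threshold directly. A minimal sketch with made-up probabilities (the `probs` values and the `threshold` name are assumptions for illustration):

```python
import pandas as pd

# Made-up predicted probabilities, standing in for yhat
probs = pd.Series([0.12, 0.5, 0.73, 0.49])

threshold = 0.5  # assumption: classify as 1 when p >= threshold
labels = (probs >= threshold).astype(int)
print(labels.tolist())  # [0, 1, 1, 0]

# Built-in round sends the exact tie 0.5 to 0 instead
print(round(0.5))  # 0
```

Lowering the threshold flags more fish as heavy, while raising it flags fewer.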

Step 5: Evaluate

Let's calculate the accuracy of the predictions:

# y_test and prediction share the same index, so they compare row by row
accuracy = (y_test['ge1lb'] == prediction).mean()
print(accuracy)
0.7807017543859649

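Not too bad! The exact figure depends on which rows end up in the test set. To improve the model further, have a look at the data (for example, in a scatter plot) and come up with additional features, such as a dummy variable for each fish species.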
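As a rough sketch of the dummy-variable idea, `pd.get_dummies` can turn the species into one indicator column each. This assumes you keep the Species column instead of dropping it during data preparation; the `fish` rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the raw fish market data
fish = pd.DataFrame({
    "Species": ["Bream", "Perch", "Roach", "Perch"],
    "Length1": [23.2, 20.0, 17.5, 25.1],
})

# One indicator column per species; drop_first avoids perfect collinearity
# with the intercept if the model includes one
dummies = pd.get_dummies(fish["Species"], prefix="is", drop_first=True, dtype=int)
features = pd.concat([fish[["Length1"]], dummies], axis=1)
print(features.columns.tolist())  # ['Length1', 'is_Perch', 'is_Roach']
```

The dropped first category (Bream here) becomes the baseline that the other species are compared against.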

That's it! You just learned how to run a binary logistic regression using Python and statsmodels.