Logistic Regression in Python
In this tutorial, you will learn how to run a logistic regression in Python.
TLDR solution
import pandas as pd
import statsmodels.api as sm
df = pd.DataFrame(data)
X_train = df[['x1', 'x2']]
y_train = df['y']
logit = sm.Logit(y_train, X_train).fit()
yhat = logit.predict(X_test)
prediction = list(map(round, yhat))
Step-by-Step Example
Step 1: Install pandas and statsmodels
If you don't have pandas and statsmodels already installed, execute the following command in your terminal:
pip install pandas statsmodels
Step 2: Prepare Your Data
For demonstration purposes, let's work with fish market data which you can download by clicking here. Import it and have a first look at the raw data:
import pandas as pd
df = pd.read_csv("fishmarket.csv")
print(df.shape)
print(df.head())
The output should look like this:
(159, 7)
  Species  Weight  Length1  Length2  Length3   Height   Width
0   Bream   242.0     23.2     25.4     30.0  11.5200  4.0200
1   Bream   290.0     24.0     26.3     31.2  12.4800  4.3056
2   Bream   340.0     23.9     26.5     31.1  12.3778  4.6961
3   Bream   363.0     26.3     29.0     33.5  12.7300  4.4555
4   Bream   430.0     26.5     29.0     34.0  12.4440  5.1340
The dataset has 159 entries recording each fish's species, its weight, three length dimensions (vertical, diagonal, cross), height and width.
Suppose you want to predict whether a fish weighs at least 1 pound (453.6 grams) using its length, height and width.
For simplicity, let's drop all rows where the Species column value is equal to Pike, since pike are known to be heavy for their size.
df = df.loc[df['Species'] != 'Pike'].reset_index(drop=True)
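Next, you can define the logistic function as follows:
$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3)}}$$
where $x_1$ is Length1, $x_2$ the Height and $x_3$ the Width.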
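To make the formula concrete, here is a minimal sketch in plain Python. The coefficient values are hypothetical and chosen only to show the calculation; the fitted model will produce its own.

```python
import math

def logistic(z: float) -> float:
    """Map a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta):
    """Probability that y = 1 for features x and coefficients beta (no intercept)."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return logistic(z)

# Hypothetical coefficients, not estimates from the fish data
beta = [0.1, -0.2, 0.3]
x = [23.2, 11.52, 4.02]  # Length1, Height, Width of the first row shown earlier
print(predict_proba(x, beta))
```

A linear score of 0 maps to a probability of exactly 0.5, which is why 0.5 is the natural decision threshold later on.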
Create the binary dependent variable and split the data into a train and a test set accordingly:
df['ge1lb'] = (df['Weight'] >= 453.6).astype(int)  # 1 if the fish weighs at least 1 lb
df = df[['ge1lb', 'Length1', 'Height', 'Width']]
df_train = df.sample(frac=0.8, random_state=42)
df_test = df.drop(df_train.index)  # the remaining 20% of rows
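A quick sanity check for any split is that the two sets share no rows and together cover the data. The following is a minimal sketch on a toy frame (its values are made up), using `DataFrame.drop` to take the complement of the sampled rows:

```python
import pandas as pd

# Toy data standing in for the fish table (values are made up)
toy = pd.DataFrame({
    "ge1lb": [0, 1, 0, 1, 1, 0, 0, 1, 0, 1],
    "Length1": [20.1, 30.5, 18.2, 33.0, 29.4, 15.0, 21.7, 35.2, 19.9, 31.1],
})

toy_train = toy.sample(frac=0.8, random_state=42)
toy_test = toy.drop(toy_train.index)  # every row that was not sampled

# The two sets should be disjoint and together cover all rows
overlap = toy_train.index.intersection(toy_test.index)
print(len(toy_train), len(toy_test), len(overlap))
```

If the overlap is not empty, the model is evaluated on rows it has already seen, and the reported accuracy is too optimistic.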
Step 3: Run a Logistic Regression
Next, fit the logit model using statsmodels' Logit class:
import statsmodels.api as sm
X_train = df_train[['Length1', 'Height', 'Width']]
y_train = df_train[['ge1lb']]
X_test = df_test[['Length1', 'Height', 'Width']]
y_test = df_test[['ge1lb']]
logit = sm.Logit(y_train, X_train).fit()
Step 4: Predict
Finally, let's predict using the test set:
yhat = logit.predict(X_test)
# use round to apply the .5 decision threshold
prediction = yhat.apply(round)
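One subtlety: Python's built-in `round` rounds halves to the nearest even number, so a predicted probability of exactly 0.5 becomes 0. If you prefer an explicit rule, you can compare against the threshold directly. A small sketch with made-up probabilities:

```python
import pandas as pd

# Made-up predicted probabilities, for illustration only
yhat_demo = pd.Series([0.10, 0.50, 0.73, 0.49])

by_round = yhat_demo.apply(round)               # 0.5 is rounded down to 0
by_threshold = (yhat_demo >= 0.5).astype(int)   # 0.5 counts as class 1

print(list(by_round))
print(list(by_threshold))
```

An explicit comparison also makes it easy to try a threshold other than 0.5.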
Step 5: Evaluate
Let's calculate the accuracy of the predictions:
# put the true labels and the predictions side by side (aligned on the index)
evals = pd.concat([y_test, prediction.rename('prediction')], axis=1)
# 1 where the prediction matches the true label, 0 otherwise
evals['correct'] = (evals['ge1lb'] == evals['prediction']).astype(int)
accuracy = evals['correct'].mean()
print(accuracy)
0.7807017543859649
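Accuracy alone does not show which kind of mistake the model makes. A confusion matrix splits the errors into false positives and false negatives. Here is a minimal sketch using `pd.crosstab` on made-up labels and predictions:

```python
import pandas as pd

# Made-up labels and predictions, for illustration only
actual = pd.Series([1, 0, 1, 1, 0, 0, 1, 0], name="actual")
predicted = pd.Series([1, 0, 0, 1, 0, 1, 1, 0], name="predicted")

accuracy_demo = (actual == predicted).mean()
confusion = pd.crosstab(actual, predicted)  # rows: actual, columns: predicted

print(accuracy_demo)
print(confusion)
```

The diagonal of the table holds the correct predictions; the off-diagonal cells are the two error types.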
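Not too bad! Keep in mind that the exact figure depends on which rows end up in the test set. To improve the model further, have a look at the data (for example with a scatter plot) and come up with additional features, such as a dummy variable for each fish species.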
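Species dummies can be built with `pd.get_dummies`. The sketch below uses a few toy rows; the species labels and numbers are placeholders rather than rows from the dataset:

```python
import pandas as pd

# Toy rows; the species labels and numbers are placeholders
toy = pd.DataFrame({
    "Species": ["Bream", "Roach", "Perch", "Bream"],
    "Length1": [23.2, 19.0, 21.5, 26.3],
})

# One 0/1 column per species; drop_first leaves the first species as the baseline
dummies = pd.get_dummies(toy["Species"], prefix="Species", drop_first=True, dtype=int)
features = pd.concat([toy[["Length1"]], dummies], axis=1)
print(features)
```

Dropping one category matters mainly when the model includes an intercept, since a full set of dummies would then be perfectly collinear with the constant. Note also that the Species column has to be kept in the data frame for this to work.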
That's it! You just learned how to run a binary logistic regression using Python and statsmodels.