How to Join Pandas DataFrames using Merge

Need to join Pandas DataFrames?

If so, I’ll show you how to join Pandas DataFrames using Merge.

In particular, I’ll review the steps to create the following joins:

  • Inner Join
  • Left Join
  • Right Join
  • Outer Join

But before we dive into few examples, here is a template that you may refer to when joining DataFrames:

Join = pd.merge(df1, df2, how='type of join', on=['df1 key', 'df2 key'])

Steps to Join Pandas DataFrames using Merge

Step 1: Create the DataFrames to be joined

Let’s say that you have two datasets that you’d like to join:

(1) The clients dataset:

Client_IDClient_Name
111Jon Snow
222Maria Green
333Bill Jones
444Rick Lee
555Pamela Lopez

(2) The countries dataset:

Client_IDClient_Name
111UK
222Canada
333Spain
444China
777Brazil

The goal is to join the above two datasets using the common Client_ID key.

To start, you may create two DataFrames, where:

  • df1 will capture the first dataset of the clients data
  • df2 will capture the second dataset of the countries data

Here is the code that you can use to create the DataFrames:

import pandas as pd

clients = {'Client_ID': [111,222,333,444,555],
           'Client_Name': ['Jon Snow','Maria Green', 'Bill Jones','Rick Lee','Pamela Lopez']
           }
df1 = pd.DataFrame(clients,columns= ['Client_ID','Client_Name'])
print(df1)

countries = {'Client_ID': [111,222,333,444,777],
             'Client_Country': ['UK','Canada','Spain','China','Brazil']
             }
df2 = pd.DataFrame(countries,columns= ['Client_ID', 'Client_Country'])
print(df2)

Run the code in Python, and you’ll get the following two DataFrames:

Two datasets in Python

Step 2: Merge the pandas DataFrames using an inner join

You may add the following syntax in order to merge the two DataFrames using an inner join:

Inner_Join = pd.merge(df1, df2, how='inner', on=['Client_ID', 'Client_ID'])

You may notice that the how is equal to ‘inner’ to represent an inner join.

Here is the complete code that you may apply in Python:

import pandas as pd

clients = {'Client_ID': [111,222,333,444,555],
           'Client_Name': ['Jon Snow','Maria Green', 'Bill Jones','Rick Lee','Pamela Lopez']
           }
df1 = pd.DataFrame(clients,columns= ['Client_ID','Client_Name'])


countries = {'Client_ID': [111,222,333,444,777],
             'Client_Country': ['UK','Canada','Spain','China','Brazil']
             }
df2 = pd.DataFrame(countries,columns= ['Client_ID', 'Client_Country'])


Inner_Join = pd.merge(df1, df2, how='inner', on=['Client_ID', 'Client_ID'])
print(Inner_Join)

When performing an inner join, only the records that have the same key in both of the datasets will be captured. In our example, there are only 4 records in each of the datasets that contain the same Client_ID key. Only those 4 records will be displayed:

Inner join pandas dataframes

  • The record with the ‘555’ Client_ID, which is present in the clients dataset, but not in the countries dataset, will not be displayed, when using the inner join
  • Similarly, the record with the ‘777’ Client_ID, which is present in the countries dataset, but not in the clients dataset, will not be displayed

Applying a Left Join

Now let’s see what will happen when applying a left join. Simply add the following syntax into the code:

Left_Join = pd.merge(df1, df2, how='left', on=['Client_ID', 'Client_ID'])

You’ll notice that the ‘how’ is now set to ‘left’ to represent the left join.

The complete Python code is therefore:

import pandas as pd

clients = {'Client_ID': [111,222,333,444,555],
           'Client_Name': ['Jon Snow','Maria Green', 'Bill Jones','Rick Lee','Pamela Lopez']
           }
df1 = pd.DataFrame(clients,columns= ['Client_ID','Client_Name'])


countries = {'Client_ID': [111,222,333,444,777],
             'Client_Country': ['UK','Canada','Spain','China','Brazil']
             }
df2 = pd.DataFrame(countries,columns= ['Client_ID', 'Client_Country'])


Left_Join = pd.merge(df1, df2, how='left', on=['Client_ID', 'Client_ID'])
print(Left_Join)

With a left join, all the records from the first dataset will be displayed, irrespective of whether the keys in the first dataset can be found in the second dataset.

However, for the second dataset, only the records with the keys in the second dataset that can be found in the first dataset will be displayed.

How to Join Pandas DataFrames using Merge

  • The record with the ‘555’ Client_ID, which is present in the clients dataset, but not in the countries dataset, will be displayed, when using a left join
  • However, the record with the ‘777’ Client_ID, which is present in the countries dataset, but not in the clients dataset, will not be displayed

Also note that since there is no 555 key in the second dataset, there will be a NaN value under the Client_Country column from the second dataset.

Applying a Right Join

You may now apply this syntax to get the right join (where the ‘how’ is now set to ‘right’):

Right_Join = pd.merge(df1, df2, how='right', on=['Client_ID', 'Client_ID'])

So the complete Python code is:

import pandas as pd

clients = {'Client_ID': [111,222,333,444,555],
           'Client_Name': ['Jon Snow','Maria Green', 'Bill Jones','Rick Lee','Pamela Lopez']
           }
df1 = pd.DataFrame(clients,columns= ['Client_ID','Client_Name'])


countries = {'Client_ID': [111,222,333,444,777],
             'Client_Country': ['UK','Canada','Spain','China','Brazil']
             }
df2 = pd.DataFrame(countries,columns= ['Client_ID', 'Client_Country'])


Right_Join = pd.merge(df1, df2, how='right', on=['Client_ID', 'Client_ID'])
print(Right_Join)

For a right join, all the records from the second dataset will be displayed. However, only the records with the keys in the first dataset that can be found in the second dataset will be displayed.

Join Pandas DataFrames using Merge

  • The record with the ‘555’ Client_ID from the first dataset will not be displayed when applying a right join
  • While, the record with the ‘777’ Client_ID from the second dataset will be displayed

Applying an Outer Join

For the final case, you can apply an outer join by setting the ‘how’ to ‘outer’:

Outer_Join = pd.merge(df1, df2, how='outer', on=['Client_ID', 'Client_ID'])

Here is the Python code:

import pandas as pd

clients = {'Client_ID': [111,222,333,444,555],
           'Client_Name': ['Jon Snow','Maria Green', 'Bill Jones','Rick Lee','Pamela Lopez']
           }
df1 = pd.DataFrame(clients,columns= ['Client_ID','Client_Name'])


countries = {'Client_ID': [111,222,333,444,777],
             'Client_Country': ['UK','Canada','Spain','China','Brazil']
             }
df2 = pd.DataFrame(countries,columns= ['Client_ID', 'Client_Country'])


Outer_Join = pd.merge(df1, df2, how='outer', on=['Client_ID', 'Client_ID'])
print(Outer_Join)

With an outer join, all the records from both of the datasets will be displayed, irrespective of whether a key is missing in one of the datasets:

Merge pandas dataframes

As you can observer both the ‘555’ Client_ID from the first dataset, as well as the ‘777’ Client_ID from the second dataset, will be displayed.

A great source to learn more about joining pandas DataFrames is the Pandas Documentation.