Join Pandas DataFrames using Merge

To join Pandas DataFrames using merge:

pd.merge(df1, df2, how='type of join', on=['df1 key', 'df2 key'])

In this short guide, you’ll see the steps create the following joins:

  • Inner
  • Left
  • Right
  • Outer

Steps to Join Pandas DataFrames using Merge

Step 1: Create the DataFrames to be joined

Let’s say that you have two datasets that you’d like to join:

(1) The products dataset:

product_idproduct
111computer
222tablet
333printer
444keyboard
555monitor

(2) The brands dataset:

product_idbrand
111A
222B
333C
444D
777E

The goal is to join the above two datasets using the common product_id key.

To start, you may create the two DataFrames, where:

  • df1 will capture the products data
  • df2 will capture the brands data

Here is the code to create the DataFrames:

import pandas as pd

data1 = {'product_id': [111, 222, 333, 444, 555],
'product': ['computer', 'tablet', 'printer', 'keyboard', 'monitor']
}

df1 = pd.DataFrame(data1)
print(df1)

data2 = {'product_id': [111, 222, 333, 444, 777],
'brand': ['A', 'B', 'C', 'D', 'E']
}

df2 = pd.DataFrame(data2)
print(df2)

Run the code in Python, and you’ll get the following two DataFrames:

   product_id   product
0         111  computer
1         222    tablet
2         333   printer
3         444  keyboard
4         555   monitor
   product_id brand
0         111     A
1         222     B
2         333     C
3         444     D
4         777     E

Step 2: Merge the DataFrames using an Inner Join

You may add this syntax in order to merge the two DataFrames using an inner join:

inner_join = pd.merge(df1, df2, how='inner', on=['product_id', 'product_id'])

Notice that the how is equal to ‘inner’ to represent an inner join.

Therefore, the complete Python code is:

import pandas as pd

data1 = {'product_id': [111, 222, 333, 444, 555],
'product': ['computer', 'tablet', 'printer', 'keyboard', 'monitor']
}

df1 = pd.DataFrame(data1)

data2 = {'product_id': [111, 222, 333, 444, 777],
'brand': ['A', 'B', 'C', 'D', 'E']
}

df2 = pd.DataFrame(data2)

inner_join = pd.merge(df1, df2, how='inner', on=['product_id', 'product_id'])
print(inner_join)

When performing an inner join, only the records that have the same key in both of the datasets would be captured. In our example, there are only 4 records in each of the datasets that contain the same ‘product_id’ key:

   product_id   product brand
0         111  computer     A
1         222    tablet     B
2         333   printer     C
3         444  keyboard     D
  • The record with the ‘555’ product_id, which is present in the products dataset, but not in the brands dataset, would not be displayed, when applying the inner join
  • Similarly, the record with the ‘777’ product_id, which is present in the brands dataset, but not in the products dataset, would not be displayed

Applying a Left Join

Now let’s see what would happen when you apply a left join. Simply use the following syntax:

left_join = pd.merge(df1, df2, how='left', on=['product_id', 'product_id'])

Notice that the ‘how’ is now set to ‘left’ to represent a left join.

The complete Python code is therefore:

import pandas as pd

data1 = {'product_id': [111, 222, 333, 444, 555],
'product': ['computer', 'tablet', 'printer', 'keyboard', 'monitor']
}

df1 = pd.DataFrame(data1)

data2 = {'product_id': [111, 222, 333, 444, 777],
'brand': ['A', 'B', 'C', 'D', 'E']
}

df2 = pd.DataFrame(data2)

left_join = pd.merge(df1, df2, how='left', on=['product_id', 'product_id'])
print(left_join)

With a left join, all the records from the first dataset would be displayed. However, only the records with the keys in the second dataset, which can be found in the first dataset, would be displayed:

   product_id   product brand
0         111  computer     A
1         222    tablet     B
2         333   printer     C
3         444  keyboard     D
4         555   monitor   NaN
  • The record with the ‘555’ product_id, which is present in the products dataset, but not in the brands dataset, would be displayed, when using a left join
  • However, the record with the ‘777’ product_id, which is present in the brands dataset, but not in the products dataset, would not be displayed

Also note that since there is no 555 key in the second dataset, there will be a NaN value under the ‘brand’ column.

Applying a Right Join

You may now apply this syntax to get a right join (where the ‘how’ is now set to ‘right’):

right_join = pd.merge(df1, df2, how='right', on=['product_id', 'product_id'])

So the complete Python code is:

import pandas as pd

data1 = {'product_id': [111, 222, 333, 444, 555],
'product': ['computer', 'tablet', 'printer', 'keyboard', 'monitor']
}

df1 = pd.DataFrame(data1)

data2 = {'product_id': [111, 222, 333, 444, 777],
'brand': ['A', 'B', 'C', 'D', 'E']
}

df2 = pd.DataFrame(data2)

right_join = pd.merge(df1, df2, how='right', on=['product_id', 'product_id'])
print(right_join)

For a right join, all the records from the second dataset would be displayed. However, only the records with the keys in the first dataset, which can be found in the second dataset, would be displayed:

   product_id   product brand
0         111  computer     A
1         222    tablet     B
2         333   printer     C
3         444  keyboard     D
4         777       NaN     E
  • The record with the ‘555’ product_id from the first dataset would not be displayed when applying a right join
  • While, the record with the ‘777’ product_id from the second dataset would be displayed

Applying an Outer Join

For the final case, you can apply an outer join by setting the ‘how’ to ‘outer’:

outer_join = pd.merge(df1, df2, how='outer', on=['product_id', 'product_id'])

Here is the Python code:

import pandas as pd

data1 = {'product_id': [111, 222, 333, 444, 555],
'product': ['computer', 'tablet', 'printer', 'keyboard', 'monitor']
}

df1 = pd.DataFrame(data1)

data2 = {'product_id': [111, 222, 333, 444, 777],
'brand': ['A', 'B', 'C', 'D', 'E']
}

df2 = pd.DataFrame(data2)

outer_join = pd.merge(df1, df2, how='outer', on=['product_id', 'product_id'])
print(outer_join)

With an outer join, all the records from both datasets would be displayed, irrespective of whether a key is missing in one of the datasets:

   product_id   product brand
0         111  computer     A
1         222    tablet     B
2         333   printer     C
3         444  keyboard     D
4         555   monitor   NaN
5         777       NaN     E

As you may observe, both the ‘555’ product_id from the first dataset, as well as the ‘777’ product_id from the second dataset, would be displayed.

You can learn more about joining Pandas DataFrames by visiting the Pandas Documentation.

Leave a Comment