To join Pandas DataFrames using merge:
pd.merge(df1, df2, how='type of join', on=['df1 key', 'df2 key'])
In this short guide, you’ll see the steps create the following joins:
- Inner
- Left
- Right
- Outer
Steps to Join Pandas DataFrames using Merge
Step 1: Create the DataFrames to be joined
Let’s say that you have two datasets that you’d like to join:
(1) The products dataset:
product_id | product |
111 | computer |
222 | tablet |
333 | printer |
444 | keyboard |
555 | monitor |
(2) The brands dataset:
product_id | brand |
111 | A |
222 | B |
333 | C |
444 | D |
777 | E |
The goal is to join the above two datasets using the common product_id key.
To start, you may create the two DataFrames, where:
- df1 will capture the products data
- df2 will capture the brands data
Here is the code to create the DataFrames:
import pandas as pd
data1 = {'product_id': [111, 222, 333, 444, 555],
'product': ['computer', 'tablet', 'printer', 'keyboard', 'monitor']
}
df1 = pd.DataFrame(data1)
print(df1)
data2 = {'product_id': [111, 222, 333, 444, 777],
'brand': ['A', 'B', 'C', 'D', 'E']
}
df2 = pd.DataFrame(data2)
print(df2)
Run the code in Python, and you’ll get the following two DataFrames:
product_id product
0 111 computer
1 222 tablet
2 333 printer
3 444 keyboard
4 555 monitor
product_id brand
0 111 A
1 222 B
2 333 C
3 444 D
4 777 E
Step 2: Merge the DataFrames using an Inner Join
You may add this syntax in order to merge the two DataFrames using an inner join:
inner_join = pd.merge(df1, df2, how='inner', on=['product_id', 'product_id'])
Notice that the how is equal to ‘inner’ to represent an inner join.
Therefore, the complete Python code is:
import pandas as pd
data1 = {'product_id': [111, 222, 333, 444, 555],
'product': ['computer', 'tablet', 'printer', 'keyboard', 'monitor']
}
df1 = pd.DataFrame(data1)
data2 = {'product_id': [111, 222, 333, 444, 777],
'brand': ['A', 'B', 'C', 'D', 'E']
}
df2 = pd.DataFrame(data2)
inner_join = pd.merge(df1, df2, how='inner', on=['product_id', 'product_id'])
print(inner_join)
When performing an inner join, only the records that have the same key in both of the datasets would be captured. In our example, there are only 4 records in each of the datasets that contain the same ‘product_id’ key:
product_id product brand
0 111 computer A
1 222 tablet B
2 333 printer C
3 444 keyboard D
- The record with the ‘555’ product_id, which is present in the products dataset, but not in the brands dataset, would not be displayed, when applying the inner join
- Similarly, the record with the ‘777’ product_id, which is present in the brands dataset, but not in the products dataset, would not be displayed
Applying a Left Join
Now let’s see what would happen when you apply a left join. Simply use the following syntax:
left_join = pd.merge(df1, df2, how='left', on=['product_id', 'product_id'])
Notice that the ‘how’ is now set to ‘left’ to represent a left join.
The complete Python code is therefore:
import pandas as pd
data1 = {'product_id': [111, 222, 333, 444, 555],
'product': ['computer', 'tablet', 'printer', 'keyboard', 'monitor']
}
df1 = pd.DataFrame(data1)
data2 = {'product_id': [111, 222, 333, 444, 777],
'brand': ['A', 'B', 'C', 'D', 'E']
}
df2 = pd.DataFrame(data2)
left_join = pd.merge(df1, df2, how='left', on=['product_id', 'product_id'])
print(left_join)
With a left join, all the records from the first dataset would be displayed. However, only the records with the keys in the second dataset, which can be found in the first dataset, would be displayed:
product_id product brand
0 111 computer A
1 222 tablet B
2 333 printer C
3 444 keyboard D
4 555 monitor NaN
- The record with the ‘555’ product_id, which is present in the products dataset, but not in the brands dataset, would be displayed, when using a left join
- However, the record with the ‘777’ product_id, which is present in the brands dataset, but not in the products dataset, would not be displayed
Also note that since there is no 555 key in the second dataset, there will be a NaN value under the ‘brand’ column.
Applying a Right Join
You may now apply this syntax to get a right join (where the ‘how’ is now set to ‘right’):
right_join = pd.merge(df1, df2, how='right', on=['product_id', 'product_id'])
So the complete Python code is:
import pandas as pd
data1 = {'product_id': [111, 222, 333, 444, 555],
'product': ['computer', 'tablet', 'printer', 'keyboard', 'monitor']
}
df1 = pd.DataFrame(data1)
data2 = {'product_id': [111, 222, 333, 444, 777],
'brand': ['A', 'B', 'C', 'D', 'E']
}
df2 = pd.DataFrame(data2)
right_join = pd.merge(df1, df2, how='right', on=['product_id', 'product_id'])
print(right_join)
For a right join, all the records from the second dataset would be displayed. However, only the records with the keys in the first dataset, which can be found in the second dataset, would be displayed:
product_id product brand
0 111 computer A
1 222 tablet B
2 333 printer C
3 444 keyboard D
4 777 NaN E
- The record with the ‘555’ product_id from the first dataset would not be displayed when applying a right join
- While, the record with the ‘777’ product_id from the second dataset would be displayed
Applying an Outer Join
For the final case, you can apply an outer join by setting the ‘how’ to ‘outer’:
outer_join = pd.merge(df1, df2, how='outer', on=['product_id', 'product_id'])
Here is the Python code:
import pandas as pd
data1 = {'product_id': [111, 222, 333, 444, 555],
'product': ['computer', 'tablet', 'printer', 'keyboard', 'monitor']
}
df1 = pd.DataFrame(data1)
data2 = {'product_id': [111, 222, 333, 444, 777],
'brand': ['A', 'B', 'C', 'D', 'E']
}
df2 = pd.DataFrame(data2)
outer_join = pd.merge(df1, df2, how='outer', on=['product_id', 'product_id'])
print(outer_join)
With an outer join, all the records from both datasets would be displayed, irrespective of whether a key is missing in one of the datasets:
product_id product brand
0 111 computer A
1 222 tablet B
2 333 printer C
3 444 keyboard D
4 555 monitor NaN
5 777 NaN E
As you may observe, both the ‘555’ product_id from the first dataset, as well as the ‘777’ product_id from the second dataset, would be displayed.
You can learn more about joining Pandas DataFrames by visiting the Pandas Documentation.