In this guide, you’ll see how to select rows that contain a specific substring in Pandas DataFrame.
In particular, you’ll observe 5 scenarios to get all rows that:
- Contain a specific substring
- Contain one substring OR another substring
- Do NOT contain given substrings
- Contain specific substring in the middle of a string
- Contain a specific numeric value
Example of DataFrame
To start with a simple example, let’s create a DataFrame for the following data:
month | days_in_month |
January | 31 |
February | 28 |
March | 31 |
April | 30 |
May | 31 |
June | 30 |
July | 31 |
August | 31 |
September | 30 |
October | 31 |
November | 30 |
December | 31 |
Here is the code to create the DataFrame in Python:
import pandas as pd data = {'month': ['January','February','March','April','May','June','July','August','September','October','November','December'], 'days_in_month': [31,28,31,30,31,30,31,31,30,31,30,31] } df = pd.DataFrame(data, columns = ['month', 'days_in_month']) print (df)
Once your run the code, you’ll get this DataFrame:
month days_in_month
0 January 31
1 February 28
2 March 31
3 April 30
4 May 31
5 June 30
6 July 31
7 August 31
8 September 30
9 October 31
10 November 30
11 December 31
The ultimate goal is to select all the rows that contain specific substrings in the above Pandas DataFrame. Here are 5 scenarios:
5 Scenarios to Select Rows that Contain a Substring in Pandas DataFrame
(1) Get all rows that contain a specific substring
To begin, let’s get all the months that contain the substring of ‘Ju‘ (for the months of ‘June’ and ‘July’):
import pandas as pd data = {'month': ['January','February','March','April','May','June','July','August','September','October','November','December'], 'days_in_month': [31,28,31,30,31,30,31,31,30,31,30,31] } df = pd.DataFrame(data, columns = ['month', 'days_in_month']) contain_values = df[df['month'].str.contains('Ju')] print (contain_values)
As you can see, the only two months that contain the substring of ‘Ju’ are June and July:
month days_in_month
5 June 30
6 July 31
Note that str.contains() is case sensitive. Meaning that if you specified ‘ju‘ (all in lowercase), while the original values contained a ‘J‘ in uppercase, then you won’t get any selection:
import pandas as pd data = {'month': ['January','February','March','April','May','June','July','August','September','October','November','December'], 'days_in_month': [31,28,31,30,31,30,31,31,30,31,30,31] } df = pd.DataFrame(data, columns = ['month', 'days_in_month']) contain_values = df[df['month'].str.contains('ju')] print (contain_values)
In that case, you’ll get an empty DataFrame:
Empty DataFrame
Columns: [month, days_in_month]
Index: []
(2) Get all rows that contain one substring OR another substring
Now let’s get all the months that contain EITHER ‘Ju‘ OR ‘Ma‘
In that case, you’ll need to use the pipe symbol (‘|’) as follows:
import pandas as pd data = {'month': ['January','February','March','April','May','June','July','August','September','October','November','December'], 'days_in_month': [31,28,31,30,31,30,31,31,30,31,30,31] } df = pd.DataFrame(data, columns = ['month', 'days_in_month']) contain_values = df[df['month'].str.contains('Ju|Ma')] print (contain_values)
Here are the months that you’ll get:
month days_in_month
2 March 31
4 May 31
5 June 30
6 July 31
(3) Get all rows that do NOT contain given substrings
Now let’s select all the months that neither contain ‘Ju’ nor ‘Ma’
In that case, apply the ‘~’ symbol before the df[‘month’]:
import pandas as pd data = {'month': ['January','February','March','April','May','June','July','August','September','October','November','December'], 'days_in_month': [31,28,31,30,31,30,31,31,30,31,30,31] } df = pd.DataFrame(data, columns = ['month', 'days_in_month']) contain_values = df[~df['month'].str.contains('Ju|Ma')] print (contain_values)
Run the code, and you’ll get the following months:
month days_in_month
0 January 31
1 February 28
3 April 30
7 August 31
8 September 30
9 October 31
10 November 30
11 December 31
(4) Get all rows that contain specific substring in the middle of a string
Let’s get all the months that contain ‘uar‘ (for January and February):
import pandas as pd data = {'month': ['January','February','March','April','May','June','July','August','September','October','November','December'], 'days_in_month': [31,28,31,30,31,30,31,31,30,31,30,31] } df = pd.DataFrame(data, columns = ['month', 'days_in_month']) contain_values = df[df['month'].str.contains('uar')] print (contain_values)
You’ll now see the months of January and February:
month days_in_month
0 January 31
1 February 28
(5) Get all rows that contain a specific numeric value
What if you’d like to select all the rows that contain a specific numeric value?
For example, what if you want to select all the rows which contain the numeric value of ‘0‘ under the ‘days_in_month’ column?
In that case, you’ll need to convert the ‘days_in_month’ column from integers to strings before you can apply the str.contains():
import pandas as pd data = {'month': ['January','February','March','April','May','June','July','August','September','October','November','December'], 'days_in_month': [31,28,31,30,31,30,31,31,30,31,30,31] } df = pd.DataFrame(data, columns = ['month', 'days_in_month']) contain_values = df[df['days_in_month'].astype(str).str.contains('0')] print (contain_values)
As you can see, only the months that contain the numeric value of ‘0‘ were selected:
month days_in_month
3 April 30
5 June 30
8 September 30
10 November 30
You can read more about str.contains by visiting the Pandas Documentation.