In this short guide, you’ll see how to select rows that contain a specific substring in Pandas DataFrame.
In particular, you’ll observe 5 cases to get all rows that:
- Contain a specific substring
- Contain one substring OR another substring
- Do NOT contain given substrings
- Contain a specific substring in the middle of a string
- Contain a specific numeric value
The Example
To start, create a DataFrame in Python with the following data:
import pandas as pd
data = {
"month": ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October",
"November", "December"],
"days_in_month": [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
}
df = pd.DataFrame(data)
print(df)
Run the code, and you’ll get this DataFrame:
month days_in_month
0 January 31
1 February 28
2 March 31
3 April 30
4 May 31
5 June 30
6 July 31
7 August 31
8 September 30
9 October 31
10 November 30
11 December 31
5 Cases
Case 1: Get all rows that contain a specific substring
To get all the months that contain the substring of “Ju” (for the months of “June” and “July”):
import pandas as pd
data = {
"month": ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October",
"November", "December"],
"days_in_month": [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
}
df = pd.DataFrame(data)
contain_values = df[df["month"].str.contains("Ju")]
print(contain_values)
As you can see, the only two months that contain the substring of “Ju” are June and July:
month days_in_month
5 June 30
6 July 31
Note that str.contains() is case sensitive. Meaning that if you specified “ju” (all in lowercase), then you won’t get any selection:
import pandas as pd
data = {
"month": ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October",
"November", "December"],
"days_in_month": [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
}
df = pd.DataFrame(data)
contain_values = df[df["month"].str.contains("ju")]
print(contain_values)
In that case, you’ll get an empty DataFrame:
Empty DataFrame
Columns: [month, days_in_month]
Index: []
Case 2: Get all rows that contain one substring OR another substring
To get all the months that contain EITHER “Ju” OR “Ma” using the pipe symbol (“|”):
import pandas as pd
data = {
"month": ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October",
"November", "December"],
"days_in_month": [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
}
df = pd.DataFrame(data)
contain_values = df[df["month"].str.contains("Ju|Ma")]
print(contain_values)
Here are the months that you’ll get:
month days_in_month
2 March 31
4 May 31
5 June 30
6 July 31
Case 3: Get all rows that do NOT contain given substrings
Now select all the months that neither contain “Ju” nor “Ma” by applying the “~” symbol before the df[“month”]:
import pandas as pd
data = {
"month": ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October",
"November", "December"],
"days_in_month": [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
}
df = pd.DataFrame(data)
contain_values = df[~df["month"].str.contains("Ju|Ma")]
print(contain_values)
Run the code, and you’ll get the following months:
month days_in_month
0 January 31
1 February 28
3 April 30
7 August 31
8 September 30
9 October 31
10 November 30
11 December 31
Case 4: Get all rows that contain a specific substring in the middle of a string
To get all the months that contain “uar” (for January and February):
import pandas as pd
data = {
"month": ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October",
"November", "December"],
"days_in_month": [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
}
df = pd.DataFrame(data)
contain_values = df[df["month"].str.contains("uar")]
print(contain_values)
You’ll now see the months of January and February:
month days_in_month
0 January 31
1 February 28
Case 5: Get all rows that contain a specific numeric value
In order to select all the rows which contain the numeric value of 0 under the “days_in_month” column, you’ll first need to convert that column from integers to strings:
import pandas as pd
data = {
"month": ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October",
"November", "December"],
"days_in_month": [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
}
df = pd.DataFrame(data)
contain_values = df[df["days_in_month"].astype(str).str.contains("0")]
print(contain_values)
As you can see, only the months that contain the numeric value of 0 were selected:
month days_in_month
3 April 30
5 June 30
8 September 30
10 November 30
You can read more about str.contains by visiting the Pandas Documentation.