Select Rows Containing a Substring in Pandas DataFrame

In this guide, you’ll see how to select rows that contain a specific substring in Pandas DataFrame.

In particular, you’ll observe 5 scenarios to get all rows that:

  1. Contain a specific substring
  2. Contain one substring OR another substring
  3. Do NOT contain given substrings
  4. Contain specific substring in the middle of a string
  5. Contain a specific numeric value

Example of DataFrame

To start with a simple example, let’s create a DataFrame for the following data:

Example of dataset

Here is the code to create the DataFrame in Python:

import pandas as pd

data = {'Month': ['January','February','March','April','May','June','July','August','September','October','November','December'],
        'Days in Month': [31,28,31,30,31,30,31,31,30,31,30,31]
        }

df = pd.DataFrame(data, columns = ['Month', 'Days in Month'])
print (df)

Once your run the code, you’ll get this DataFrame:

Dataset Example

The ultimate goal is to select all the rows that contain specific substrings in the above Pandas DataFrame. Here are 5 scenarios:

5 Scenarios to Select Rows that Contain a Substring in Pandas DataFrame

(1) Get all rows that contain a specific substring

To begin, let’s get all the months that contain the substring of ‘Ju‘ (for the months of ‘June’ and ‘July’):

import pandas as pd

data = {'Month': ['January','February','March','April','May','June','July','August','September','October','November','December'],
        'Days in Month': [31,28,31,30,31,30,31,31,30,31,30,31]
        }

df = pd.DataFrame(data, columns = ['Month', 'Days in Month'])

contain_values = df[df['Month'].str.contains('Ju')]
print (contain_values)

As you can see, the only two months that contain the substring of ‘Ju’ are June and July:

Selecting subset of data

Note that str.contains() is case sensitive. Meaning that if you specified ‘ju‘ (all in lowercase), while the original values contained a ‘J‘ in uppercase, then you won’t get any selection:

import pandas as pd

data = {'Month': ['January','February','March','April','May','June','July','August','September','October','November','December'],
        'Days in Month': [31,28,31,30,31,30,31,31,30,31,30,31]
        }

df = pd.DataFrame(data, columns = ['Month', 'Days in Month'])

contain_values = df[df['Month'].str.contains('ju')]
print (contain_values)

In that case, you’ll get an empty DataFrame:

Empty DataFrame

(2) Get all rows that contain one substring OR another substring

Now let’s get all the months that contain EITHER ‘Ju‘ OR ‘Ma

In that case, you’ll need to use the pipe symbol (‘|’) as follows:

import pandas as pd

data = {'Month': ['January','February','March','April','May','June','July','August','September','October','November','December'],
        'Days in Month': [31,28,31,30,31,30,31,31,30,31,30,31]
        }

df = pd.DataFrame(data, columns = ['Month', 'Days in Month'])

contain_values = df[df['Month'].str.contains('Ju|Ma')]
print (contain_values)

Here are the months that you’ll get:

Select Rows Containing a Substring in Pandas DataFrame

(3) Get all rows that do NOT contain given substrings

Now let’s select all the months that neither contain ‘Ju’ nor ‘Ma’

In that case, apply the ‘~’ symbol before the df[‘Month’]:

import pandas as pd

data = {'Month': ['January','February','March','April','May','June','July','August','September','October','November','December'],
        'Days in Month': [31,28,31,30,31,30,31,31,30,31,30,31]
        }

df = pd.DataFrame(data, columns = ['Month', 'Days in Month'])

contain_values = df[~df['Month'].str.contains('Ju|Ma')]
print (contain_values)

Run the code, and you’ll get the following months:

Select Rows Containing a Substring in Pandas DataFrame

(4) Get all rows that contain specific substring in the middle of a string

Let’s get all the months that contain ‘uar‘ (for January and February):

import pandas as pd

data = {'Month': ['January','February','March','April','May','June','July','August','September','October','November','December'],
        'Days in Month': [31,28,31,30,31,30,31,31,30,31,30,31]
        }

df = pd.DataFrame(data, columns = ['Month', 'Days in Month'])

contain_values = df[df['Month'].str.contains('uar')]
print (contain_values)

You’ll now see the months of January and February:

Select Rows that contain a Substring in Pandas DataFrame

(5) Get all rows that contain a specific numeric value

What if you’d like to select all the rows that contain a specific numeric value?

For example, what if you want to select all the rows which contain the numeric value of ‘0‘ under the ‘Days in Month’ column?

In that case, you’ll need to convert the ‘Days in Month’ column from integers to strings before you can apply the str.contains():

import pandas as pd

data = {'Month': ['January','February','March','April','May','June','July','August','September','October','November','December'],
        'Days in Month': [31,28,31,30,31,30,31,31,30,31,30,31]
        }

df = pd.DataFrame(data, columns = ['Month', 'Days in Month'])

contain_values = df[df['Days in Month'].astype(str).str.contains('0')]
print (contain_values)

As you can see, only the months that contain the numeric value of ‘0‘ were selected:

Get Rows Containing a Substring in Pandas DataFrame

You can read more about str.contains by visiting the Pandas Documentation.