At times, you may need to extract specific characters from a string. You can apply the concepts of Left, Right, and Mid in Pandas to obtain those characters.
In this tutorial, you’ll see the following 8 scenarios that describe how to extract specific characters:
- From the left
- From the right
- From the middle
- Before a symbol
- Before a space
- After a symbol
- Between identical symbols
- Between different symbols
Reviewing LEFT, RIGHT, MID in Pandas
For each of the above scenarios, the goal is to extract only the digits within the string. For example, for the string ‘55555-abc’, the goal is to extract only the digits 55555.
Let’s now review the first case of obtaining only the digits from the left.
Scenario 1: Extract Characters From the Left
Suppose that you have the following 3 strings:
Identifier |
55555-abc |
77777-xyz |
99999-mmm |
You can capture those strings in Python using a Pandas DataFrame.
Since you’re only interested in extracting the five digits from the left, you may then apply the syntax of str[:5] to the ‘Identifier’ column:
import pandas as pd

data = {'Identifier': ['55555-abc', '77777-xyz', '99999-mmm']}
df = pd.DataFrame(data, columns=['Identifier'])

# Keep the first five characters of each string
left = df['Identifier'].str[:5]
print(left)
Once you run the Python code, you’ll get only the digits from the left:
0 55555
1 77777
2 99999
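Note that the extracted values are still strings, even though they contain only digits. As a minimal sketch, assuming you want to work with them as numbers afterwards, you could convert the result with pd.to_numeric (the variable name left_as_numbers is only illustrative):

```python
import pandas as pd

data = {'Identifier': ['55555-abc', '77777-xyz', '99999-mmm']}
df = pd.DataFrame(data, columns=['Identifier'])

# Slicing through .str returns strings
left = df['Identifier'].str[:5]

# Convert the digit strings to integers
left_as_numbers = pd.to_numeric(left)
print(left_as_numbers)
```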
Scenario 2: Extract Characters From the Right
In this scenario, the goal is to get the five digits from the right:
Identifier |
ID-55555 |
ID-77777 |
ID-99999 |
To accomplish this goal, apply str[-5:] to the ‘Identifier’ column:
import pandas as pd

data = {'Identifier': ['ID-55555', 'ID-77777', 'ID-99999']}
df = pd.DataFrame(data, columns=['Identifier'])

# Keep the last five characters of each string
right = df['Identifier'].str[-5:]
print(right)
This will ensure that you’ll get the five digits from the right:
0 55555
1 77777
2 99999
Scenario 3: Extract Characters From the Middle
There are cases where you may need to extract the data from the middle of a string:
Identifier |
ID-55555-End |
ID-77777-End |
ID-99999-End |
To extract only the digits from the middle, you’ll need to specify the starting and ending points for your desired characters. Positions are counted from 0 and the ending point itself is excluded, so here the starting point is 3 and the ending point is 8. You’ll therefore need to apply str[3:8] as follows:
import pandas as pd

data = {'Identifier': ['ID-55555-End', 'ID-77777-End', 'ID-99999-End']}
df = pd.DataFrame(data, columns=['Identifier'])

# Keep the characters at positions 3 through 7
mid = df['Identifier'].str[3:8]
print(mid)
Only the five digits within the middle of the string will be retrieved:
0 55555
1 77777
2 99999
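If you use these three operations often, one option is to wrap them in small helper functions. The sketch below is only an illustration: the function names left, right, and mid are hypothetical and not part of Pandas, and they rely on the Series.str.slice method, which takes the same start and stop positions as the bracket syntax.

```python
import pandas as pd

def left(series, n):
    """Return the first n characters of each string."""
    return series.str.slice(0, n)

def right(series, n):
    """Return the last n characters of each string (n must be positive)."""
    return series.str.slice(-n)

def mid(series, start, length):
    """Return `length` characters beginning at zero-based position `start`."""
    return series.str.slice(start, start + length)

ids = pd.Series(['ID-55555-End', 'ID-77777-End', 'ID-99999-End'])
print(left(ids, 2))    # the 'ID' prefix
print(right(ids, 3))   # the 'End' suffix
print(mid(ids, 3, 5))  # the five digits
```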
Scenario 4: Before a symbol
Say that you want to obtain all the digits before the dash symbol (‘-’):
Identifier |
111-IDAA |
2222222-IDB |
33-IDCCC |
Even if your string length changes, you can still retrieve all the digits from the left by chaining the two components below:
- str.split('-') – where you’ll need to place the symbol within the parentheses. In our case, it is the dash symbol
- str[0] – where 0 selects the first piece of each split string, which is everything before the dash
And here is the complete Python code:
import pandas as pd

data = {'Identifier': ['111-IDAA', '2222222-IDB', '33-IDCCC']}
df = pd.DataFrame(data, columns=['Identifier'])

# Split on the dash, then keep the first piece
before_symbol = df['Identifier'].str.split('-').str[0]
print(before_symbol)
And the result:
0 111
1 2222222
2 33
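To see why str[0] returns the part before the dash, it can help to look at the intermediate result of the split. The short sketch below (the variable names ids and pieces are only illustrative) prints the pieces for each row before selecting the first one:

```python
import pandas as pd

ids = pd.Series(['111-IDAA', '2222222-IDB', '33-IDCCC'])

# Each row becomes a list of the pieces on either side of the dash
pieces = ids.str.split('-')
print(pieces.tolist())

# .str[0] then picks the first piece of every list
print(pieces.str[0].tolist())
```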
Scenario 5: Before a space
What if you have a space within the string?
Identifier |
111 IDAA |
2222222 IDB |
33 IDCCC |
In that case, simply leave a blank space within the split: str.split(' ')
import pandas as pd

data = {'Identifier': ['111 IDAA', '2222222 IDB', '33 IDCCC']}
df = pd.DataFrame(data, columns=['Identifier'])

# Split on the space, then keep the first piece
before_space = df['Identifier'].str.split(' ').str[0]
print(before_space)
Only the digits from the left will be obtained:
0 111
1 2222222
2 33
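If the separator might be more than one space, or a tab, one option is to call str.split() with no argument, which splits on any run of whitespace. This is a minimal sketch with made-up sample values rather than the data used above:

```python
import pandas as pd

# Hypothetical values with irregular whitespace between the two parts
ids = pd.Series(['111 IDAA', '2222222   IDB', '33\tIDCCC'])

# With no separator given, the split happens on any run of whitespace
before_whitespace = ids.str.split().str[0]
print(before_whitespace)
```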
Scenario 6: After a symbol
You may also face situations where you’d like to get all the characters after a symbol (such as the dash symbol for example) for varying-length strings:
Identifier |
IDAA-111 |
IDB-2222222 |
IDCCC-33 |
In this case, you’ll need to adjust the value within the str[] to 1, which selects the second piece of each split string, so that you’ll obtain the desired digits after the dash:
import pandas as pd

data = {'Identifier': ['IDAA-111', 'IDB-2222222', 'IDCCC-33']}
df = pd.DataFrame(data, columns=['Identifier'])

# Split on the dash, then keep the second piece
after_symbol = df['Identifier'].str.split('-').str[1]
print(after_symbol)
Here is the output from Python:
0 111
1 2222222
2 33
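Keep in mind that str[1] returns the second piece, which is the part after the dash only when each string contains exactly one dash. As a hedged sketch, assuming the prefix itself could contain dashes (the sample values below are made up), splitting once from the right with str.rsplit keeps everything after the last dash:

```python
import pandas as pd

# Hypothetical identifiers whose prefix may itself contain a dash
ids = pd.Series(['ID-AA-111', 'IDB-2222222', 'ID-C-CC-33'])

# Split at most once, starting from the right, then take the last piece
after_last_dash = ids.str.rsplit('-', n=1).str[-1]
print(after_last_dash)
```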
Scenario 7: Between identical symbols
Now what if you want to retrieve the values between two identical symbols (such as the dash symbols) for varying-length strings?
Identifier |
IDAA-111-AA |
IDB-2222222-B |
IDCCC-33-CCC |
In that case, set:
- str.split('-')
- str[1]
So your full Python code would look like this:
import pandas as pd

data = {'Identifier': ['IDAA-111-AA', 'IDB-2222222-B', 'IDCCC-33-CCC']}
df = pd.DataFrame(data, columns=['Identifier'])

# Split on the dash, then keep the second (middle) piece
between_two_symbols = df['Identifier'].str.split('-').str[1]
print(between_two_symbols)
You’ll get all the digits between the two dash symbols:
0 111
1 2222222
2 33
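If you also need the text on either side of the digits, one possible variation is to pass expand=True to str.split, which returns a DataFrame with one column per piece. In this sketch the column names prefix, digits, and suffix are assumptions chosen for illustration:

```python
import pandas as pd

ids = pd.Series(['IDAA-111-AA', 'IDB-2222222-B', 'IDCCC-33-CCC'])

# expand=True returns a DataFrame with one column per piece
parts = ids.str.split('-', expand=True)
parts.columns = ['prefix', 'digits', 'suffix']  # illustrative column names
print(parts)
```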
Scenario 8: Between different symbols
For the final scenario, the goal is to obtain the digits between two different symbols (the dash symbol and the dollar symbol):
Identifier |
IDAA-111$AA |
IDB-2222222$B |
IDCCC-33$CCC |
To accomplish this goal:
- First, set the variable (i.e., between_two_different_symbols) to obtain all the characters after the dash symbol
- Then, set the same variable to obtain all the characters before the dollar symbol
This is how your code would look:
import pandas as pd

data = {'Identifier': ['IDAA-111$AA', 'IDB-2222222$B', 'IDCCC-33$CCC']}
df = pd.DataFrame(data, columns=['Identifier'])

# Step 1: keep everything after the dash
between_two_different_symbols = df['Identifier'].str.split('-').str[1]
# Step 2: keep everything before the dollar sign
between_two_different_symbols = between_two_different_symbols.str.split('$').str[0]
print(between_two_different_symbols)
And the result:
0 111
1 2222222
2 33
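As an alternative to the two split steps, the same result can be sketched in a single step with a regular expression and str.extract. This is not the approach used above, only a possible variation; the pattern and the variable name between are assumptions made for this example.

```python
import pandas as pd

ids = pd.Series(['IDAA-111$AA', 'IDB-2222222$B', 'IDCCC-33$CCC'])

# Capture the characters between the dash and the dollar sign.
# The dollar sign is escaped because it is a special regex character.
between = ids.str.extract(r'-(.*?)\$', expand=False)
print(between)
```

With expand=False and a single capture group, str.extract returns a Series rather than a DataFrame, so the result has the same shape as in the split-based version.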
Conclusion – LEFT, RIGHT, MID in Pandas
You just saw how to apply Left, Right, and Mid in Pandas. The concepts reviewed in this tutorial can be applied across a large number of different scenarios.
You can find many examples about working with text data by visiting the Pandas Documentation.