In this short guide, you’ll see how to extract specific characters within a string using Pandas.
The goal, in each of the 8 scenarios below, is to extract only the digits within a string:
- From the left
- From the right
- From the middle
- Before a symbol
- Before a space
- After a symbol
- Between identical symbols
- Between different symbols
(1) Extract the five digits from the left using str[:5]:
import pandas as pd
data = {"Identifier": ["55555-abc", "77777-xyz", "99999-mmm"]}
df = pd.DataFrame(data)
df["Identifier"] = df["Identifier"].str[:5]
print(df)
The result:
Identifier
0 55555
1 77777
2 99999
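Slicing with str[:5] assumes the digits always occupy exactly five characters. As a minimal sketch for cases where the number of leading digits varies, you could use str.extract with a regular expression instead. The sample values and the "Digits" column name below are made up for illustration:

```python
import pandas as pd

# Hypothetical data: the number of leading digits differs per row
df = pd.DataFrame({"Identifier": ["55555-abc", "777-xyz", "9-mmm"]})

# ^ anchors the match at the start of the string; (\d+) captures one or more digits
# expand=False returns a Series rather than a one-column DataFrame
df["Digits"] = df["Identifier"].str.extract(r"^(\d+)", expand=False)
print(df)
```

The extracted values are still strings. If you need numbers, you can convert the column afterwards, for example with astype(int).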
(2) Extract the five digits from the right using str[-5:]:
import pandas as pd
data = {"Identifier": ["ID-55555", "ID-77777", "ID-99999"]}
df = pd.DataFrame(data)
df["Identifier"] = df["Identifier"].str[-5:]
print(df)
The result:
Identifier
0 55555
1 77777
2 99999
(3) Extract the five digits from the middle using str[3:8]:
import pandas as pd
data = {"Identifier": ["ID-55555-End", "ID-77777-End", "ID-99999-End"]}
df = pd.DataFrame(data)
df["Identifier"] = df["Identifier"].str[3:8]
print(df)
The result:
Identifier
0 55555
1 77777
2 99999
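The bracket notation str[3:8] has a method equivalent, str.slice, which can be easier to read when the positions are stored in variables. The sketch below assumes the same data as scenario (3); the names start and stop and the "Middle" column are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"Identifier": ["ID-55555-End", "ID-77777-End", "ID-99999-End"]})

# Positions are zero-based; stop is exclusive, just like regular Python slicing
start, stop = 3, 8
df["Middle"] = df["Identifier"].str.slice(start=start, stop=stop)
print(df)
```

Both forms depend on the digits sitting at fixed positions, so they are only suitable when every string has the same layout.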
(4) Extract the digits before a symbol ("-") using str.split("-").str[0]:
import pandas as pd
data = {"Identifier": ["111-AA", "2222222-BB", "33-CC"]}
df = pd.DataFrame(data)
df["Identifier"] = df["Identifier"].str.split("-").str[0]
print(df)
The result:
Identifier
0 111
1 2222222
2 33
(5) Extract the digits before a space (" ") using str.split(" ").str[0]:
import pandas as pd
data = {"Identifier": ["111 AA", "2222222 BB", "33 CC"]}
df = pd.DataFrame(data)
df["Identifier"] = df["Identifier"].str.split(" ").str[0]
print(df)
The result:
Identifier
0 111
1 2222222
2 33
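Splitting on a single space works when the separator is exactly one space. As a small sketch, calling str.split() with no argument splits on any run of whitespace, which may help if your data contains double spaces or tabs. The sample values and the "Digits" column name are invented for this illustration:

```python
import pandas as pd

# Hypothetical data with a double space and a tab as separators
df = pd.DataFrame({"Identifier": ["111  AA", "2222222\tBB", "33 CC"]})

# With no pattern, split() treats any run of whitespace as one separator
df["Digits"] = df["Identifier"].str.split().str[0]
print(df)
```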
(6) Extract the digits after a symbol ("-") using str.split("-").str[1]:
import pandas as pd
data = {"Identifier": ["AA-111", "BB-2222222", "CC-33"]}
df = pd.DataFrame(data)
df["Identifier"] = df["Identifier"].str.split("-").str[1]
print(df)
The result:
Identifier
0 111
1 2222222
2 33
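It can be useful to know what happens when a value does not follow the expected pattern. The sketch below uses made-up data in which one row has no "-" and another has two. It also passes n=1 to limit the split to the first "-", so everything after that symbol is kept; the "After" column name is illustrative:

```python
import pandas as pd

# Hypothetical data: the second value has no "-", the third has two
df = pd.DataFrame({"Identifier": ["AA-111", "BB222", "CC-33-44"]})

# n=1 splits only at the first "-"
# Rows without the symbol produce a one-item list, so .str[1] yields a missing value
df["After"] = df["Identifier"].str.split("-", n=1).str[1]
print(df)
```

Rows that come back as missing are a quick way to spot values that do not contain the symbol at all.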
(7) Extract the digits between identical symbols ("-") using str.split("-").str[1]:
import pandas as pd
data = {"Identifier": ["AA-111-AA", "BB-2222222-B", "CC-33-CCC"]}
df = pd.DataFrame(data)
df["Identifier"] = df["Identifier"].str.split("-").str[1]
print(df)
The result:
Identifier
0 111
1 2222222
2 33
(8) Extract the digits between different symbols:
import pandas as pd
data = {"Identifier": ["AA-111$AA", "BB-2222222$B", "CC-33$CCC"]}
df = pd.DataFrame(data)
df["Identifier"] = df["Identifier"].str.split("-").str[1]
df["Identifier"] = df["Identifier"].str.split("$").str[0]
print(df)
The result:
Identifier
0 111
1 2222222
2 33
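The two splits above can also be expressed as a single step. As a hedged alternative (not the approach used in this guide), str.extract with a regular expression captures the digits that sit between "-" and "$". The "Digits" column name is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Identifier": ["AA-111$AA", "BB-2222222$B", "CC-33$CCC"]})

# - matches the opening symbol, (\d+) captures the digits,
# and \$ matches a literal "$" (unescaped, $ would mean end of string)
df["Digits"] = df["Identifier"].str.extract(r"-(\d+)\$", expand=False)
print(df)
```

Unlike the split-based version, this pattern returns a missing value for any row where the digits are not enclosed by both symbols.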
You can find many more examples of working with text data in the Pandas Documentation.