Text data often needs to be cleaned, searched, or transformed before it is usable. Python's built-in string methods and the standard-library “re” module (regular expressions) cover most of these tasks.
Here’s a guide with examples to help you master these techniques:
Examples
Example 1: Basic String Operations
To get the length of a string and convert it to uppercase, lowercase, and title case:
text = "Hello, World! This is a sample text."
# Basic string operations
print("Length:", len(text))
print("Uppercase:", text.upper())
print("Lowercase:", text.lower())
print("Title case:", text.title())
The result:
Length: 36
Uppercase: HELLO, WORLD! THIS IS A SAMPLE TEXT.
Lowercase: hello, world! this is a sample text.
Title case: Hello, World! This Is A Sample Text.
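One caveat with title(): it treats every non-letter, including an apostrophe, as a word break, so contractions and possessives are capitalized in the middle. The sketch below compares it with string.capwords() from the standard library, which splits on whitespace only. The sample sentence is an illustrative assumption, not part of the original example.

```python
import string

text = "they're bill's friends"

# str.title() starts a new "word" after any non-letter, including apostrophes.
titled = text.title()
print(titled)   # They'Re Bill'S Friends

# string.capwords() splits on whitespace, capitalizes each piece, and rejoins.
capped = string.capwords(text)
print(capped)   # They're Bill's Friends
```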
Example 2: Regular Expressions for Pattern Matching
To find all words in the text that start with an uppercase letter, and all words that start with a lowercase letter:
import re
# Define a sample string
text = "Hello, World! This is a sample text."
# Match words starting with an uppercase letter
upper_case_pattern = r"\b[A-Z]\w*"
upper_case_matches = re.findall(upper_case_pattern, text)
print("Match Upper Case:", upper_case_matches)
# Match words starting with a lowercase letter
lower_case_pattern = r"\b[a-z]\w*"
lower_case_matches = re.findall(lower_case_pattern, text)
print("Match Lower Case:", lower_case_matches)
The result:
Match Upper Case: ['Hello', 'World', 'This']
Match Lower Case: ['is', 'a', 'sample', 'text']
Uppercase pattern: \b[A-Z]\w*
- \b: matches a word boundary, so the pattern can only begin matching at the start of a word.
- [A-Z]: matches any uppercase letter from A to Z.
- \w*: matches zero or more word characters (letters, digits, or underscores).
Lowercase pattern: \b[a-z]\w*
- \b: matches a word boundary, so the pattern can only begin matching at the start of a word.
- [a-z]: matches any lowercase letter from a to z.
- \w*: matches zero or more word characters (letters, digits, or underscores).
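To see why the \b anchor matters, the following sketch runs the lowercase pattern with and without it on the same sample string. Without the boundary, the pattern is free to start in the middle of a word, so it picks up the tails of capitalized words as well.

```python
import re

text = "Hello, World! This is a sample text."

# With \b, a match can only start at the beginning of a word.
with_boundary = re.findall(r"\b[a-z]\w*", text)
print(with_boundary)     # ['is', 'a', 'sample', 'text']

# Without \b, matches may start mid-word, at the first lowercase letter found.
without_boundary = re.findall(r"[a-z]\w*", text)
print(without_boundary)  # ['ello', 'orld', 'his', 'is', 'a', 'sample', 'text']
```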
Example 3: Replacing and Cleaning Text
To clean text by removing punctuation and other special characters (everything except letters, digits, underscores, and whitespace):
import re
# Sample text that includes special characters
text = "Hello, World! This is a sample text that includes #special characters."
# Remove every character that is not a word character or whitespace
cleaned_text = re.sub(r"[^\w\s]", "", text)
print("Original text:", text)
print("Cleaned text:", cleaned_text)
The result:
Original text: Hello, World! This is a sample text that includes #special characters.
Cleaned text: Hello World This is a sample text that includes special characters
In the pattern [^\w\s]:
- ^: placed first inside the brackets, negates the set, so the class matches any character NOT listed after it.
- \w: matches any word character, which includes alphanumeric characters (letters and digits) and underscores.
- \s: matches any whitespace character, such as spaces, tabs, and newline characters.
Combined: [^\w\s] matches any character that is neither a word character (\w) nor whitespace (\s).
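Because \w includes the underscore, the pattern above leaves underscores in place, and removing punctuation can also leave uneven spacing behind. A minimal sketch of a stricter cleaner follows; clean_text is a hypothetical helper name, and the choice to drop underscores and collapse whitespace is an assumption about what "clean" should mean for your data.

```python
import re

def clean_text(text):
    """Hypothetical helper: drop punctuation and underscores, then normalize spacing."""
    # [^\w\s] removes punctuation; the "_" alternative removes underscores,
    # which \w would otherwise keep.
    no_punct = re.sub(r"[^\w\s]|_", "", text)
    # Collapse each run of whitespace to a single space and trim both ends.
    return re.sub(r"\s+", " ", no_punct).strip()

print(clean_text("Hello,   World! snake_case #tag"))  # Hello World snakecase tag
```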
Example 4: Extracting Information
To extract email addresses from text using a regular expression pattern:
import re
# Sample text
text = "Text that includes email addresses john.doe@example.com and jane.smith@example.com"
# Extract email addresses
email_pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
emails = re.findall(email_pattern, text)
# Print the text and extracted emails
print("Text:", text)
print("Extracted Emails:", emails)
The result:
Text: Text that includes email addresses john.doe@example.com and jane.smith@example.com
Extracted Emails: ['john.doe@example.com', 'jane.smith@example.com']
Email pattern: \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b
- \b: matches a word boundary, so the match begins at the start of the email address.
- [A-Za-z0-9._%+-]+: matches one or more letters (uppercase or lowercase), digits, or the characters ., _, %, +, and - before the @ symbol.
- @: matches the @ symbol in the email address.
- [A-Za-z0-9.-]+: matches one or more letters (uppercase or lowercase), digits, or the characters . and - after the @ symbol.
- \.: matches the literal dot before the top-level domain.
- [A-Za-z]{2,}: matches two or more uppercase or lowercase letters, i.e., the top-level domain (e.g., com, org, net).
- \b: matches a word boundary, so the match stops at the end of the email address.
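A detail worth knowing when writing the top-level-domain class: inside square brackets, | is a literal pipe character rather than alternation, so a class written as [A-Z|a-z] would also accept |. The sketch below demonstrates this and wraps the extraction in a small reusable function. EMAIL_RE and extract_emails are hypothetical names, and the pattern is a simplified approximation rather than a full email validator.

```python
import re

# Inside a character class, "|" is just another character to match.
pipe_accepted = re.fullmatch(r"[A-Z|a-z]{2,}", "c|m") is not None
pipe_rejected = re.fullmatch(r"[A-Za-z]{2,}", "c|m") is None
print(pipe_accepted, pipe_rejected)  # True True

# Compile once so the pattern can be reused across many calls.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")

def extract_emails(text):
    """Return unique email-like strings in order of first appearance."""
    seen = set()
    result = []
    for match in EMAIL_RE.findall(text):
        key = match.lower()  # compare case-insensitively when de-duplicating
        if key not in seen:
            seen.add(key)
            result.append(match)
    return result

sample = "Contact john.doe@example.com, Jane.Smith@example.org, or john.doe@example.com."
print(extract_emails(sample))  # ['john.doe@example.com', 'Jane.Smith@example.org']
```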
Example 5: Stripping Whitespace
To remove leading and trailing whitespace from the text using strip():
text = " This is a text that contains whitespace. "
# Strip leading and trailing whitespace
stripped_text = text.strip()
print(stripped_text)
The result:
This is a text that contains whitespace.
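strip() has two one-sided relatives, lstrip() and rstrip(), and all three accept an optional string of characters to remove instead of whitespace. A short sketch, using a made-up sample string:

```python
raw = "  --Report Title--\n"

left = raw.lstrip()    # removes leading whitespace only
right = raw.rstrip()   # removes trailing whitespace only (including the newline)
print(repr(left))      # '--Report Title--\n'
print(repr(right))     # '  --Report Title--'

# Passing characters to strip() removes any of them from both ends.
title = raw.strip().strip("-")
print(title)           # Report Title
```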
Example 6: Joining Text
To join a list of words into a single sentence using join():
words = ["Hello", "World", "Python"]
# Joining a list of words into a sentence
sentence = " ".join(words)
print(sentence)
The result:
Hello World Python
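Two related points about join(): it only accepts strings, so other values must be converted first, and combining it with split() is a common way to normalize spacing. A brief sketch with illustrative sample values:

```python
# split() with no arguments splits on any run of whitespace and drops empty pieces,
# so joining the pieces with one space normalizes the spacing.
messy = "  This   has \t extra   spaces \n"
normalized = " ".join(messy.split())
print(normalized)  # This has extra spaces

# join() raises TypeError on non-string items, so convert them first.
numbers = [1, 2, 3]
csv_line = ", ".join(str(n) for n in numbers)
print(csv_line)    # 1, 2, 3
```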