Advanced String Manipulation Techniques in Python

Text data often needs to be cleaned, searched, or transformed before it is usable. Python's built-in string methods handle many of these tasks, and the standard library's re module (regular expressions) covers the pattern-based ones.

Here’s a guide with examples to help you master these techniques:

Examples

Example 1: Basic String Operations

To get the length of a string and convert it to uppercase, lowercase, and title case:

text = "Hello, World! This is a sample text."

# Basic string operations
print("Length:", len(text))
print("Uppercase:", text.upper())
print("Lowercase:", text.lower())
print("Title case:", text.title())

The result:

Length: 36
Uppercase: HELLO, WORLD! THIS IS A SAMPLE TEXT.
Lowercase: hello, world! this is a sample text.
Title case: Hello, World! This Is A Sample Text.
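
One caveat with title(): it treats every run of letters as a separate word, so the letter after an apostrophe is capitalized too. The sketch below shows that behavior and one possible alternative, string.capwords() from the standard library, which splits on whitespace only. The sample phrase is made up for illustration.

```python
import string

# Hypothetical sample text containing an apostrophe
phrase = "they're here"

# title() capitalizes the first letter of every run of letters,
# including the one that follows the apostrophe
titled = phrase.title()

# capwords() splits on whitespace only, so the apostrophe is left alone
capped = string.capwords(phrase)

print("title():", titled)
print("capwords():", capped)
```

Note that capwords() also collapses runs of whitespace into single spaces, which may or may not be wanted.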

Example 2: Regular Expressions for Pattern Matching

To find all words in the text that start with an uppercase letter, and all words that start with a lowercase letter:

import re

# Define a sample string
text = "Hello, World! This is a sample text."

# Match words starting with an uppercase letter
upper_case_pattern = r"\b[A-Z]\w*"
upper_case_matches = re.findall(upper_case_pattern, text)
print("Match Upper Case:", upper_case_matches)

# Match words starting with a lowercase letter
lower_case_pattern = r"\b[a-z]\w*"
lower_case_matches = re.findall(lower_case_pattern, text)
print("Match Lower Case:", lower_case_matches)

The result:

Match Upper Case: ['Hello', 'World', 'This']
Match Lower Case: ['is', 'a', 'sample', 'text']

Uppercase pattern: \b[A-Z]\w*

  • \b: matches a word boundary; placed before the letter class, it ensures that a match can only start at the beginning of a word.
  • [A-Z]: matches one uppercase ASCII letter from A to Z.
  • \w*: matches zero or more word characters (letters, digits, or underscores).

Lowercase pattern: \b[a-z]\w*

  • \b: matches a word boundary; placed before the letter class, it ensures that a match can only start at the beginning of a word.
  • [a-z]: matches one lowercase ASCII letter from a to z.
  • \w*: matches zero or more word characters (letters, digits, or underscores).
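
To see what \b contributes, the following sketch runs the lowercase pattern with and without it on the same sample string. Without the boundary, the pattern is free to start matching in the middle of a word.

```python
import re

text = "Hello, World! This is a sample text."

# With \b, a match can only begin at the start of a word
with_boundary = re.findall(r"\b[a-z]\w*", text)

# Without \b, the first lowercase letter inside any word starts a match
without_boundary = re.findall(r"[a-z]\w*", text)

print("With \\b:", with_boundary)
print("Without \\b:", without_boundary)
```

The second list picks up fragments such as 'ello' and 'orld', which is why the boundary matters for word-level matching.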

Example 3: Replacing and Cleaning Text

To clean text by removing punctuation and other special characters (everything except word characters and whitespace):

import re

# Sample text that includes special characters
text = "Hello, World! This is a sample text that includes #special characters."

# Remove everything except word characters and whitespace
cleaned_text = re.sub(r"[^\w\s]", "", text)
print("Original text:", text)
print("Cleaned text:", cleaned_text)

The result:

Original text: Hello, World! This is a sample text that includes #special characters.
Cleaned text: Hello World This is a sample text that includes special characters

In the pattern [^\w\s]:

  • ^: placed first inside the brackets, negates the set, so the class matches any character NOT listed after it.
  • \w: matches any word character, which includes alphanumeric characters (letters and digits) and underscores.
  • \s: matches any whitespace character, such as spaces, tabs, and newline characters.

Combined: [^\w\s] matches any character that is NOT a word character (\w) or whitespace (\s).
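
Because \w includes the underscore, this pattern leaves underscores in place. If they should be removed as well, one option is to list the allowed characters explicitly. A minimal sketch, using a made-up input string:

```python
import re

# Hypothetical input containing an underscore and punctuation
raw = "snake_case & kebab-case!"

# \w counts "_" as a word character, so it survives
keeps_underscore = re.sub(r"[^\w\s]", "", raw)

# Listing ASCII letters and digits explicitly removes "_" as well
drops_underscore = re.sub(r"[^A-Za-z0-9\s]", "", raw)

# Removing "&" leaves two adjacent spaces; collapse whitespace runs
collapsed = re.sub(r"\s+", " ", drops_underscore).strip()

print(repr(keeps_underscore))
print(repr(drops_underscore))
print(repr(collapsed))
```

The explicit class is ASCII-only, so it also strips accented letters that \w would keep; which behavior is preferable depends on the data.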

Example 4: Extracting Information

To extract email addresses from the text using a regular expression pattern:

import re

# Sample text
text = "Text that includes email addresses john.doe@example.com and jane.smith@example.com"

# Extract email addresses
email_pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
emails = re.findall(email_pattern, text)

# Print the text and extracted emails
print("Text:", text)
print("Extracted Emails:", emails)

The result:

Text: Text that includes email addresses john.doe@example.com and jane.smith@example.com
Extracted Emails: ['john.doe@example.com', 'jane.smith@example.com']

Email pattern: \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b

  • \b: matches a word boundary, so the match starts at the beginning of the address.
  • [A-Za-z0-9._%+-]+: matches one or more letters (uppercase or lowercase), digits, or the characters ., _, %, +, and - before the @ symbol.
  • @: matches the @ symbol in the email address.
  • [A-Za-z0-9.-]+: matches one or more letters (uppercase or lowercase), digits, or the characters . and - after the @ symbol.
  • \.: matches the literal dot before the top-level domain.
  • [A-Za-z]{2,}: matches two or more uppercase or lowercase letters, that is, the top-level domain (e.g., com, org, net).
  • \b: matches a word boundary, so the match ends at the end of the address.
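
Inside a character class, | is a literal pipe rather than an alternation operator, so a class written as [A-Z|a-z] accepts | in addition to letters, while [A-Za-z] accepts letters only. The sketch below compares the two on a made-up malformed address; the variable names are illustrative.

```python
import re

# Inside [...], "|" is an ordinary character, not an "or" operator
with_pipe = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b"
letters_only = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"

# Hypothetical malformed address with a pipe in the top-level domain
sample = "Contact: user@example.c|m"

matches_with_pipe = re.findall(with_pipe, sample)
matches_letters_only = re.findall(letters_only, sample)

print("With | in the class:", matches_with_pipe)
print("Letters only:", matches_letters_only)
```

Either way, a pattern like this is a practical filter for typical addresses and not a full validator of the email address grammar.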

Example 5: Stripping Whitespace

To remove leading and trailing whitespace from the text using strip():

text = "  This is a text that contains whitespace.   "

# Strip leading and trailing whitespace
stripped_text = text.strip()

print(stripped_text)

The result:

This is a text that contains whitespace.
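
When only one side should be trimmed, lstrip() and rstrip() work the same way on the left and right ends. strip() also accepts a string of characters to remove in place of whitespace. A short sketch, where the decorated string is a made-up example:

```python
text = "  This is a text that contains whitespace.   "

# Trim only the left or only the right side
left_stripped = text.lstrip()
right_stripped = text.rstrip()

# Passing characters to strip() removes any of them from both ends
decorated = "--==Title==--"  # hypothetical sample
undecorated = decorated.strip("-=")

# repr() makes the remaining whitespace visible
print(repr(left_stripped))
print(repr(right_stripped))
print(repr(undecorated))
```

The argument to strip() is treated as a set of characters, not as a prefix or suffix, so their order does not matter.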

Example 6: Joining Text

To join a list of words into a single sentence using join():

words = ["Hello", "World", "Python"]

# Joining a list of words into a sentence
sentence = " ".join(words)

print(sentence)

The result:

Hello World Python
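
join() is called on the separator and expects every item to be a string, so other types need converting first. The sketch below shows that conversion and how split() reverses a join on the same separator; the list of numbers is an illustrative assumption.

```python
# join() raises TypeError for non-string items, so convert them first
numbers = [1, 2, 3]  # hypothetical sample data
csv_line = ", ".join(str(n) for n in numbers)

# split() on the same separator recovers the original words
sentence = "Hello World Python"
words_again = sentence.split(" ")

print(csv_line)
print(words_again)
```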
