re Module Complexity

The re module provides regular expression matching operations. CPython's engine is backtracking-based, so worst-case cost depends heavily on the pattern, not just the input length.

Pattern Compilation and Matching

Operation                   Time                  Space            Notes
re.compile(pattern)         O(n)                  O(n)             n = pattern length
pattern.match(string)       O(m) typical*         O(m)             m = string length; anchored at start
pattern.search(string)      O(m) typical*         O(m)             Scans the full string
pattern.findall(string)     O(m) typical*         O(k)             k = number of matches
pattern.finditer(string)    O(m) total, lazy      O(1) per match   Matches produced on demand
pattern.sub(repl, string)   O(m) typical*         O(m)             Builds a new string
pattern.split(string)       O(m) typical*         O(m)             Builds a list of pieces
re.match(pattern, string)   O(n + m) first call*  O(m)             Compiles + matches; cached (see below)
re.search(pattern, string)  O(n + m) first call*  O(m)             Compiles + matches; cached (see below)

*Worst case: ambiguous patterns (e.g. nested quantifiers) can trigger catastrophic backtracking, which is exponential in the string length; typical cases are far better.

Pattern Caching

Operation                 Time                       Space    Notes
re.match(pattern, s)      O(n + m) first, O(m) then  O(n+m)   Compiles on first call, cache hit after
compiled = re.compile(p)  O(n)                       O(n)     Explicit compilation
compiled.match(s)         O(m) typical               O(m)     No compile cost, no cache lookup

CPython automatically caches up to 512 compiled patterns (re._MAXCACHE); module-level functions reuse cached patterns instead of recompiling.
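
One way to observe the cache is to clear it with re.purge() and compare the first module-level call against later ones; a rough sketch (absolute timings will vary by machine):

import re
import timeit

re.purge()  # clear the internal pattern cache

# First call compiles the pattern; subsequent calls hit the cache
first = timeit.timeit(lambda: re.search(r'\d+', 'abc 123'), number=1)
cached = timeit.timeit(lambda: re.search(r'\d+', 'abc 123'), number=1000) / 1000
print(f"first: {first:.2e}s  cached: {cached:.2e}s")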

Common Operations

Basic Pattern Matching

import re

# Compile pattern once - O(n) where n = pattern length
pattern = re.compile(r'\d+')  # O(n)

# Use compiled pattern - O(m) per match, m = string length
text = "Number: 12345"
match = pattern.search(text)  # O(m)

if match:
    print(match.group())  # O(1)

Finding All Matches

import re

pattern = re.compile(r'\w+')
text = "Hello world from Python"

# Find all - O(m) where m = text length
matches = pattern.findall(text)  # O(m)
# Result: ['Hello', 'world', 'from', 'Python']

# Lazy iteration - O(1) extra memory, matches produced on demand
for match in pattern.finditer(text):  # constant extra memory
    print(match.group())
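
Match objects from finditer() also record where each match occurred; start(), end(), and span() are O(1) lookups:

import re

pattern = re.compile(r'\w+')
for match in pattern.finditer("Hello world"):
    # span() returns the (start, end) offsets recorded during matching - O(1)
    print(match.group(), match.span())
# Output: Hello (0, 5) then world (6, 11)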

Substitution

import re

pattern = re.compile(r'\d+')
text = "Numbers: 10, 20, 30"

# Replace all - O(m)
result = pattern.sub('X', text)  # O(m)
# Result: "Numbers: X, X, X"

# Replace with a function - O(m) for matching plus one call per match
def replace_func(match):
    return str(int(match.group()) * 2)

result = pattern.sub(replace_func, text)  # O(m) + per-match call cost
# Result: "Numbers: 20, 40, 60"

Splitting

import re

pattern = re.compile(r',\s*')
text = "apple, banana, cherry"

# Split by pattern - O(m)
parts = pattern.split(text)  # O(m)
# Result: ['apple', 'banana', 'cherry']
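
If the pattern contains capturing groups, split() also returns the separators, which makes it possible to reassemble the original string:

import re

pattern = re.compile(r'(,\s*)')  # capturing group keeps the separators
parts = pattern.split("apple, banana, cherry")
# Result: ['apple', ', ', 'banana', ', ', 'cherry']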

Grouping and Extraction

import re

pattern = re.compile(r'(\d+)-(\w+)')
text = "123-abc"

# Extract groups - O(m)
match = pattern.search(text)  # O(m)
if match:
    full = match.group(0)    # O(1) - full match
    num = match.group(1)     # O(1) - first group
    word = match.group(2)    # O(1) - second group

    # Get all groups as a tuple - O(g), g = number of groups
    groups = match.groups()  # ('123', 'abc')
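
Named groups ((?P<name>...)) make extraction self-documenting; lookup by name is still O(1):

import re

pattern = re.compile(r'(?P<num>\d+)-(?P<word>\w+)')
match = pattern.search("123-abc")
if match:
    print(match.group('num'))   # '123' - O(1) lookup by name
    print(match.groupdict())    # {'num': '123', 'word': 'abc'}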

Pattern Complexity

Simple Patterns (Linear Matching)

import re

# Simple patterns - O(m) matching
pattern = re.compile(r'hello')
text = "hello world" * 1000

match = pattern.search(text)  # O(m) - linear scan

Complex Patterns (Potential Exponential)

import re

# Be careful with patterns that can cause backtracking
# This pattern can cause exponential backtracking on non-matches
pattern = re.compile(r'(a+)+b')
text = 'a' * 25  # No 'b' at end

# This can be very slow!
# match = pattern.search(text)  # Potentially exponential!
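
The nested quantifier buys nothing here: (a+)+b matches exactly the same strings as a+b, and the simpler form fails quickly because there is only one way to partition the a's:

import re

# Equivalent pattern without the nested quantifier
safe_pattern = re.compile(r'a+b')
text = 'a' * 25
print(safe_pattern.search(text))  # None - fails fast, no ambiguity to backtrack over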

Catastrophic Backtracking Examples

import re
import time

# AVOID: nested quantifiers plus a forced failure cause backtracking
bad_pattern = re.compile(r'(a+)+$')

# Time a match that cannot succeed ('$' fails after the trailing 'b')
start = time.perf_counter()
bad_pattern.search('a' * 25 + 'b')
end = time.perf_counter()
print(f"Time: {end - start}s")  # Can take seconds - exponential backtracking

# BETTER: rewrite the pattern without the nested quantifier
good_pattern = re.compile(r'a+$')
start = time.perf_counter()
good_pattern.search('a' * 1000000)
end = time.perf_counter()
print(f"Time: {end - start}s")  # Fast - linear scan

Performance Tips

Compile Patterns Once

import re

# Bad: pays compile/cache-lookup overhead on every call
for line in large_file:
    if re.search(r'\d+', line):  # cache lookup + match each time
        process(line)

# Good: compile once - O(n)
pattern = re.compile(r'\d+')  # O(n)
for line in large_file:
    if pattern.search(line):  # O(m) per line
        process(line)
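
Because of the cache, the module-level form is usually not quadratic in practice, but it still pays a cache lookup plus argument handling on every call. A rough benchmark sketch (the data here is illustrative):

import re
import timeit

lines = ['line 42'] * 10000

def module_level():
    return [ln for ln in lines if re.search(r'\d+', ln)]

def precompiled():
    pat = re.compile(r'\d+')
    return [ln for ln in lines if pat.search(ln)]

print(timeit.timeit(module_level, number=10))
print(timeit.timeit(precompiled, number=10))  # typically faster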

Use Lazy Iteration

import re

pattern = re.compile(r'\w+')
text = "word1 word2 word3 ... word10000"

# Bad: materializes all matches at once - O(k) memory, k = match count
all_matches = pattern.findall(text)  # whole list in memory

# Good: lazy iteration - O(1) memory
for match in pattern.finditer(text):  # O(1) memory
    process(match)

Anchors for Efficiency

import re

# Without anchors, the engine retries the match at every offset
pattern = re.compile(r'start.*end')
text = "start middle end"
match = pattern.search(text)  # O(m), retried from each position on failure

# Better: anchors pin the match, so non-matching strings fail fast
pattern = re.compile(r'^start.*?end$')
text = "start middle end"
match = pattern.search(text)  # fails quickly when the anchors cannot hold
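
For validating an entire string, fullmatch() (Python 3.4+) is clearer than hand-written ^...$ anchors and succeeds only if the pattern spans the whole input:

import re

pattern = re.compile(r'start.*?end')
print(pattern.fullmatch("start middle end"))    # match object
print(pattern.fullmatch("x start middle end"))  # None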

Special Considerations

Raw Strings

import re

# Use raw strings to avoid double escaping
pattern = re.compile(r'\d+')  # O(n)
# NOT: re.compile('\\d+')  # confusing

# For file paths
pattern = re.compile(r'C:\\Users\\.*')  # Windows paths
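
When a pattern must contain literal text from elsewhere (user input, file names), re.escape() backslash-escapes every regex metacharacter:

import re

literal = "3.5 (approx)"
pattern = re.compile(re.escape(literal))  # '.', '(', ')' neutralized
print(pattern.search("value: 3.5 (approx)"))  # matches the literal text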

Groups and Performance

import re

# Capturing groups have slight overhead
pattern = re.compile(r'(\d+)')  # With group
text = "12345"
match = pattern.search(text)  # O(m) + group overhead

# Non-capturing group - slightly faster
pattern = re.compile(r'(?:\d+)')  # Non-capturing
match = pattern.search(text)  # O(m)

Version Notes

  • Python 3.11+: re supports atomic groups ((?>...)) and possessive quantifiers
  • Python 3.4+: pattern.fullmatch() available for whole-string matching
  • Third-party regex module (PyPI) is a drop-in alternative with extra features

Best Practices

Do:

  • Compile patterns once, reuse them
  • Use raw strings (r'...')
  • Use pattern.finditer() for lazy matching
  • Test patterns on realistic data
  • Use anchors (^, $) to limit search

Avoid:

  • Nested quantifiers (a+)+
  • Alternation with overlap (foo|fo)
  • Recompiling patterns in loops
  • Greedy quantifiers where not needed (use .*?)
  • Testing without considering catastrophic backtracking