Python Regular Expressions — Complete Guide with Examples
Regular expressions are the Swiss army knife of text processing. They're ugly, powerful, and once you learn them, you'll use them everywhere — log parsing, data validation, web scraping, text cleanup. This guide covers Python's re module from basics to advanced patterns, with real examples you can actually use.
The Basics: re Module Functions
import re
text = "Order #12345 was placed on 2026-03-15 for $299.99"
# re.search — find first match anywhere in string
match = re.search(r"#(\d+)", text)
if match:
    print(match.group())   # "#12345"
    print(match.group(1))  # "12345" (captured group)
# re.match — match only at the START of string
match = re.match(r"Order", text) # ✅ matches
match = re.match(r"placed", text) # ❌ None (not at start)
# re.findall — find ALL matches, return list
numbers = re.findall(r"\d+", text)
print(numbers) # ['12345', '2026', '03', '15', '299', '99']
# re.findall with groups — returns captured groups only
dates = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)
print(dates) # [('2026', '03', '15')]
# re.sub — search and replace
cleaned = re.sub(r"\$[\d.]+", "[PRICE]", text)
print(cleaned) # "Order #12345 was placed on 2026-03-15 for [PRICE]"
# re.split — split by pattern
parts = re.split(r"\s+", "hello world foo")
print(parts) # ['hello', 'world', 'foo']
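One more function worth knowing alongside the ones above: re.finditer. It works like findall but yields Match objects lazily, which is handy when you need positions as well as text. A quick sketch on the same example string:

```python
import re

text = "Order #12345 was placed on 2026-03-15 for $299.99"

# re.finditer — yields one Match object per non-overlapping match
for m in re.finditer(r"\d+", text):
    print(m.group(), m.start(), m.end())
# prints each number with its start/end offsets, e.g. "12345 7 12"
```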
💡 Always use raw strings: Write r"\d+", not "\\d+". The r prefix prevents Python from interpreting backslashes before regex sees them.
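A two-line check makes the difference concrete:

```python
# "\b" is one character (a backspace); r"\b" is two characters
# (backslash + "b"), which the regex engine reads as a word boundary
print(len("\b"))   # 1
print(len(r"\b"))  # 2
```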
Pattern Syntax Cheat Sheet
| Pattern | Matches | Example |
|---|---|---|
| . | Any character (except newline) | a.c → "abc", "a1c" |
| \d | Digit [0-9] | \d{3} → "123" |
| \w | Word char [a-zA-Z0-9_] | \w+ → "hello_42" |
| \s | Whitespace [ \t\n\r\f\v] | \s+ → " \t" |
| \D, \W, \S | Negated versions | \D+ → "abc" |
| ^ | Start of string | ^Hello |
| $ | End of string | world$ |
| * | 0 or more | ab*c → "ac", "abbc" |
| + | 1 or more | ab+c → "abc", "abbc" |
| ? | 0 or 1 | colou?r → "color", "colour" |
| {n,m} | Between n and m | \d{2,4} → "12", "1234" |
| [abc] | Character class | [aeiou] → vowels |
| [^abc] | Negated class | [^0-9] → non-digits |
| a|b | Alternation (or) | cat|dog |
| (...) | Capture group | (\d{4})-(\d{2}) |
| (?:...) | Non-capturing group | (?:cat|dog)s |
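A few rows from the table, exercised directly (a quick sanity-check sketch; the strings are illustrative):

```python
import re

assert re.fullmatch(r"a.c", "a1c")                       # . matches any char
assert re.fullmatch(r"colou?r", "colour")                # ? makes "u" optional
assert re.findall(r"[^0-9]+", "ab12cd") == ["ab", "cd"]  # negated class
assert re.fullmatch(r"(?:cat|dog)s", "dogs")             # non-capturing group
assert not re.fullmatch(r"\d{2,4}", "12345")             # 5 digits is too many
print("all cheat-sheet checks pass")
```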
Groups and Named Captures
import re
# --- Basic groups ---
log = '2026-03-15 14:30:22 ERROR [auth] Login failed for user@example.com'
match = re.search(
    r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) \[(\w+)\] (.+)",
    log,
)
if match:
    date, time, level, module, message = match.groups()
    print(f"Level: {level}, Module: {module}")
    # Level: ERROR, Module: auth
# --- Named groups (?P<name>...) — much more readable ---
pattern = r"""
    (?P<date>\d{4}-\d{2}-\d{2})\s+
    (?P<time>\d{2}:\d{2}:\d{2})\s+
    (?P<level>\w+)\s+
    \[(?P<module>\w+)\]\s+
    (?P<message>.+)
"""
match = re.search(pattern, log, re.VERBOSE)
if match:
    print(match.group("level"))   # "ERROR"
    print(match.group("module"))  # "auth"
    print(match.groupdict())
    # {'date': '2026-03-15', 'time': '14:30:22', 'level': 'ERROR',
    #  'module': 'auth', 'message': 'Login failed for user@example.com'}
# --- Backreferences ---
# Find repeated words
text = "the the quick brown fox fox"
dupes = re.findall(r"\b(\w+)\s+\1\b", text)
print(dupes) # ['the', 'fox']
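Backreferences also work in the replacement string of re.sub, so the same idea can collapse the duplicates instead of just reporting them:

```python
import re

text = "the the quick brown fox fox"
# \1 in the replacement re-emits the first captured group once
deduped = re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text)
print(deduped)  # "the quick brown fox"
```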
💡 Use re.VERBOSE for complex patterns: The re.VERBOSE (or re.X) flag lets you add whitespace and comments inside patterns. Makes regex maintainable.
Lookaheads and Lookbehinds
Zero-width assertions — they check what's around a match without including it in the result.
import re
# --- Lookahead (?=...) — match only if followed by ---
# Find numbers followed by "px"
text = "width: 100px; height: 200em; margin: 10px"
px_values = re.findall(r"\d+(?=px)", text)
print(px_values) # ['100', '10']
# --- Negative lookahead (?!...) — match only if NOT followed by ---
# Find numbers NOT followed by "px"
# The extra (?!\d) stops the engine from backtracking to a shorter
# match, like the "10" inside "100px"
non_px = re.findall(r"\b\d+(?!\d|px)", text)
print(non_px) # ['200']
# --- Lookbehind (?<=...) — match only if preceded by ---
# Extract price values after "$"
prices = "Items: $29.99, €15.50, $149.00, £30"
usd = re.findall(r"(?<=\$)\d+\.\d{2}", prices)
print(usd) # ['29.99', '149.00']
# --- Negative lookbehind (?<!...) — match only if NOT preceded by ---
# Find words not preceded by "@" (exclude mentions)
text = "hello @world from python"
words = re.findall(r"(?<!@)\b\w+\b", text)
print(words) # ['hello', 'from', 'python']
# --- Practical: password validation with lookaheads ---
def validate_password(password: str) -> tuple[bool, list[str]]:
    """Check password strength using lookaheads."""
    errors = []
    if len(password) < 8:
        errors.append("At least 8 characters")
    if not re.search(r"(?=.*[A-Z])", password):
        errors.append("At least one uppercase letter")
    if not re.search(r"(?=.*[a-z])", password):
        errors.append("At least one lowercase letter")
    if not re.search(r"(?=.*\d)", password):
        errors.append("At least one digit")
    if not re.search(r"(?=.*[!@#$%^&*])", password):
        errors.append("At least one special character")
    return len(errors) == 0, errors
ok, issues = validate_password("Hello123!")
print(ok, issues) # True []
ok, issues = validate_password("hello")
print(ok, issues) # False ['At least 8 characters', 'At least one uppercase letter', ...]
Compiled Patterns (Performance)
import re
# Compile once, reuse many times. The re module caches recently used
# patterns, but compiling skips the per-call cache lookup, makes flags
# explicit, and gives you pattern methods like .match() and .finditer()
EMAIL_RE = re.compile(
    r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    re.IGNORECASE,
)
# Use the compiled pattern
emails = [
    "alice@example.com",
    "not-an-email",
    "bob@company.co.uk",
    "bad@",
]
valid = [e for e in emails if EMAIL_RE.fullmatch(e)]
print(valid) # ['alice@example.com', 'bob@company.co.uk']
# Compile with flags
LOG_PATTERN = re.compile(r"""
    ^(?P<ip>[\d.]+)\s+       # IP address
    -\s+-\s+                 # Ident and auth (usually -)
    \[(?P<date>[^\]]+)\]\s+  # Date in brackets
    "(?P<method>\w+)\s+      # HTTP method
    (?P<path>\S+)\s+         # Request path
    HTTP/[\d.]+"             # HTTP version
    \s+(?P<status>\d{3})     # Status code
    \s+(?P<size>\d+)         # Response size
""", re.VERBOSE)
log_line = '192.168.1.1 - - [26/Mar/2026:14:30:22 +0000] "GET /api/users HTTP/1.1" 200 4523'
match = LOG_PATTERN.match(log_line)
if match:
    print(match.groupdict())
    # {'ip': '192.168.1.1', 'date': '26/Mar/2026:14:30:22 +0000',
    #  'method': 'GET', 'path': '/api/users', 'status': '200', 'size': '4523'}
Practical Patterns You'll Actually Use
Extract data from structured text
import re
# --- Parse key-value pairs ---
config_text = """
host = localhost
port = 5432
database = myapp
debug = true
"""
pairs = dict(re.findall(r"^(\w+)\s*=\s*(.+)$", config_text, re.MULTILINE))
print(pairs)
# {'host': 'localhost', 'port': '5432', 'database': 'myapp', 'debug': 'true'}
# --- Extract URLs from text ---
text = "Visit https://example.com or http://api.test.io/v2/data?key=123"
urls = re.findall(r"https?://[^\s<>\"']+", text)
print(urls)
# ['https://example.com', 'http://api.test.io/v2/data?key=123']
# --- Parse CSV-like data (handling quoted fields) ---
line = 'John,"Smith, Jr.",42,"New York"'
fields = re.findall(r'(?:"([^"]*)")|([^,]+)', line)
values = [quoted or unquoted for quoted, unquoted in fields]
print(values) # ['John', 'Smith, Jr.', '42', 'New York']
# --- Clean HTML tags ---
html = "<p>Hello <b>world</b>, this is <a href='#'>a link</a>.</p>"
clean = re.sub(r"<[^>]+>", "", html)
print(clean) # "Hello world, this is a link."
Data validation patterns
import re
PATTERNS = {
    # Phone: international format
    "phone": re.compile(r"^\+?1?\d{9,15}$"),
    # IPv4 address
    "ipv4": re.compile(
        r"^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}"
        r"(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$"
    ),
    # ISO date (YYYY-MM-DD)
    "date": re.compile(r"^\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])$"),
    # Hex color (#RGB or #RRGGBB)
    "hex_color": re.compile(r"^#(?:[0-9a-fA-F]{3}){1,2}$"),
    # Slug (URL-safe string)
    "slug": re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$"),
    # Semantic version
    "semver": re.compile(r"^\d+\.\d+\.\d+(?:-[\w.]+)?(?:\+[\w.]+)?$"),
    # UUID v4
    "uuid": re.compile(
        r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$",
        re.IGNORECASE,
    ),
}
def validate(value: str, pattern_name: str) -> bool:
    pattern = PATTERNS.get(pattern_name)
    if not pattern:
        raise ValueError(f"Unknown pattern: {pattern_name}")
    # Every pattern above is anchored with ^...$, so match() is safe here;
    # with unanchored patterns, prefer fullmatch() for validation
    return bool(pattern.match(value))
# Test
assert validate("192.168.1.1", "ipv4")
assert validate("2026-03-15", "date")
assert validate("#ff6600", "hex_color")
assert validate("hello-world-42", "slug")
assert validate("2.1.0-beta.1", "semver")
assert not validate("999.1.1.1", "ipv4")
assert not validate("2026-13-01", "date")
Log parsing and data extraction
import re
from collections import Counter
def parse_nginx_logs(log_text: str) -> list[dict]:
    """Parse nginx access log into structured records."""
    pattern = re.compile(
        r'(?P<ip>[\d.]+)\s+'
        r'- - '
        r'\[(?P<date>[^\]]+)\]\s+'
        r'"(?P<method>\w+)\s+(?P<path>\S+)\s+HTTP/[\d.]+"\s+'
        r'(?P<status>\d{3})\s+'
        r'(?P<size>\d+)\s+'
        r'"(?P<referrer>[^"]*)"\s+'
        r'"(?P<user_agent>[^"]*)"'
    )
    records = []
    for line in log_text.strip().split("\n"):
        match = pattern.match(line)
        if match:
            d = match.groupdict()
            d["status"] = int(d["status"])
            d["size"] = int(d["size"])
            records.append(d)
    return records
def analyze_logs(records: list[dict]):
    """Quick analysis of parsed logs."""
    status_counts = Counter(r["status"] for r in records)
    top_paths = Counter(r["path"] for r in records).most_common(10)
    errors = [r for r in records if r["status"] >= 400]
    total_bytes = sum(r["size"] for r in records)
    return {
        "total_requests": len(records),
        "status_distribution": dict(status_counts),
        "top_paths": top_paths,
        "error_count": len(errors),
        "total_bytes": total_bytes,
    }
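The same pattern and Counter can be wired together directly. A self-contained sketch on two made-up log lines (the pattern mirrors the one in parse_nginx_logs; the IPs, paths, and user agents are illustrative):

```python
import re
from collections import Counter

# Same structure as the parse_nginx_logs pattern above
pattern = re.compile(
    r'(?P<ip>[\d.]+)\s+- - \[(?P<date>[^\]]+)\]\s+'
    r'"(?P<method>\w+)\s+(?P<path>\S+)\s+HTTP/[\d.]+"\s+'
    r'(?P<status>\d{3})\s+(?P<size>\d+)\s+'
    r'"(?P<referrer>[^"]*)"\s+"(?P<user_agent>[^"]*)"'
)
lines = [
    '192.168.1.1 - - [26/Mar/2026:14:30:22 +0000] '
    '"GET /api/users HTTP/1.1" 200 4523 "-" "curl/8.0"',
    '10.0.0.5 - - [26/Mar/2026:14:31:02 +0000] '
    '"GET /missing HTTP/1.1" 404 152 "-" "curl/8.0"',
]
records = [m.groupdict() for m in map(pattern.match, lines) if m]
status_counts = Counter(int(r["status"]) for r in records)
print(len(records), dict(status_counts))  # 2 {200: 1, 404: 1}
```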
Advanced: re.sub with Functions
import re
# --- Replace with a function (dynamic replacements) ---
def censor_emails(text: str) -> str:
    """Replace email addresses with censored versions."""
    def mask(match):
        email = match.group()
        user, domain = email.split("@")
        return f"{user[0]}***@{domain}"
    return re.sub(
        r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
        mask,
        text,
    )
text = "Contact alice@example.com or bob.smith@company.co.uk"
print(censor_emails(text))
# "Contact a***@example.com or b***@company.co.uk"
# --- Template substitution ---
def expand_template(template: str, variables: dict) -> str:
    """Replace {{variable}} placeholders."""
    def replacer(match):
        key = match.group(1).strip()
        return str(variables.get(key, match.group()))
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", replacer, template)
result = expand_template(
"Hello {{name}}, your order #{{order_id}} is {{status}}.",
{"name": "Alice", "order_id": "12345", "status": "shipped"},
)
print(result) # "Hello Alice, your order #12345 is shipped."
# --- Convert camelCase to snake_case ---
def camel_to_snake(name: str) -> str:
    # Insert _ before uppercase letters, then lowercase everything
    s1 = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s1).lower()
print(camel_to_snake("getUserById")) # "get_user_by_id"
print(camel_to_snake("HTTPSConnection")) # "https_connection"
print(camel_to_snake("parseJSON")) # "parse_json"
Flags Reference
| Flag | Short | Effect |
|---|---|---|
| re.IGNORECASE | re.I | Case-insensitive matching |
| re.MULTILINE | re.M | ^ and $ match line start/end |
| re.DOTALL | re.S | . matches newlines too |
| re.VERBOSE | re.X | Allow whitespace and comments in pattern |
| re.ASCII | re.A | \w, \d match ASCII only |
# Combine flags with |
pattern = re.compile(r"""
    ^from:\s+  # Sender line
    (.+)       # Capture sender name/email
    $
""", re.VERBOSE | re.IGNORECASE | re.MULTILINE)
Common Pitfalls
- Greedy vs lazy: .* is greedy (matches as much as possible). Use .*? for lazy (minimal) matching. Critical when extracting content between delimiters.
- Forgetting raw strings: "\b" is a backspace character, r"\b" is a word boundary. Always use r"".
- Catastrophic backtracking: Patterns like (a+)+b can hang on non-matching input. Avoid nested quantifiers.
- Using regex for HTML: Don't parse HTML with regex. Use BeautifulSoup or lxml. Regex is fine for simple tag stripping, not for DOM manipulation.
- fullmatch vs match vs search: match() matches at the start, search() matches anywhere, fullmatch() matches the entire string. Use fullmatch() for validation.
import re

# Greedy vs lazy
html = "<b>hello</b> and <b>world</b>"
# Greedy — matches too much
print(re.findall(r"<b>(.*)</b>", html))
# ['hello</b> and <b>world'] ← wrong!
# Lazy — matches correctly
print(re.findall(r"<b>(.*?)</b>", html))
# ['hello', 'world'] ← correct!
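And the match/search/fullmatch distinction from the list above, in one snippet:

```python
import re

s = "2026-03-15 is a date"
date_re = r"\d{4}-\d{2}-\d{2}"
print(bool(re.match(date_re, s)))      # True: pattern found at the start
print(bool(re.search(date_re, s)))     # True: pattern found anywhere
print(bool(re.fullmatch(date_re, s)))  # False: trailing text remains
print(bool(re.fullmatch(date_re, "2026-03-15")))  # True: whole string matches
```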
When NOT to Use Regex
Regex isn't always the answer. Python has better tools for some tasks:
- Simple string checks: Use str.startswith(), str.endswith(), "x" in s
- Splitting on a fixed delimiter: Use str.split(",")
- Replacing a fixed string: Use str.replace()
- Parsing JSON/XML/HTML: Use json, lxml, BeautifulSoup
- Complex grammars: Use a parser library like pyparsing or write a proper parser
- Path manipulation: Use pathlib
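For instance, the plain-string alternatives in action:

```python
s = "alice,bob,carol"

# None of these need a regex
print(s.startswith("alice"))     # True
print(s.split(","))              # ['alice', 'bob', 'carol']
print(s.replace("bob", "dave"))  # alice,dave,carol
print("carol" in s)              # True
```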
Related Articles
- Web Scraping with Python — regex for extracting data from pages
- Python Logging & Monitoring — parse log files with regex
- Build a Data Pipeline in Python — regex in ETL transforms
- Build a Professional CLI Tool — input validation with regex
- 5 Python Scripts Every Developer Should Have
Need custom text processing or data extraction scripts? I build Python automation tools and data pipelines. Reach out on Telegram →