Python Performance Optimization — Profile, Tune & Speed Up Your Code
Python is "slow" — and it doesn't matter for 90% of what you build. But for the other 10%, you need to know where the time goes and how to fix it. This guide covers the full optimization workflow: measure, identify bottlenecks, optimize what matters, and verify the improvement.
Step 1: Profiling — Find the Bottleneck
Quick timing with timeit
import timeit
# Time a single expression
result = timeit.timeit(
    'sum(range(10_000))',
    number=1000,
)
print(f"{result:.3f}s for 1000 iterations")
# Time a function
def my_function():
    return [x ** 2 for x in range(10_000)]
result = timeit.timeit(my_function, number=100)
print(f"{result:.3f}s for 100 calls")
# Compare two approaches
setup = "data = list(range(10_000))"
t1 = timeit.timeit("sorted(data)", setup=setup, number=1000)
t2 = timeit.timeit("data.sort()", setup=setup, number=1000)
print(f"sorted(): {t1:.3f}s | .sort(): {t2:.3f}s")
# .sort() is faster — in-place, no copy
cProfile — function-level profiling
import cProfile
import pstats
from io import StringIO
def profile_it(func, *args, **kwargs):
    """Profile a function and print top bottlenecks."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args, **kwargs)
    profiler.disable()
    stream = StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative")
    stats.print_stats(20)  # top 20 functions
    print(stream.getvalue())
    return result
# Example: profile a data processing pipeline
def process_data():
    import json
    data = [{"id": i, "value": i ** 2} for i in range(100_000)]
    serialized = json.dumps(data)
    parsed = json.loads(serialized)
    filtered = [d for d in parsed if d["value"] > 50_000]
    return sorted(filtered, key=lambda x: x["value"], reverse=True)[:100]
profile_it(process_data)
# From command line:
python -m cProfile -s cumulative my_script.py
# Save profile for later analysis:
python -m cProfile -o profile.prof my_script.py
# Visualize with snakeviz (pip install snakeviz):
snakeviz profile.prof
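A saved profile can also be reloaded in Python with pstats and re-sorted or filtered after the fact. A minimal sketch, assuming profile.prof was produced by the command above:
import pstats

stats = pstats.Stats("profile.prof")            # load the saved profile
stats.sort_stats("cumulative").print_stats(10)  # top 10 by cumulative time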
line_profiler — line-by-line timing
# pip install line_profiler
# Add @profile decorator to functions you want to profile
@profile
def slow_function(n):
    total = 0
    for i in range(n):                            # How much time here?
        total += i ** 2                           # vs here?
    result = [x for x in range(n) if x % 2 == 0]  # vs here?
    return total, result
# Run with:
# kernprof -l -v my_script.py
# Output shows time per line:
# Line #     Hits    Time   Per Hit   % Time  Line Contents
# ==========================================================
#      4        1     0.0       0.0      0.0  total = 0
#      5   100001    12.3       0.0     15.2  for i in range(n):
#      6   100000    55.7       0.0     68.8      total += i ** 2
#      7        1    12.9      12.9     15.9  result = [...]
memory_profiler — track memory usage
# pip install memory_profiler
from memory_profiler import profile as mem_profile
@mem_profile
def memory_heavy():
    a = [i for i in range(1_000_000)]          # ~40MB (list + 1M int objects)
    b = {i: i ** 2 for i in range(1_000_000)}  # ~80MB
    del a                                       # frees ~40MB
    c = list(b.values())                        # ~8MB (pointers to ints already in b)
    return len(c)
# Run: python -m memory_profiler my_script.py
# Shows memory usage per line with increments
Step 2: Data Structures — Pick the Right One
The #1 performance fix is usually choosing the right data structure. Here's when each one wins:
| Operation | list | set | dict | deque |
|---|---|---|---|---|
| Lookup (x in) | O(n) ❌ | O(1) ✅ | O(1) ✅ | O(n) |
| Append | O(1) ✅ | O(1) | O(1) | O(1) ✅ |
| Prepend | O(n) ❌ | — | — | O(1) ✅ |
| Pop from end | O(1) | — | — | O(1) |
| Pop from start | O(n) ❌ | — | — | O(1) ✅ |
| Insert middle | O(n) | — | — | O(n) |
| Sort | O(n log n) | — | — | — |
# ❌ SLOW: checking membership in a list
users_list = list(range(1_000_000))
if 999_999 in users_list:  # O(n) — scans entire list
    pass
# ✅ FAST: checking membership in a set
users_set = set(range(1_000_000))
if 999_999 in users_set:  # O(1) — hash lookup
    pass
# Benchmark: set lookup is ~100,000x faster for 1M items
from collections import deque
# ❌ SLOW: using list as a queue
queue = []
for i in range(100_000):
    queue.insert(0, i)  # O(n) each time — shifts all elements
# ✅ FAST: using deque as a queue
queue = deque()
for i in range(100_000):
    queue.appendleft(i)  # O(1) each time
# deque is ~1000x faster for this pattern
__slots__ — reduce memory per object
# Without __slots__: each instance has a __dict__ (~200 bytes overhead)
class PointRegular:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
# With __slots__: no __dict__, ~64 bytes per instance
class PointSlotted:
    __slots__ = ('x', 'y', 'z')
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
# For 1M points: ~200MB vs ~64MB
# Use __slots__ when you have many instances of the same class
import sys
regular = PointRegular(1, 2, 3)
slotted = PointSlotted(1, 2, 3)
print(f"Regular: {sys.getsizeof(regular.__dict__) + sys.getsizeof(regular)} bytes")
print(f"Slotted: {sys.getsizeof(slotted)} bytes")
Step 3: Python-Specific Optimizations
Generators instead of lists
# ❌ Creates the entire list in memory (hundreds of MB for 10M Python ints)
squares = [x ** 2 for x in range(10_000_000)]
total = sum(squares)
# ✅ Generator — processes one item at a time (near-zero memory)
squares = (x ** 2 for x in range(10_000_000))
total = sum(squares)
# Same result, without ever holding the full list in memory
String concatenation
# ❌ SLOW: string concatenation in a loop
def concat_slow(n):
    result = ""
    for i in range(n):
        result += f"item_{i},"  # Creates a new string each time
    return result
# ✅ FAST: join a list
def concat_fast(n):
    parts = [f"item_{i}" for i in range(n)]
    return ",".join(parts)
# Benchmark with n=100,000:
# concat_slow: ~0.8s
# concat_fast: ~0.05s (16x faster)
Local variables are faster than globals
# Python looks up local variables ~20% faster than globals
import math
# ❌ Slower: global lookup on each iteration
def compute_slow(data):
    return [math.sqrt(x) for x in data]
# ✅ Faster: localize the function reference
def compute_fast(data):
    sqrt = math.sqrt  # Local reference
    return [sqrt(x) for x in data]
# Difference is small per call, but adds up in tight loops
Use built-in functions
# Built-ins are implemented in C — always faster than Python loops
data = list(range(1_000_000))
# ❌ Python loop
total = 0
for x in data:
    total += x
# ✅ Built-in sum() — 5-10x faster
total = sum(data)
# ❌ Python loop for max
max_val = data[0]
for x in data:
    if x > max_val:
        max_val = x
# ✅ Built-in max() — 5-10x faster
max_val = max(data)
# Same applies to: min(), any(), all(), sorted(), map(), filter()
Step 4: NumPy Vectorization
For numerical work, NumPy is 10-100x faster than pure Python. It operates on entire arrays in C.
import numpy as np
# ❌ Pure Python: ~2.5 seconds
def distances_python(points1, points2):
    result = []
    for p1, p2 in zip(points1, points2):
        d = sum((a - b) ** 2 for a, b in zip(p1, p2)) ** 0.5
        result.append(d)
    return result
# ✅ NumPy vectorized: ~0.02 seconds (125x faster)
def distances_numpy(points1, points2):
    p1 = np.array(points1)
    p2 = np.array(points2)
    return np.sqrt(np.sum((p1 - p2) ** 2, axis=1))
# Generate test data
n = 1_000_000
pts1 = [(i, i+1, i+2) for i in range(n)]
pts2 = [(i+3, i+4, i+5) for i in range(n)]
# Common NumPy speedups:
# ❌ Python loop to filter
filtered = [x for x in data if x > threshold]
# ✅ NumPy boolean indexing
arr = np.array(data)
filtered = arr[arr > threshold] # 50x faster
# ❌ Python loop for statistics
mean = sum(data) / len(data)
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
# ✅ NumPy
mean = np.mean(arr)
std = np.std(arr) # 20x faster
# ❌ Element-wise operations with a Python loop (x, y, z as lists)
result = [a * b + c for a, b, c in zip(x, y, z)]
# ✅ Vectorized (x, y, z as NumPy arrays)
result = x * y + z  # 100x faster
Step 5: Caching
from functools import lru_cache, cache
import time
# @cache (Python 3.9+) — unbounded cache
@cache
def fibonacci(n: int) -> int:
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
# Without cache: fibonacci(35) takes ~3 seconds
# With cache: fibonacci(35) takes ~0.00001 seconds
# @lru_cache — bounded cache with eviction
@lru_cache(maxsize=1024)
def expensive_query(user_id: int, date: str) -> dict:
    """Simulates a slow database query."""
    time.sleep(0.1)  # DB latency
    return {"user_id": user_id, "date": date, "data": "..."}
# First call: 100ms. Subsequent calls with same args: instant.
# Cache info:
print(expensive_query.cache_info())
# CacheInfo(hits=42, misses=10, maxsize=1024, currsize=10)
# TTL cache (time-based expiration)
import time
from functools import wraps
def ttl_cache(seconds: int = 300):
    """Cache with time-to-live expiration."""
    def decorator(fn):
        _cache = {}
        @wraps(fn)
        def wrapper(*args):
            now = time.time()
            if args in _cache:
                result, timestamp = _cache[args]
                if now - timestamp < seconds:
                    return result
            result = fn(*args)
            _cache[args] = (result, now)
            return result
        wrapper.clear = _cache.clear
        wrapper.cache = _cache
        return wrapper
    return decorator
@ttl_cache(seconds=60)
def get_exchange_rate(currency: str) -> float:
    """Fetches rate from API, cached for 60s."""
    import requests
    resp = requests.get(f"https://api.example.com/rates/{currency}")
    return resp.json()["rate"]
Step 6: Concurrency for I/O
If your code is slow because of network/disk I/O, not CPU — use async or threading. See our async programming guide and concurrency guide for deeper coverage.
import asyncio
import aiohttp
# ❌ Sequential: 10 API calls × 200ms = 2 seconds
def fetch_sequential(urls):
    import requests
    return [requests.get(url).json() for url in urls]
# ✅ Async concurrent: 10 API calls in ~200ms total
async def fetch_concurrent(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [session.get(url) for url in urls]
        responses = await asyncio.gather(*tasks)
        return [await r.json() for r in responses]
# ✅ With concurrency limit (don't DDoS the API)
async def fetch_limited(urls, max_concurrent=5):
    semaphore = asyncio.Semaphore(max_concurrent)
    async def fetch_one(session, url):
        async with semaphore:
            async with session.get(url) as resp:
                return await resp.json()
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *[fetch_one(session, url) for url in urls]
        )
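If you'd rather not go async, threads give a similar win for I/O-bound work, because blocking calls release the GIL while waiting on the network. A minimal sketch using the standard library's ThreadPoolExecutor, mirroring fetch_sequential above (fetch_threaded is a hypothetical name):
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_threaded(urls, max_workers=5):
    """Run blocking requests.get calls concurrently in a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        responses = pool.map(requests.get, urls)  # calls overlap while waiting on I/O
        return [r.json() for r in responses]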
Step 7: When Python Isn't Enough
Cython — compile Python to C
# primes.pyx (Cython file)
def find_primes_cy(int limit):
    """Find primes up to limit — Cython version."""
    cdef int i, j
    cdef bint is_prime
    cdef list result = []
    for i in range(2, limit):
        is_prime = True
        for j in range(2, int(i ** 0.5) + 1):
            if i % j == 0:
                is_prime = False
                break
        if is_prime:
            result.append(i)
    return result
# Compile: cythonize -i primes.pyx
# Usage: from primes import find_primes_cy
# ~30x faster than pure Python for this algorithm
multiprocessing — use all CPU cores
from multiprocessing import Pool
import os
def cpu_heavy_task(data_chunk):
    """CPU-intensive work that benefits from parallelism."""
    return sum(x ** 2 for x in data_chunk)

def parallel_process(data, n_workers=None):
    n_workers = n_workers or os.cpu_count()
    chunk_size = len(data) // n_workers
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with Pool(n_workers) as pool:
        results = pool.map(cpu_heavy_task, chunks)
    return sum(results)
# Sequential: ~4 seconds (1 core)
# Parallel with 4 cores: ~1.1 seconds (3.6x speedup)
if __name__ == "__main__":  # guard required: worker processes re-import this module
    data = list(range(10_000_000))
    result = parallel_process(data)
Optimization Checklist
- Profile first — find the actual bottleneck with cProfile or line_profiler
- Right data structure — set for lookups, deque for queues, dict for key-value
- Built-in functions — sum, max, sorted, map are C-speed
- Generators — for large datasets, process items lazily
- Caching — @lru_cache for pure functions with repeated calls
- NumPy — vectorize numerical operations
- Async I/O — concurrent network/disk operations
- multiprocessing — use all CPU cores for CPU-bound work
- Cython/C extensions — last resort for hot loops
- Better algorithm — an O(n log n) approach beats a micro-optimized O(n²) one on any reasonably large input (see the sketch after this list)
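A quick illustration of that last point, using a hypothetical duplicate-finding task (the function names are made up for the example):
# ❌ O(n²): scan all preceding items for every element
def find_duplicates_quadratic(items):
    return [x for i, x in enumerate(items) if x in items[:i]]

# ✅ O(n): remember what we've already seen in a set
def find_duplicates_linear(items):
    seen, dupes = set(), []
    for x in items:
        if x in seen:
            dupes.append(x)
        else:
            seen.add(x)
    return dupes

# For 100,000 items the set-based version finishes in milliseconds,
# while the quadratic one can take minutes. No micro-optimization closes that gap.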
Related Articles
- Python Async Programming — speed up I/O-bound code with asyncio
- Python Concurrency — threading, multiprocessing, and async compared
- Build a Data Pipeline in Python — optimized ETL patterns
- Python Logging & Monitoring — track performance in production
- Python Testing Guide — benchmark tests with pytest-benchmark
Need help optimizing a slow Python application? I profile, tune, and rebuild performance-critical code. Reach out on Telegram →