Running AI agents 24/7 gets expensive fast. With Claude Opus at $15/M input tokens and GPT-4 at $30/M, a chatty agent can easily burn $50-100 a day. Here's how to cut that by 80% without noticeable quality loss.
Not every request needs your most expensive model. Route by complexity:
# Simple classification to pick the right model
def route_request(prompt, tools_needed):
    if tools_needed and len(tools_needed) > 3:
        return "claude-opus"    # Complex multi-tool reasoning
    elif len(prompt) > 10000:
        return "claude-sonnet"  # Long context, medium complexity
    elif is_simple_query(prompt):
        return "claude-haiku"   # Simple Q&A, formatting, routing
    else:
        return "claude-sonnet"  # Default middle tier
# Cost impact (input-token pricing):
# Before: 100% Opus = $15/M
# After: 20% Opus + 50% Sonnet + 30% Haiku = ~$5/M (67% savings)
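The route_request function above leans on an is_simple_query helper that isn't shown. A minimal heuristic sketch could look like the following; the length cutoff and opener list are placeholder assumptions to tune against your own traffic, not part of the routing logic itself:

def is_simple_query(prompt: str) -> bool:
    """Cheap heuristic: short, code-free, lookup-style prompts can go to the smallest model."""
    simple_openers = ("what is", "define", "translate", "summarize", "format")
    text = prompt.strip().lower()
    return (
        len(text) < 500             # short prompt
        and "```" not in text       # no code blocks
        and text.startswith(simple_openers)
    )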
If your system prompt is 5000 tokens and you make 100 calls/hour, that's 500K tokens/hour just on the system prompt. For requests you've already answered, caching skips the API call entirely:
import hashlib
import json

import redis

redis_client = redis.Redis()  # assumes a local Redis instance

def cached_completion(prompt, model, cache_ttl=3600):
    # Key on model + prompt so the same prompt on different models doesn't collide
    cache_key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)  # Cache hit: free!
    response = llm.complete(prompt, model=model)
    redis_client.setex(cache_key, cache_ttl, json.dumps(response))
    return response
Response caching works great for deterministic queries (data lookups, formatting, translations). Less useful for creative or context-dependent responses.
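A quick usage sketch, reusing the cached_completion helper above (the prompt and model name are just illustrations):

# First call hits the API and stores the result for an hour
answer = cached_completion("Convert 2024-03-01 to ISO week number", "claude-haiku")

# An identical request within the TTL is served from Redis at zero token cost
answer_again = cached_completion("Convert 2024-03-01 to ISO week number", "claude-haiku")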
Instead of one API call per task, batch multiple tasks into one call:
# Bad: 10 API calls
for item in items:
    result = llm.complete(f"Classify this: {item}")

# Good: 1 API call
prompt = "Classify each item:\n" + "\n".join(f"{i+1}. {item}" for i, item in enumerate(items))
results = llm.complete(prompt)
# Parse numbered results
Batching reduces per-request overhead and often produces better results (model sees patterns across items).
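The "parse numbered results" step above does real work. A simple sketch, assuming the model answers with one numbered line per item (an assumption worth validating before trusting the output):

import re

def parse_numbered_results(text, expected_count):
    # Pull "1. label" or "1) label" style lines out of the model's response
    labels = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+)[.)]\s*(.+)", line)
        if match:
            labels[int(match.group(1))] = match.group(2).strip()
    # Fail loudly if the model skipped, merged, or renumbered items
    if sorted(labels) != list(range(1, expected_count + 1)):
        raise ValueError(f"Expected items 1-{expected_count}, got {sorted(labels)}")
    return [labels[i] for i in sorted(labels)]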
Running an autonomous AI agent with these optimizations: