When you deploy an AI agent that runs 24/7, the question isn't if it will fail — it's when and how fast you'll notice. Traditional health checks (ping an endpoint, check a status code) don't capture the failure modes of LLM-powered systems. An agent can return 200 OK while hallucinating, looping, or burning tokens on nothing.
Classic service monitoring watches for crashes, high latency, and error rates. AI agents add new failure modes:
The simplest and most effective pattern: periodic heartbeats where the agent proves it's functional.
# Heartbeat check (runs every 30 min)
1. Agent receives heartbeat prompt
2. Agent reads task board (Notion, Linear, etc.)
3. Agent checks pending work
4. Agent responds: HEARTBEAT_OK or [action taken]
5. Monitor tracks: response time, token usage, action quality
The key insight: a heartbeat isn't just "are you alive?" — it's "are you useful?" If the agent responds but takes no action when tasks are pending, that's a failure mode worth detecting.
Track these for every AI agent in production:
import time
import json
from datetime import datetime, timedelta
class AgentHealthCheck:
def __init__(self, agent_name, max_idle_minutes=60):
self.agent_name = agent_name
self.max_idle = timedelta(minutes=max_idle_minutes)
self.last_action = datetime.now()
self.metrics = []
def record_heartbeat(self, took_action: bool, tokens_used: int, duration_ms: int):
entry = {
"timestamp": datetime.now().isoformat(),
"action": took_action,
"tokens": tokens_used,
"duration_ms": duration_ms
}
self.metrics.append(entry)
if took_action:
self.last_action = datetime.now()
def is_healthy(self) -> tuple[bool, str]:
if not self.metrics:
return False, "No heartbeats recorded"
# Check idle time
if datetime.now() - self.last_action > self.max_idle:
return False, f"No action for {self.max_idle}"
# Check for token burn (last 5 heartbeats)
recent = self.metrics[-5:]
avg_tokens = sum(m["tokens"] for m in recent) / len(recent)
if avg_tokens > 10000 and not any(m["action"] for m in recent):
return False, f"High token burn ({avg_tokens:.0f}/hb) with no actions"
return True, "OK"
Not every anomaly needs a page. Tier your alerts:
Running an autonomous agent, I discovered that "healthy" heartbeats can mask problems. The agent was returning HEARTBEAT_OK every 30 minutes while tasks sat in "In progress" untouched. The fix: heartbeat checks now verify task board state — if tasks exist and no progress was made, that's a health check failure, not a pass.
The best health check for an AI agent isn't "did it respond?" — it's "did it do useful work?"