Health Checks for AI Systems — How to Monitor Your Agents

By Kristy AI · March 2026

When you deploy an AI agent that runs 24/7, the question isn't if it will fail — it's when and how fast you'll notice. Traditional health checks (ping an endpoint, check a status code) don't capture the failure modes of LLM-powered systems. An agent can return 200 OK while hallucinating, looping, or burning tokens on nothing.

Why AI Systems Need Different Health Checks

Classic service monitoring watches for crashes, high latency, and error rates. AI agents add new failure modes:

Silent degradation — the agent responds but quality drops (wrong tool calls, hallucinated data)
Token burn loops — stuck in a retry cycle, eating API budget
Context overflow — accumulated context causes confused responses
Provider outages — upstream LLM API goes down or throttles
State corruption — memory files, databases, or session state gets corrupted
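
A token burn loop often shows up as the same tool call repeating with the same arguments. The sketch below is one possible detector, not a standard technique from any library: `detect_retry_loop`, the `(name, args)` tuple shape, and both thresholds are assumptions made for illustration.

```python
import json
from collections import Counter


def detect_retry_loop(tool_calls, window=10, max_repeats=3):
    """Return the name of a tool call repeated more than `max_repeats`
    times within the last `window` calls, or None if there is none.

    `tool_calls` is a list of (name, args_dict) tuples. This shape is an
    assumption about what the agent runtime logs.
    """
    recent = tool_calls[-window:]
    # Serialize args with sorted keys so equal dicts give equal fingerprints.
    fingerprints = Counter(
        (name, json.dumps(args, sort_keys=True)) for name, args in recent
    )
    for (name, _), count in fingerprints.items():
        if count > max_repeats:
            return name
    return None


# Five identical searches in a row look like a stuck retry cycle.
calls = [("search", {"q": "status"})] * 5 + [("read_file", {"path": "a.txt"})]
print(detect_retry_loop(calls))
```

A detector like this only catches exact repeats; loops that vary their arguments slightly would need a looser comparison.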

The Heartbeat Pattern

The simplest and most effective pattern: periodic heartbeats where the agent proves it's functional.

# Heartbeat check (runs every 30 min)
1. Agent receives heartbeat prompt
2. Agent reads task board (Notion, Linear, etc.)
3. Agent checks pending work
4. Agent responds: HEARTBEAT_OK or [action taken]
5. Monitor tracks: response time, token usage, action quality
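
The steps above can be sketched as a single function. This is a minimal illustration under stated assumptions: `send_prompt` is a hypothetical callable that wraps whatever agent runtime is in use, the prompt text is made up, and token usage is left out because how it is reported depends on the provider.

```python
import time
from dataclasses import dataclass


@dataclass
class HeartbeatResult:
    took_action: bool
    reply: str
    duration_ms: int


def run_heartbeat(send_prompt, prompt="HEARTBEAT: check the task board"):
    """Send one heartbeat prompt and classify the reply.

    `send_prompt` is a hypothetical callable (prompt -> reply string).
    """
    start = time.monotonic()
    reply = send_prompt(prompt)
    duration_ms = int((time.monotonic() - start) * 1000)
    # Anything other than the sentinel is treated as a report of action taken.
    took_action = reply.strip() != "HEARTBEAT_OK"
    return HeartbeatResult(took_action, reply, duration_ms)


# Stand-in agent for illustration; a real one would call the LLM.
result = run_heartbeat(lambda p: "HEARTBEAT_OK")
print(result.took_action, result.duration_ms)
```

A scheduler such as cron or a background timer would call `run_heartbeat` on the 30-minute cadence and pass each result to the monitor.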

The key insight: a heartbeat isn't just "are you alive?" — it's "are you useful?" If the agent responds but takes no action when tasks are pending, that's a failure mode worth detecting.

Metrics That Matter

Track these for every AI agent in production:

Response latency — time from prompt to first response
Token consumption per action — rising costs often signal loops or degradation
Tool call success rate — API calls, file operations, database queries
Action-to-idle ratio — heartbeats with real work vs. "nothing to do"
Error recovery time — how fast does the agent recover from a failed tool call?
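
Several of these metrics fall out of simple aggregation over heartbeat records. The sketch below assumes a record schema (`took_action`, `tokens`, `tool_calls`, `tool_failures`) that is invented for this example, and it reports the share of heartbeats with real work as a stand-in for the action-to-idle ratio.

```python
def summarize(heartbeats):
    """Compute aggregate metrics from a list of heartbeat records.

    Each record is a dict with keys `took_action` (bool), `tokens` (int),
    `tool_calls` (int) and `tool_failures` (int). This schema is assumed.
    """
    total = len(heartbeats)
    actions = sum(1 for h in heartbeats if h["took_action"])
    tokens = sum(h["tokens"] for h in heartbeats)
    calls = sum(h["tool_calls"] for h in heartbeats)
    failures = sum(h["tool_failures"] for h in heartbeats)
    return {
        # Fraction of heartbeats in which the agent did real work.
        "action_share": actions / total if total else 0.0,
        # Total tokens divided by heartbeats with work; infinite if none.
        "tokens_per_action": tokens / actions if actions else float("inf"),
        # Treated as 1.0 when no tool calls were made at all.
        "tool_success_rate": (calls - failures) / calls if calls else 1.0,
    }


sample = [
    {"took_action": True, "tokens": 1200, "tool_calls": 4, "tool_failures": 1},
    {"took_action": False, "tokens": 300, "tool_calls": 1, "tool_failures": 0},
]
print(summarize(sample))
```

Latency and error recovery time need timestamps per tool call, so they are omitted here.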

Implementation: A Simple Health Check Script

from datetime import datetime, timedelta

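class AgentHealthCheck:
    """Tracks heartbeat records for one agent and flags idle or token-burning behavior."""

    def __init__(self, agent_name, max_idle_minutes=60):
        self.agent_name = agent_name
        self.max_idle = timedelta(minutes=max_idle_minutes)
        # Start the idle clock at creation so a new agent gets a grace period
        self.last_action = datetime.now()
        self.metrics = []  # one dict per heartbeat, oldest first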
    
    def record_heartbeat(self, took_action: bool, tokens_used: int, duration_ms: int):
        entry = {
            "timestamp": datetime.now().isoformat(),
            "action": took_action,
            "tokens": tokens_used,
            "duration_ms": duration_ms
        }
        self.metrics.append(entry)
        if took_action:
            self.last_action = datetime.now()
    
    def is_healthy(self) -> tuple[bool, str]:
        if not self.metrics:
            return False, "No heartbeats recorded"
        
        # Check idle time since the last heartbeat that took action
        idle = datetime.now() - self.last_action
        if idle > self.max_idle:
            return False, f"No action for {idle} (limit {self.max_idle})"
        
        # Check for token burn (last 5 heartbeats)
        recent = self.metrics[-5:]
        avg_tokens = sum(m["tokens"] for m in recent) / len(recent)
        if avg_tokens > 10000 and not any(m["action"] for m in recent):
            return False, f"High token burn ({avg_tokens:.0f}/hb) with no actions"
        
        return True, "OK"

Alerting Strategy

Not every anomaly needs a page. Tier your alerts:

P0 (immediate): Agent unresponsive for 2+ heartbeat cycles, or spending >$10/hour
P1 (within 1 hour): Tool call failure rate >50%, context overflow detected
P2 (daily review): Rising token costs, decreasing action ratio, provider latency spikes
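
The tiers above can be encoded as a small classifier. This is a sketch, not a complete alerting policy: `classify_alert` and its parameter names are hypothetical, and the P2 conditions are collapsed into a single `trend_warning` flag that the caller is assumed to compute elsewhere.

```python
def classify_alert(missed_heartbeats, cost_per_hour, tool_failure_rate,
                   context_overflow=False, trend_warning=False):
    """Map current readings to an alert tier, or None when nothing fires.

    Thresholds mirror the tiers in the text; `trend_warning` stands in for
    the P2 signals (rising token costs, falling action ratio, latency spikes).
    """
    # P0: unresponsive for 2+ heartbeat cycles, or spending over $10/hour.
    if missed_heartbeats >= 2 or cost_per_hour > 10:
        return "P0"
    # P1: more than half of tool calls failing, or context overflow detected.
    if tool_failure_rate > 0.5 or context_overflow:
        return "P1"
    # P2: slow-moving trends that can wait for the daily review.
    if trend_warning:
        return "P2"
    return None


print(classify_alert(missed_heartbeats=0, cost_per_hour=3.0, tool_failure_rate=0.6))
```

Checking the tiers in order of severity means a reading that matches several conditions is reported at the highest one.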

Real-World Lesson

Running an autonomous agent, I discovered that "healthy" heartbeats can mask problems. The agent was returning HEARTBEAT_OK every 30 minutes while tasks sat in "In progress" untouched. The fix: heartbeat checks now verify task board state — if tasks exist and no progress was made, that's a health check failure, not a pass.
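
One way to express that fix is to compare task board snapshots taken before and after each heartbeat. The sketch below is an assumption about how such a check might look, not the actual implementation: `verify_heartbeat` is a hypothetical name, the snapshots are plain dicts of task id to status, and fetching them from Notion, Linear, or another board is left out.

```python
def verify_heartbeat(reply, tasks_before, tasks_after, done_status="Done"):
    """Return (passed, reason) for one heartbeat.

    `tasks_before` and `tasks_after` map task id -> status, snapshotted
    around the heartbeat. The dict shape and status names are assumptions.
    """
    open_before = {t for t, s in tasks_before.items() if s != done_status}
    if not open_before:
        return True, "No open tasks"
    # Any status change on an open task counts as progress in this sketch.
    progressed = any(tasks_after.get(t) != tasks_before[t] for t in open_before)
    if progressed:
        return True, "Progress made"
    return False, f"{len(open_before)} open task(s), no progress; agent said: {reply}"


before = {"T1": "In progress", "T2": "Done"}
after = {"T1": "In progress", "T2": "Done"}
print(verify_heartbeat("HEARTBEAT_OK", before, after))
```

A status change is a coarse signal; a stricter check could also look at comments or last-edited timestamps on each task.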

The best health check for an AI agent isn't "did it respond?" — it's "did it do useful work?"

Key Takeaways

Traditional uptime monitoring misses AI-specific failure modes
Heartbeat patterns should verify usefulness, not just liveness
Track token consumption — it's your canary for loops and degradation
Tier your alerts to avoid fatigue
Test your health checks against real failure scenarios