Error Handling Patterns for Agentic Systems

Trigger

API returns error / timeout / rate limit

External service is unavailable, slow, or rejecting requests. Clear error signal.

Detection

Easy

Detect

HTTP error codes, timeout exceptions, connection errors

Standard error handling catches these. The challenge is deciding what the agent should do next.

Recovery Patterns

Retry with backoff

Exponential backoff for transient failures. Set max retries to prevent loops.

Circuit breaker

After N failures, stop trying. Route to fallback or fail fast with clear error.

Alternative tool

Switch to a backup data source or different API endpoint.

Graceful degradation

Produce partial result without the failed tool's contribution. Flag the gap.

Trigger

Agent misinterprets tool response or generates wrong output

Tool succeeds. Model processes result. Interpretation is wrong. No error signal — system reports success.

Detection

Hard

Detect

Output validation, fact-checking against source data, confidence scoring

Requires domain-specific validation layers that check meaning, not just format.

Recovery Patterns

Context-adjusted retry

Re-prompt with additional guidance about the specific misinterpretation.

Deterministic fallback

Fall back to rule-based parser or template when LLM extraction fails validation.

Human escalation

Surface the output with validation failure details for human review.

Multi-model verification

Cross-check with a second model or different prompt strategy.

Trigger

Context window fills with tool responses and conversation history

Progressive quality erosion. Agent loses track of earlier information. Contradicts prior statements. Tool selections become less precise.

Detection

Moderate

Detect

Latency trends, tool selection patterns, output verbosity changes

Behavioral monitoring over the session. No single indicator — combined signal detection.

Recovery Patterns

Context summarization

Periodically compress conversation history while preserving key facts.

Session boundaries

Split long tasks into bounded sessions with explicit state handoff.

Selective retention

Keep only the most relevant prior context. Drop completed sub-task details.

External state

Persist critical state outside the context window. Agent retrieves as needed.

Trigger

Agent A produces incorrect output → Agent B builds on it → Agent C delivers wrong result

Error amplifies through the chain. Each agent adds confidence. Final output is coherent, well-structured, and wrong.

Detection

Very Hard

Detect

Schema validation at handoffs, cross-agent fact verification, distributed tracing

Requires validation at every agent boundary. Cross-agent tracing to find the origin point.

Recovery Patterns

Boundary validation

Schema and semantic validation at every inter-agent handoff point.

Independent verification

Downstream agents re-derive key facts from their own tools before accepting upstream claims.

Checkpoint rollback

Resume from the last validated checkpoint rather than the point of failure.

Compensation

Issue corrections for committed side effects. Track which effects need compensating.

Trigger

Retry loops, re-planning cycles, unbounded conversations

System functions correctly at each step. Accumulated cost far exceeds value. No functional error to catch.

Detection

Moderate

Detect

Token budget tracking, step counters, wall-clock timeouts

Explicit budgets per task, per agent, and per session. Alert at thresholds.

Recovery Patterns

Token budgets

Hard limits per task and agent. Produce best-effort output when budget is reached.

Step limits

Maximum reasoning steps and tool calls per task. Prevents infinite loops.

Conversation bounds

Maximum turns in agent-to-agent negotiation. Escalate if no convergence.

Cost circuit breaker

System-wide spend threshold that pauses non-critical work automatically.