The most honest thing you can say about error handling in agentic systems is that most production deployments do not have enough of it. The reason is understandable: during development, the happy path works remarkably well. The LLM reasons correctly, the tools return clean data, the output looks right. You ship it. Then production happens — messy inputs, unavailable APIs, rate limits, stale data, ambiguous instructions, and the model’s creative interpretation of edge cases that never appeared in your test suite. The system does not crash. It does something worse: it produces confident, plausible output that is partially or entirely wrong, and nobody notices until a customer complains or an audit reveals the problem.
Error handling for agents requires different patterns than traditional software because the failure modes are different. An API that returns a 500 is a problem you can catch, log, and retry. An agent that misinterprets a tool response and confidently proceeds with a wrong premise is a problem that looks like success until it is not. This post covers the failure taxonomy, the recovery patterns, and the architectural decisions that make the difference between a system that fails gracefully and one that fails silently.
The Failure Taxonomy
Before you can handle errors, you need to understand how agent systems actually fail. The taxonomy is broader than most teams expect.
Tool failures are the most familiar category, and the easiest to handle. An API returns an error, a database connection times out, an external service is unavailable. These produce clear signals — HTTP error codes, timeout exceptions, connection refused. Traditional error handling patterns (retry with backoff, circuit breakers, fallback data sources) apply directly. The challenge is not detecting these failures but deciding what the agent should do about them. A single failed tool call might be retryable, skippable, or process-ending depending on the context — and the agent needs to know the difference.
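To make the retryable/skippable/process-ending distinction concrete, here is a minimal sketch. The status-code classification and the `tool` callable signature are illustrative assumptions, not a prescription — the point is that the retry loop encodes an explicit policy rather than retrying everything blindly.

```python
import time

# Illustrative classification: which failures are transient enough to
# retry, and which can be skipped without ending the process.
RETRYABLE = {429, 500, 502, 503, 504}
SKIPPABLE = {404}

def call_with_retry(tool, *, max_attempts=3, base_delay=1.0):
    """Retry a tool call with exponential backoff, but only for
    failures that are plausibly transient. `tool` is assumed to
    return a (status_code, payload) pair."""
    for attempt in range(max_attempts):
        status, payload = tool()
        if status == 200:
            return payload
        if status in SKIPPABLE:
            return None  # proceed without this data
        if status not in RETRYABLE:
            raise RuntimeError(f"non-retryable tool error: {status}")
        time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("tool still failing after retries")
```

The classification tables are where the context-dependence lives: the same 404 that is skippable during enrichment may be process-ending during a compliance check, so in practice these sets are per-task, not global.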
Semantic failures are the signature challenge of agentic systems. The tool succeeds, the model processes the response, but the interpretation is wrong. The agent queries a knowledge base and retrieves a document that is superficially relevant but actually about a different product version. The agent calls a calculation API correctly but misinterprets the unit of the result. The agent generates a customer response that is grammatically perfect, factually grounded in retrieved data, and completely misses the point of the customer’s question. There is no error code for semantic failure. The system reports success. Detection requires validation layers that check not just the format of the output but its meaning — a fundamentally harder problem.
Context degradation is a progressive failure mode. As a conversation extends, as tool responses accumulate, as intermediate reasoning fills the context window, the agent’s performance degrades. It starts losing track of earlier information. It begins contradicting its own prior statements. Its tool selections become less precise. This is not a sudden failure — it is a gradual erosion of quality that may not be apparent in any individual response but produces a visibly worse outcome by the end of a long interaction. Context degradation is the reason that agents that work flawlessly on short tasks can fail badly on long-running processes.
Cascade failures occur in multi-agent systems when one agent’s bad output becomes another agent’s confident input. Agent A extracts data incorrectly. Agent B receives that data and, having no reason to question it, builds an analysis on the wrong foundation. Agent C summarizes Agent B’s analysis for the end user. The error originated in Agent A, but it manifests as a coherent, well-structured, utterly wrong report from Agent C. The challenge is not just detecting the error but tracing it back through the cascade to its origin — which requires the kind of cross-agent observability that most systems do not build until after a cascade incident teaches them the hard way.
Cost runaway is a failure mode unique to LLM-based systems. An agent enters a retry loop, each retry consuming tokens. A hierarchical orchestrator repeatedly decomposes and re-plans a task that is fundamentally unsolvable with the available tools. A conversation-based negotiation between two agents loops without converging. The system is not broken in any functional sense — every individual step is executing correctly — but the accumulated cost far exceeds the value of the task. Without explicit budgets and circuit breakers, a single runaway agent can consume thousands of dollars in API costs before anyone notices.
Detection Patterns
Detecting tool failures requires nothing novel — standard error handling, timeouts, and health checks suffice. The harder categories require purpose-built detection mechanisms.
Output validation catches semantic failures at the point of generation. For structured output (JSON, specific formats), schema validation is a minimum baseline. For natural language output, validation is harder but not impossible: consistency checks against source data, fact extraction and verification against retrieved documents, and confidence scoring that flags low-certainty outputs for review. The key design decision is where to place the validation — inline (after every agent step, adding latency but catching errors early) or at the boundary (validating only the final output, cheaper but allowing errors to propagate internally).
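A boundary validator along these lines might look like the following sketch. The schema, the citation-consistency rule, and the confidence threshold are all illustrative assumptions about what the agent emits; the structure — format check first, then meaning checks against retrieved data — is the point.

```python
import json

# Hypothetical output schema; field names are illustrative.
REQUIRED = {"answer": str, "sources": list, "confidence": float}

def validate_output(raw: str, retrieved_ids: set):
    """Returns (ok, parsed_or_reason). Format checks first, then
    semantic checks against the documents actually retrieved."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, f"malformed JSON: {e}"
    for field, typ in REQUIRED.items():
        if not isinstance(data.get(field), typ):
            return False, f"missing or mistyped field: {field}"
    # Consistency check: every cited source must have been retrieved.
    uncited = set(data["sources"]) - retrieved_ids
    if uncited:
        return False, f"cites documents never retrieved: {uncited}"
    if data["confidence"] < 0.5:
        return False, "low confidence, route to review"
    return True, data
```

Run inline, this function sits after every agent step; run at the boundary, only on the final output. The code is identical either way — the placement decision is purely about the latency/propagation trade-off described above.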
Behavioral monitoring detects context degradation and runaway patterns. Track metrics that correlate with quality: response latency trends within a session (increasing latency often correlates with context window pressure), tool selection patterns (an agent that starts calling irrelevant tools is likely losing contextual coherence), and output length trends (responses that grow increasingly verbose or increasingly terse may indicate degradation). None of these metrics is definitive individually, but in combination they provide a reasonable signal for automated intervention.
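A session monitor combining two of these signals could be sketched as follows. The window size, the latency-slope threshold, and the irrelevant-tool-call count are illustrative placeholders — real thresholds have to be tuned against your own baselines.

```python
from collections import deque

class SessionMonitor:
    """Tracks per-session signals that correlate with context
    degradation. Thresholds are illustrative, not recommendations."""
    def __init__(self, window=5, latency_slope_limit=0.5):
        self.latencies = deque(maxlen=window)
        self.latency_slope_limit = latency_slope_limit
        self.irrelevant_tool_calls = 0

    def record_step(self, latency_s, tool_was_relevant=True):
        self.latencies.append(latency_s)
        if not tool_was_relevant:
            self.irrelevant_tool_calls += 1

    def should_intervene(self) -> bool:
        # Crude trend: newest vs oldest latency across the window.
        if len(self.latencies) == self.latencies.maxlen:
            slope = (self.latencies[-1] - self.latencies[0]) / len(self.latencies)
            if slope > self.latency_slope_limit:
                return True
        return self.irrelevant_tool_calls >= 3
```

"Intervene" here might mean summarizing and truncating the context, restarting the session from a checkpoint, or escalating — the monitor only provides the trigger.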
Cross-agent validation addresses cascade failures. When Agent B receives output from Agent A, it should not accept that output uncritically. The simplest pattern is schema validation at every handoff point — if Agent A’s output does not conform to the expected schema, the handoff fails explicitly rather than propagating garbage. A more robust pattern is semantic validation: Agent B re-derives key facts from its own tools before accepting Agent A’s claims. This is expensive (it duplicates some of Agent A’s work) but provides a powerful check against cascade contamination.
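Both patterns — schema validation at the handoff and semantic re-derivation — can be sketched as a handoff guard. The field names and the `rederive_total` callable are hypothetical stand-ins for whatever facts and tools your agents actually work with.

```python
class HandoffError(Exception):
    """Raised so the handoff fails explicitly instead of
    propagating garbage downstream."""

# Illustrative schema for Agent A's output.
EXPECTED_FIELDS = {"customer_id": str, "total": float, "currency": str}

def accept_handoff(payload: dict) -> dict:
    """Cheap check: reject anything that doesn't match the schema."""
    for field, typ in EXPECTED_FIELDS.items():
        if field not in payload:
            raise HandoffError(f"upstream output missing {field!r}")
        if not isinstance(payload[field], typ):
            raise HandoffError(f"{field!r} has wrong type")
    return payload

def accept_with_rederivation(payload: dict, rederive_total) -> dict:
    """Expensive check: re-derive a key fact with our own tools
    and compare before trusting the upstream claim."""
    accept_handoff(payload)
    independent = rederive_total(payload["customer_id"])
    if abs(independent - payload["total"]) > 0.01:
        raise HandoffError("upstream total disagrees with re-derived value")
    return payload
```

In practice, semantic re-derivation is usually reserved for the one or two facts whose corruption would be most expensive, while the schema check runs on everything.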
Budget enforcement prevents cost runaway. Set explicit token budgets per task, per agent, and per session. When a budget threshold is reached, the system should not simply fail — it should produce the best possible output with the work completed so far and flag the budget exhaustion for review. This is analogous to timeout handling: the goal is graceful degradation rather than abrupt termination.
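A minimal per-task budget might look like this sketch. The warning threshold gives the agent room to wrap up gracefully before hard exhaustion; the 80% fraction is an illustrative choice.

```python
class TokenBudget:
    """Per-task token budget. On "exhausted" the caller should return
    the best partial result and flag the exhaustion, not crash."""
    def __init__(self, limit_tokens: int, warn_fraction: float = 0.8):
        self.limit = limit_tokens
        self.warn_at = int(limit_tokens * warn_fraction)
        self.used = 0

    def charge(self, tokens: int) -> str:
        """Record usage and return the budget status."""
        self.used += tokens
        if self.used >= self.limit:
            return "exhausted"   # stop; summarize work done so far
        if self.used >= self.warn_at:
            return "warning"     # start wrapping up
        return "ok"
```

The same class can be instantiated at three scopes — per task, per agent, per session — with whichever one trips first winning, mirroring how nested timeouts compose.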
Recovery Patterns
Detection without recovery is just observability. The recovery patterns determine whether errors result in failed tasks, degraded outputs, or corrected results.
Retry with context adjustment. A naive retry sends the same input to the same model and hopes for a different result. This sometimes works because LLM output is non-deterministic, but it is a poor strategy for systematic failures. A better retry pattern adjusts the context: provide additional guidance about the previous failure, simplify the tool response that caused confusion, reduce the scope of the task to something within the agent’s reliable capability range. This is the error handling equivalent of task decomposition — when the full task fails, try a smaller version of it.
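The adjusted-retry loop can be sketched as follows. `call_model` and `validate` are stand-ins for your actual LLM client and validation layer; the feedback phrasing is illustrative.

```python
def retry_with_adjustment(call_model, task: str, validate, max_attempts=3):
    """Each retry folds the previous validation failure back into the
    prompt instead of resending the same input verbatim."""
    feedback = ""
    for _ in range(max_attempts):
        prompt = task if not feedback else (
            f"{task}\n\nYour previous attempt failed validation: "
            f"{feedback}\nCorrect that specific problem and answer again."
        )
        output = call_model(prompt)
        ok, feedback = validate(output)
        if ok:
            return output
    return None  # systematic failure, not transient noise: escalate
```

Returning `None` after exhausting attempts is deliberate: if three context-adjusted retries all fail the same validation, the failure is systematic, and the right next move is a fallback or escalation rather than a fourth attempt.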
Fallback to simpler methods. When an agent cannot reliably handle a task, fall back to a deterministic alternative. An agent that is supposed to extract structured data from a document might fall back to a rule-based parser when the LLM extraction fails validation. A customer-facing agent that cannot generate a satisfactory response might fall back to retrieving a templated response from a knowledge base. The fallback is less flexible than the agent, but it is reliable and predictable — and reliability matters more than flexibility when the system has already demonstrated that it cannot handle the flexible approach.
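The document-extraction example above might be wired up like this sketch, where `llm_extract` is a stand-in for the agent and the regex is the deterministic fallback. The invoice format is, of course, hypothetical.

```python
import re

def extract_invoice_total(text: str, llm_extract, validate):
    """Try LLM extraction first; fall back to a rule-based parser
    when the LLM's answer fails validation. Returns (value, method)."""
    result = llm_extract(text)
    if validate(result):
        return result, "llm"
    # Deterministic fallback: far less flexible, entirely predictable.
    m = re.search(r"Total:\s*\$?([0-9]+\.[0-9]{2})", text)
    if m:
        return float(m.group(1)), "rule_based"
    return None, "failed"
```

Returning the method alongside the value matters operationally: a rising rate of `"rule_based"` results is itself a degradation signal worth alerting on, even though every individual request succeeded.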
Human escalation. Some errors require human judgment, and the system should be designed to surface them efficiently. The escalation should include the context the human needs to make a decision: the original input, the agent’s attempted output, the validation failure that triggered the escalation, and the agent’s intermediate reasoning if available. A well-designed escalation turns a potential customer-facing failure into a delayed but correct response. A poorly designed escalation dumps an incomprehensible log on a human who has no context and no time. Autonomy borders should define exactly when escalation happens and what information accompanies it.
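As a sketch, a well-formed escalation payload might carry exactly the fields listed above — the names here are illustrative, but the principle is that the reviewer should never have to go digging:

```python
import time

def build_escalation(original_input, attempted_output,
                     validation_failure, reasoning_trace=None):
    """Package everything a human reviewer needs in one record.
    Field names are illustrative, not a standard."""
    return {
        "created_at": time.time(),
        "original_input": original_input,
        "attempted_output": attempted_output,
        "validation_failure": validation_failure,
        "reasoning_trace": reasoning_trace or [],
        "status": "pending_review",
    }
```

The difference between this and dumping a raw log is that each field answers a reviewer's question in order: what came in, what the agent tried, why the system objected, and how the agent got there.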
Compensation and correction. When an error is detected after side effects have been committed — a message has been sent, a record has been updated, an action has been taken — the recovery pattern is compensation rather than rollback. Send a correction. Update the record again. Notify the affected parties. This requires that the system tracks which side effects each agent step has committed, so that the compensation logic knows what needs to be corrected. If you cannot trace from an error back to its committed side effects, you cannot compensate for it.
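The tracking requirement can be sketched as a side-effect journal: each committed effect is recorded together with a compensator that knows how to correct it. The journal interface here is illustrative.

```python
class SideEffectJournal:
    """Records committed side effects per step so compensation logic
    knows what needs correcting, and in what order."""
    def __init__(self):
        self.entries = []  # (step_id, description, compensator)

    def record(self, step_id: int, description: str, compensator):
        """Call this immediately after a side effect commits.
        `compensator` is a zero-arg callable that corrects it."""
        self.entries.append((step_id, description, compensator))

    def compensate_from(self, failed_step_id: int):
        """Run compensators for the failed step and everything after
        it, newest first (reverse commit order)."""
        corrected = []
        for step_id, desc, comp in reversed(self.entries):
            if step_id >= failed_step_id:
                comp()
                corrected.append(desc)
        return corrected
```

Reverse order matters for the same reason it does in saga patterns: later effects often depend on earlier ones, so corrections unwind from newest to oldest.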
Circuit breakers. When a particular tool, agent, or coordination path repeatedly fails, stop trying it. The circuit breaker pattern from distributed systems applies directly: after N failures within a time window, mark the component as unavailable and route around it (if alternatives exist) or fail fast with a clear error (if they do not). This prevents the retry-and-fail loop that converts a single component failure into a system-wide latency and cost problem.
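A minimal breaker for a tool or agent endpoint could look like this sketch; the threshold, window, and cooldown values are illustrative defaults.

```python
import time

class CircuitBreaker:
    """Open after `threshold` failures within `window_s` seconds;
    allow a single probe again after `cooldown_s` (half-open)."""
    def __init__(self, threshold=3, window_s=60.0, cooldown_s=300.0):
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.failures = []      # timestamps of recent failures
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.cooldown_s:
            self.opened_at = None      # half-open: permit one probe
            self.failures.clear()
            return True
        return False

    def record_failure(self):
        now = time.time()
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now

    def record_success(self):
        self.failures.clear()
```

Callers check `allow()` before each attempt; when it returns `False`, they route to an alternative component or fail fast with a clear error, which is exactly what breaks the retry-and-fail loop.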
Designing for Recoverability
The patterns above work only if the system is designed with recoverability as a first-class architectural concern. Several design decisions are critical.
Separate reasoning from side effects. Structure agent execution so that the agent reasons about what to do, the system validates the proposed action, and only then does the action execute. This creates a natural intervention point where validation can catch errors before they become committed side effects. The action tool pattern of requiring confirmation for high-impact operations is an instance of this principle.
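The propose-validate-execute split might be sketched as follows. The tool names, the proposal shape, and the high-impact list are illustrative assumptions; the key property is that nothing executes until the system, not the model, has approved it.

```python
# Hypothetical set of operations that require human confirmation.
HIGH_IMPACT = {"send_email", "update_record", "issue_refund"}

def execute_proposal(proposal: dict, tools: dict, require_confirmation):
    """`proposal` is the reasoning step's output, e.g.
    {"tool": "lookup", "args": {...}}. The system validates it
    before any side effect occurs."""
    tool = proposal.get("tool")
    if tool not in tools:
        return ("rejected", f"unknown tool: {tool!r}")
    if tool in HIGH_IMPACT and not require_confirmation(proposal):
        return ("held", "awaiting human confirmation")
    result = tools[tool](**proposal.get("args", {}))
    return ("executed", result)
```

The "held" state is the natural intervention point: the proposal is persisted, a human approves or rejects it, and only then does the side effect commit.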
Make state explicit and persistent. At every significant boundary — between pipeline stages, before and after tool calls, at coordination handoff points — persist the current state. This enables checkpoint-and-resume recovery and makes the system’s behavior auditable after the fact. Implicit state that lives only in an agent’s context window is state that disappears when the agent fails and cannot be recovered.
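A checkpoint-and-resume sketch for a staged pipeline, assuming state is JSON-serializable; the checkpoint format and stage interface are illustrative.

```python
import json
import os
import tempfile

def save_checkpoint(path: str, stage: str, state):
    """Persist state at a stage boundary. Written atomically so a
    crash mid-write cannot corrupt the checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump({"stage": stage, "state": state}, f)
    os.replace(tmp, path)

def resume(path: str, stages, initial_state):
    """`stages` is an ordered list of (name, fn). Stages already
    recorded in the checkpoint are skipped; the rest run and
    checkpoint as they complete."""
    done_stage, state = None, initial_state
    if os.path.exists(path):
        with open(path) as f:
            cp = json.load(f)
        done_stage, state = cp["stage"], cp["state"]
    past_done = done_stage is None
    for name, fn in stages:
        if not past_done:
            if name == done_stage:
                past_done = True
            continue
        state = fn(state)
        save_checkpoint(path, name, state)
    return state
```

Because each stage's output is on disk, a failed run resumes from the last boundary instead of replaying completed work — and the checkpoint files double as an audit trail of what the system actually did.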
Design idempotent operations. Every agent operation that might need to be retried should produce the same result when executed twice. For tool calls, this means using idempotency keys. For agent reasoning, this means structuring prompts so that the same input produces consistent output (lower temperature, structured output schemas, explicit constraints). Perfect idempotency is impossible with LLMs, but approximate idempotency — where retries produce equivalent rather than identical results — is achievable and sufficient.
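For the tool-call side, an idempotency-key wrapper might look like this sketch. The in-memory cache stands in for whatever durable store a production system would use, and `execute` is a placeholder for the real side effect.

```python
import hashlib
import json

class IdempotentExecutor:
    """Derives a stable key from the operation and its arguments so
    the same logical call retried twice executes only once."""
    def __init__(self, execute):
        self.execute = execute
        self.results = {}  # key -> cached result (durable in production)

    @staticmethod
    def key_for(operation: str, args: dict) -> str:
        # Canonical JSON so argument order doesn't change the key.
        canon = json.dumps({"op": operation, "args": args}, sort_keys=True)
        return hashlib.sha256(canon.encode()).hexdigest()

    def run(self, operation: str, args: dict):
        key = self.key_for(operation, args)
        if key not in self.results:
            self.results[key] = self.execute(operation, args)
        return self.results[key]
```

This covers the tool-call half of the principle; the reasoning half — lower temperature, structured output, explicit constraints — lives in prompt and sampling configuration rather than code like this.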
Build error budgets, not error elimination. Accepting that agent systems will produce errors is a prerequisite for handling them well. Define acceptable error rates by task type. Monitor actual error rates against those budgets. Invest in detection and recovery for the error categories that exceed their budgets, rather than trying to eliminate all errors everywhere. This is the reliability engineering approach applied to agentic systems: measure, set targets, invest proportionally.
Key Takeaways
Agent failures differ fundamentally from traditional software bugs — semantic errors, context degradation, cascade contamination, and cost runaway produce confident-looking output that is partially or entirely wrong, making detection harder than correction.

Effective error handling requires purpose-built detection at multiple levels: schema validation at every handoff point, behavioral monitoring for progressive degradation, cross-agent validation to prevent cascade failures, and explicit budget enforcement to prevent cost runaway.

Recovery patterns — context-adjusted retries, fallback to deterministic methods, human escalation with sufficient context, compensation for committed side effects, and circuit breakers for persistent failures — only work when the system is designed for recoverability from the start, with explicit state persistence, separated reasoning and side effects, and idempotent operations as architectural foundations.