Agent systems fail differently than traditional software. Select a failure type to see how it manifests, how to detect it, and the recovery patterns that address it.
External service is unavailable, slow, or rejecting requests. Clear error signal.
Standard error handling catches these. The challenge is deciding what the agent should do next.
Exponential backoff for transient failures. Set max retries to prevent loops.
After N failures, stop trying. Route to fallback or fail fast with clear error.
Switch to a backup data source or different API endpoint.
Produce partial result without the failed tool's contribution. Flag the gap.
Tool succeeds. Model processes result. Interpretation is wrong. No error signal — system reports success.
Requires domain-specific validation layers that check meaning, not just format.
Re-prompt with additional guidance about the specific misinterpretation.
Fall back to rule-based parser or template when LLM extraction fails validation.
Surface the output with validation failure details for human review.
Cross-check with a second model or different prompt strategy.
Progressive quality erosion. Agent loses track of earlier information. Contradicts prior statements. Tool selections become less precise.
Behavioral monitoring over the session. No single indicator — combined signal detection.
Periodically compress conversation history while preserving key facts.
Split long tasks into bounded sessions with explicit state handoff.
Keep only the most relevant prior context. Drop completed sub-task details.
Persist critical state outside the context window. Agent retrieves as needed.
Error amplifies through the chain. Each agent adds confidence. Final output is coherent, well-structured, and wrong.
Requires validation at every agent boundary. Cross-agent tracing to find the origin point.
Schema and semantic validation at every inter-agent handoff point.
Downstream agents re-derive key facts from their own tools before accepting upstream claims.
Resume from the last validated checkpoint rather than the point of failure.
Issue corrections for committed side effects. Track which effects need compensating.
System functions correctly at each step. Accumulated cost far exceeds value. No functional error to catch.
Explicit budgets per task, per agent, and per session. Alert at thresholds.
Hard limits per task and agent. Produce best-effort output when budget is reached.
Maximum reasoning steps and tool calls per task. Prevents infinite loops.
Maximum turns in agent-to-agent negotiation. Escalate if no convergence.
System-wide spend threshold that pauses non-critical work automatically.