Tool Resilience & Recovery¶

Cogent's resilience layer sits between the agent loop and every tool call. It provides two independently controllable tiers of recovery, configured through a single flat ResilienceConfig dataclass.

Two-Tier Model¶

Tier	Who decides?	When it runs
Systematic retry	Developer	Retries happen automatically before the LLM ever sees the error
Intelligent retry	LLM	Error context is returned to the agent loop; the model chooses what to do next

Both tiers are active inside agent.run() and any programmatic tool call.

Quick Start¶

from cogent import Agent
from cogent.agent.resilience import ResilienceConfig

# Default: 3 retries, exponential-jitter backoff, hand off to LLM on exhaustion
agent = Agent(name="Bot", model=model, tools=[...])

# Explicit config
agent = Agent(
    name="Bot",
    model=model,
    tools=[...],
    resilience=ResilienceConfig(
        max_retries=3,
        strategy="exponential_jitter",
    ),
)

Systematic Retry¶

The framework retries the failing tool call mechanically using the configured backoff schedule. The LLM is not consulted during this tier.

agent = Agent(
    name="ReliableBot",
    model=model,
    tools=[flaky_api],
    observer="detailed",       # see retry events in output
    resilience=ResilienceConfig(
        max_retries=3,
        strategy="exponential",
        base_delay=1.0,
        max_delay=30.0,
    ),
)
result = await agent.run("Fetch the report")

# Inspect retry events after the run
errors = result.events_of("tool.error")
print(f"Tool was retried {len(errors)} time(s)")

Backoff Strategies¶

Value	Behaviour
`"exponential_jitter"`	`base_delay × 2^(attempt-1)` plus random jitter (default)
`"exponential"`	`base_delay × 2^(attempt-1)`, no jitter
`"linear"`	`base_delay × attempt`
`"fixed"`	Constant `base_delay` between retries
`"none"`	No delay

Strategy values are case-insensitive strings or RetryStrategy enum members.

Retryable vs. Non-Retryable Errors¶

By default the policy retries on ConnectionError, TimeoutError, OSError, and any exception whose message contains common transient-error patterns ("rate limit", "503", "too many requests", etc.).

It does not retry on ValueError, TypeError, PermissionError, KeyError, or messages matching auth patterns ("401", "api key", "unauthorized", etc.).

Intelligent Retry¶

When systematic retries are exhausted, the error context (tool name, args, error message, attempt count) is automatically fed back into the agent loop. The LLM then decides what to do: try a different tool, reformulate the arguments, or explain why it cannot proceed.

agent = Agent(
    name="SearchBot",
    model=model,
    tools=[web_search, cached_search, local_index],
    observer="progress",
    resilience=ResilienceConfig(
        max_retries=0,             # fail on first error — let the LLM cascade
    ),
    instructions=(
        "Try web_search first, then cached_search, then local_index. "
        "When a tool fails, immediately try the next one."
    ),
)
result = await agent.run("Find information about the framework")

on_exhaustion="raise" propagates the exception to the caller instead. Use this when you want hard failures rather than graceful recovery.

Model Escalation¶

When a cheaper or faster model keeps failing structured output validation, fallback_model escalates to a more capable model for one extra attempt. The fallback sees the full correction history from previous failures.

agent = Agent(
    name="Extractor",
    model="gpt-5.4-nano",          # fast/cheap for first attempts
    resilience=ResilienceConfig(
        fallback_model="gpt-5.4",   # escalate to larger model on exhaustion
    ),
)

The escalation sequence: primary model attempts → self-correction retries → fallback model attempt → on_exhaustion behaviour.

Per-Tool Overrides¶

Override any ResilienceConfig field for specific tools:

resilience = ResilienceConfig(
    max_retries=3,
    strategy="exponential_jitter",
    tool_overrides={
        "payment_api": {"max_retries": 0},            # never retry payment calls
        "flaky_search": {"max_retries": 5, "base_delay": 0.5},
        "slow_report":  {"timeout_seconds": 300.0},   # 5-minute timeout
    },
)

Each override is a flat dict with any subset of ResilienceConfig fields.

Timeout¶

ResilienceConfig(
    max_retries=3,
    timeout_seconds=30.0,   # per-call timeout; None disables it
)

Timeout applies to each individual attempt. A timed-out call raises TimeoutError, which is retryable by default.

Observing Retries¶

Pass a string level to observer for inline output:

agent = Agent(
    name="Bot",
    model=model,
    tools=[...],
    observer="progress",   # shows retry events as they happen
    resilience=ResilienceConfig(max_retries=3),
)

Retry events appear in the output alongside the call and final outcome:

[Bot] [tool-call] a1b2c3d4
  flaky_api(query='data')
[Bot] [tool-failed] a1b2c3d4
  flaky_api ConnectionError: upstream timeout
[Bot] [tool-retry] a1b2c3d4 1/3
  flaky_api ConnectionError: upstream timeout
[Bot] [tool-retry] a1b2c3d4 2/3
  flaky_api ConnectionError: upstream timeout
[Bot] [tool-result] a1b2c3d4 (1.2s)
  flaky_api {status: 'ok', data: ...}

Or inspect after the run:

result = await agent.run("...")
retries = result.events_of("tool.retry")   # one event per failed attempt
for evt in retries:
    print(evt.data["tool_name"], evt.data["attempt"], evt.data["error"])

`ResilienceConfig` Reference¶

Field	Type	Default	Description
`max_retries`	`int`	`3`	Retry attempts after first failure. `0` = no retry.
`strategy`	`str \\| RetryStrategy`	`"exponential_jitter"`	Backoff strategy.
`base_delay`	`float`	`1.0`	Base delay in seconds.
`max_delay`	`float`	`60.0`	Delay cap in seconds.
`jitter_factor`	`float`	`0.25`	Jitter multiplier (exponential_jitter only).
`on_exhaustion`	`"raise" \\| "return"`	`"return"`	Behaviour when retries are exhausted (tools and structured output).
`fallback_model`	`str \\| BaseChatModel \\| None`	`None`	Escalate to a stronger model after primary retries are exhausted.
`timeout_seconds`	`float \\| None`	`60.0`	Per-call timeout. `None` disables.
`tool_overrides`	`dict[str, dict]`	`{}`	Per-tool field overrides.

Migration from < 1.18.0¶

The following APIs were removed in 1.18.0:

Removed	Replacement
`ResilienceConfig(retry_policy=RetryPolicy(...))`	`ResilienceConfig(max_retries=3, strategy="exponential")`
`ResilienceConfig.aggressive()`	`ResilienceConfig(max_retries=5, base_delay=0.5)`
`ResilienceConfig.fast_fail()`	`ResilienceConfig(max_retries=0, on_exhaustion="raise")`
`ResilienceConfig.balanced()`	`ResilienceConfig()` (default)
`CircuitBreaker`	Remove — use `on_exhaustion="return"` instead
`FallbackRegistry`	Remove — register fallback tools directly and let the LLM cascade
`RecoveryAction`	Remove — `on_exhaustion` covers the supported modes