Tool Resilience & Recovery
Cogent's resilience layer sits between the agent loop and every tool call.
It provides two independently controllable tiers of recovery, configured
through a single flat ResilienceConfig dataclass.
Two-Tier Model
| Tier | Who decides? | When it runs |
|---|---|---|
| Systematic retry | Developer | Retries happen automatically before the LLM ever sees the error |
| Intelligent retry | LLM | Error context is returned to the agent loop; the model chooses what to do next |
Both tiers are active inside agent.run() and any programmatic tool call.
Quick Start
from cogent import Agent
from cogent.agent.resilience import ResilienceConfig

# Default: 3 retries, exponential-jitter backoff, hand off to LLM on exhaustion
agent = Agent(name="Bot", model=model, tools=[...])

# Explicit config
agent = Agent(
    name="Bot",
    model=model,
    tools=[...],
    resilience=ResilienceConfig(
        max_retries=3,
        strategy="exponential_jitter",
    ),
)
Systematic Retry
The framework retries the failing tool call mechanically using the configured backoff schedule. The LLM is not consulted during this tier.
agent = Agent(
    name="ReliableBot",
    model=model,
    tools=[flaky_api],
    observer="detailed",  # see retry events in output
    resilience=ResilienceConfig(
        max_retries=3,
        strategy="exponential",
        base_delay=1.0,
        max_delay=30.0,
    ),
)
result = await agent.run("Fetch the report")

# Inspect retry events after the run
retries = result.events_of("tool.retry")
print(f"Tool was retried {len(retries)} time(s)")
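The mechanical loop described in this section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Cogent's implementation: `call_with_retries` and its injectable `sleep` parameter are hypothetical names, only the "exponential" schedule is modelled, and only `ConnectionError` and `TimeoutError` are treated as retryable.

```python
import time
from typing import Any, Callable


def call_with_retries(
    tool: Callable[..., Any],
    *args: Any,
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
    **kwargs: Any,
) -> Any:
    """Call `tool`, retrying up to `max_retries` times with exponential backoff."""
    attempt = 0  # number of retries performed so far
    while True:
        try:
            return tool(*args, **kwargs)
        except (ConnectionError, TimeoutError):
            if attempt >= max_retries:
                raise  # retries exhausted: the caller decides what happens next
            attempt += 1
            # "exponential" schedule: base_delay * 2^(attempt-1), capped at max_delay
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
```

With `max_retries=0` the first failure propagates immediately, which is the setting the intelligent-retry tier relies on when the model should react to every error itself.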
Backoff Strategies
| Value | Behaviour |
|---|---|
| "exponential_jitter" | base_delay × 2^(attempt-1) plus random jitter (default) |
| "exponential" | base_delay × 2^(attempt-1), no jitter |
| "linear" | base_delay × attempt |
| "fixed" | Constant base_delay between retries |
| "none" | No delay |
Strategy values are case-insensitive strings or RetryStrategy enum members.
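The formulas in the table can be turned into a small delay calculator. The sketch below is illustrative only: `backoff_delay` is a hypothetical helper, and the exact jitter formula is an assumption (here a random extra of up to `jitter_factor` times the delay is added before the `max_delay` cap is applied); the library may compute jitter differently.

```python
import random
from typing import Optional


def backoff_delay(
    strategy: str,
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    jitter_factor: float = 0.25,
    rng: Optional[random.Random] = None,
) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    name = strategy.lower()  # strategy strings are case-insensitive
    if name == "none":
        return 0.0
    if name == "fixed":
        delay = base_delay
    elif name == "linear":
        delay = base_delay * attempt
    elif name in ("exponential", "exponential_jitter"):
        delay = base_delay * 2 ** (attempt - 1)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if name == "exponential_jitter":
        # Assumption: jitter is an additive random extra in [0, jitter_factor * delay].
        delay += (rng or random).uniform(0.0, jitter_factor * delay)
    return min(delay, max_delay)  # never wait longer than the cap
```

For example, with the defaults the "exponential" schedule yields 1, 2, 4, 8 seconds for attempts 1 to 4, and any attempt whose raw delay exceeds `max_delay` is clamped to it.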
Retryable vs. Non-Retryable Errors
By default the policy retries on ConnectionError, TimeoutError, OSError,
and any exception whose message contains common transient-error patterns
("rate limit", "503", "too many requests", etc.).
It does not retry on ValueError, TypeError, PermissionError,
KeyError, or messages matching auth patterns ("401", "api key",
"unauthorized", etc.).
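One way to picture these rules is a small classifier. The sketch below is an assumption-laden illustration, not the library's policy code: `is_retryable` and the constant names are hypothetical, the pattern tuples contain only the examples quoted above, and the precedence (non-retryable checks first) is a guess. The ordering matters because in Python `PermissionError` is a subclass of `OSError`, so checking the retryable types first would retry it.

```python
RETRYABLE_TYPES = (ConnectionError, TimeoutError, OSError)
NON_RETRYABLE_TYPES = (ValueError, TypeError, PermissionError, KeyError)
TRANSIENT_PATTERNS = ("rate limit", "503", "too many requests")
AUTH_PATTERNS = ("401", "api key", "unauthorized")


def is_retryable(exc: BaseException) -> bool:
    """Return True if `exc` looks transient under the default rules."""
    message = str(exc).lower()
    # Non-retryable checks run first: PermissionError subclasses OSError,
    # so testing the retryable types first would wrongly retry it.
    if isinstance(exc, NON_RETRYABLE_TYPES):
        return False
    if any(pattern in message for pattern in AUTH_PATTERNS):
        return False
    if isinstance(exc, RETRYABLE_TYPES):
        return True
    # Fall back to message matching for other exception types.
    return any(pattern in message for pattern in TRANSIENT_PATTERNS)
```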
Intelligent Retry
When systematic retries are exhausted, the error context (tool name, args, error message, attempt count) is automatically fed back into the agent loop. The LLM then decides what to do: try a different tool, reformulate the arguments, or explain why it cannot proceed.
agent = Agent(
    name="SearchBot",
    model=model,
    tools=[web_search, cached_search, local_index],
    observer="progress",
    resilience=ResilienceConfig(
        max_retries=0,  # fail on first error and let the LLM cascade
    ),
    instructions=(
        "Try web_search first, then cached_search, then local_index. "
        "When a tool fails, immediately try the next one."
    ),
)
result = await agent.run("Find information about the framework")
Setting on_exhaustion="raise" propagates the exception to the caller instead of handing the error context to the LLM.
Use this when you want hard failures rather than graceful recovery.
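The two exhaustion modes can be sketched as a single branch. This is a hypothetical illustration: `handle_exhaustion` and the dictionary keys are invented names, chosen to mirror the error context listed above (tool name, args, error message, attempt count); the real payload shape may differ.

```python
from typing import Any, Dict


def handle_exhaustion(
    tool_name: str,
    args: Dict[str, Any],
    error: Exception,
    attempts: int,
    on_exhaustion: str = "return",
) -> Dict[str, Any]:
    """Decide what happens once systematic retries are used up."""
    if on_exhaustion == "raise":
        raise error  # hard failure: the caller of agent.run() sees the exception
    if on_exhaustion != "return":
        raise ValueError(f"unknown on_exhaustion mode: {on_exhaustion!r}")
    # "return": package the failure so the model can choose the next step.
    return {
        "tool_name": tool_name,
        "args": args,
        "error": f"{type(error).__name__}: {error}",
        "attempts": attempts,
    }
```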
Model Escalation
When a cheaper or faster model keeps failing structured output validation,
fallback_model escalates to a more capable model for one extra attempt.
The fallback sees the full correction history from previous failures.
agent = Agent(
    name="Extractor",
    model="gpt-5.4-nano",  # fast/cheap for first attempts
    resilience=ResilienceConfig(
        fallback_model="gpt-5.4",  # escalate to larger model on exhaustion
    ),
)
The escalation sequence: primary model attempts → self-correction retries →
fallback model attempt → on_exhaustion behaviour.
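The escalation sequence can be sketched as a loop over an attempt plan. Everything here is hypothetical and simplified: `extract_with_escalation`, `call_model`, and `validate` are invented names, `max_retries` is assumed to count the self-correction retries, and returning `None` for on_exhaustion="return" is an assumption about what "return" means for structured output.

```python
from typing import Callable, List, Optional, Tuple

History = List[Tuple[str, str, str]]  # (model, raw output, validation problem)


def extract_with_escalation(
    call_model: Callable[[str, History], str],
    validate: Callable[[str], Optional[str]],  # None means the output is valid
    primary: str,
    fallback: Optional[str] = None,
    max_retries: int = 3,
    on_exhaustion: str = "return",
) -> Optional[str]:
    history: History = []  # correction history, also shown to the fallback
    plan = [primary] * (1 + max_retries)  # first attempt plus self-correction retries
    if fallback is not None:
        plan.append(fallback)  # exactly one extra attempt on the stronger model
    for model in plan:
        output = call_model(model, list(history))
        problem = validate(output)
        if problem is None:
            return output
        history.append((model, output, problem))
    if on_exhaustion == "raise":
        raise ValueError(f"output still invalid after {len(plan)} attempts")
    return None  # assumption: "return" yields no result instead of raising
```

Passing the accumulated history to every call is what lets the fallback model see the earlier failures rather than starting from scratch.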
Per-Tool Overrides
Override any ResilienceConfig field for specific tools:
resilience = ResilienceConfig(
    max_retries=3,
    strategy="exponential_jitter",
    tool_overrides={
        "payment_api": {"max_retries": 0},  # never retry payment calls
        "flaky_search": {"max_retries": 5, "base_delay": 0.5},
        "slow_report": {"timeout_seconds": 300.0},  # 5-minute timeout
    },
)
Each override is a flat dict with any subset of ResilienceConfig fields.
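A flat override dict maps naturally onto `dataclasses.replace`. The sketch below shows one plausible way such a merge could work; `RetrySettings` is a hypothetical stand-in for a few ResilienceConfig fields and `settings_for` is an invented helper, so this is not necessarily how the library resolves overrides.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class RetrySettings:
    """Hypothetical stand-in for a subset of ResilienceConfig fields."""
    max_retries: int = 3
    base_delay: float = 1.0
    timeout_seconds: Optional[float] = 60.0


def settings_for(
    tool_name: str,
    defaults: RetrySettings,
    overrides: Dict[str, Dict[str, Any]],
) -> RetrySettings:
    """Return the defaults with this tool's overrides applied on top."""
    # replace() raises TypeError for unknown field names, so typos fail loudly.
    return replace(defaults, **overrides.get(tool_name, {}))
```

Tools without an entry simply receive the global settings unchanged.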
Timeout
The timeout_seconds limit applies to each individual attempt, not to the retry sequence as a whole. A timed-out call raises
TimeoutError, which is retryable by default.
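A per-attempt timeout of this kind is commonly built on `asyncio.wait_for`. The following is a minimal sketch under that assumption; `call_with_timeout` is a hypothetical helper and the library may enforce timeouts differently.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


async def call_with_timeout(
    tool: Callable[..., Awaitable[Any]],
    timeout_seconds: Optional[float],
    *args: Any,
) -> Any:
    """Run one attempt of an async tool, bounded by `timeout_seconds`."""
    if timeout_seconds is None:
        return await tool(*args)  # None disables the timeout
    try:
        return await asyncio.wait_for(tool(*args), timeout=timeout_seconds)
    except asyncio.TimeoutError as exc:
        # On Python 3.11+ asyncio.TimeoutError is the builtin TimeoutError;
        # re-raising normalises the type on older versions too.
        raise TimeoutError(f"call exceeded {timeout_seconds}s") from exc
```

Because the wrapper raises the builtin `TimeoutError`, a retry layer that treats `TimeoutError` as retryable would start the next attempt with a fresh time budget.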
Observing Retries
Pass a string level to observer for inline output:
agent = Agent(
    name="Bot",
    model=model,
    tools=[...],
    observer="progress",  # shows retry events as they happen
    resilience=ResilienceConfig(max_retries=3),
)
Retry events appear in the output alongside the call and final outcome:
[Bot] [tool-call] a1b2c3d4
flaky_api(query='data')
[Bot] [tool-failed] a1b2c3d4
flaky_api ConnectionError: upstream timeout
[Bot] [tool-retry] a1b2c3d4 1/3
flaky_api ConnectionError: upstream timeout
[Bot] [tool-retry] a1b2c3d4 2/3
flaky_api ConnectionError: upstream timeout
[Bot] [tool-result] a1b2c3d4 (1.2s)
flaky_api {status: 'ok', data: ...}
Or inspect after the run:
result = await agent.run("...")
retries = result.events_of("tool.retry")  # one event per failed attempt
for evt in retries:
    print(evt.data["tool_name"], evt.data["attempt"], evt.data["error"])
ResilienceConfig Reference
| Field | Type | Default | Description |
|---|---|---|---|
| max_retries | int | 3 | Retry attempts after first failure. 0 = no retry. |
| strategy | str \| RetryStrategy | "exponential_jitter" | Backoff strategy. |
| base_delay | float | 1.0 | Base delay in seconds. |
| max_delay | float | 60.0 | Delay cap in seconds. |
| jitter_factor | float | 0.25 | Jitter multiplier (exponential_jitter only). |
| on_exhaustion | "raise" \| "return" | "return" | Behaviour when retries are exhausted (tools and structured output). |
| fallback_model | str \| BaseChatModel \| None | None | Escalate to a stronger model after primary retries are exhausted. |
| timeout_seconds | float \| None | 60.0 | Per-call timeout. None disables. |
| tool_overrides | dict[str, dict] | {} | Per-tool field overrides. |
Migration from < 1.18.0
The following APIs were removed in 1.18.0:
| Removed | Replacement |
|---|---|
| ResilienceConfig(retry_policy=RetryPolicy(...)) | ResilienceConfig(max_retries=3, strategy="exponential") |
| ResilienceConfig.aggressive() | ResilienceConfig(max_retries=5, base_delay=0.5) |
| ResilienceConfig.fast_fail() | ResilienceConfig(max_retries=0, on_exhaustion="raise") |
| ResilienceConfig.balanced() | ResilienceConfig() (default) |
| CircuitBreaker | Remove; use on_exhaustion="return" instead |
| FallbackRegistry | Remove; register fallback tools directly and let the LLM cascade |
| RecoveryAction | Remove; on_exhaustion covers the supported modes |
See Also
- examples/resilience/tool_resilience.py: Live demos of all three recovery tiers
- docs/observability.md: Observer levels and event inspection
- docs/tool-building.md: Creating tools