Cache, Episodic & ACC¶
Advanced memory systems: semantic caching, graph-backed episodic memory, and agentic context compression.
See Memory Overview for the full memory architecture.
Semantic Cache¶
SemanticCache provides embedding-based caching with configurable similarity thresholds. When a query is "close enough" to a cached entry, the cache returns the stored result instead of making an expensive LLM or API call.
Key Benefits:
- 80%+ hit rates: cache similar queries, not just exact matches
- 7-10× speedup: cached responses return instantly
- Cost reduction: fewer API calls mean lower costs
- Automatic eviction: LRU policy and TTL expiration
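To make the "close enough" idea concrete, here is a minimal sketch of an embedding-similarity lookup. It is not the SemanticCache implementation: `ToySemanticCache` is a hypothetical class, the three-dimensional vectors stand in for real embeddings, and a linear scan replaces whatever index the library may use.

```python
import math
from typing import List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


class ToySemanticCache:
    """Hypothetical illustration: linear scan over (embedding, value) pairs."""

    def __init__(self, threshold: float = 0.85) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def put(self, embedding: List[float], value: str) -> None:
        self.entries.append((embedding, value))

    def get(self, embedding: List[float]) -> Optional[str]:
        # Find the most similar stored entry, then apply the threshold.
        best_score, best_value = -1.0, None
        for stored, value in self.entries:
            score = cosine(stored, embedding)
            if score > best_score:
                best_score, best_value = score, value
        return best_value if best_score >= self.threshold else None


cache = ToySemanticCache(threshold=0.85)
cache.put([1.0, 0.0, 0.2], "cached answer about Python frameworks")

near = cache.get([0.9, 0.1, 0.2])  # nearly the same direction: cache hit
far = cache.get([0.0, 1.0, 0.0])   # orthogonal to the stored vector: miss
```

A hit returns the stored value without recomputing anything, while a miss returns `None`, which is the point at which a real cache would fall through to the model or API call.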
Quick Start¶
Enable caching with cache=True:
from cogent import Agent
agent = Agent(
    model="gpt-5.4-mini",
    cache=True,  # Enable semantic cache with defaults
)
# First query
await agent.run("What are the best Python frameworks?")
# Similar query hits cache (instant!)
await agent.run("What are the top Python frameworks?")
Custom Configuration¶
Pass a SemanticCache instance for custom settings:
from cogent import Agent
from cogent.memory import SemanticCache
from cogent.models import create_embedding
# Create embedding model
embed = create_embedding("openai", "text-embedding-3-small")
agent = Agent(
    model="gpt-5.4-mini",
    cache=SemanticCache(
        embedding=embed,            # Embedding model (required for custom)
        similarity_threshold=0.90,  # Stricter matching (default: 0.85)
        max_entries=5000,           # Cache size (default: 10000)
        default_ttl=3600,           # 1 hour TTL in seconds (default: 86400, i.e. 24 hours)
    ),
)
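The `max_entries` and `default_ttl` settings correspond to the LRU and TTL eviction mentioned under Key Benefits. The sketch below shows one common way to combine the two policies with `collections.OrderedDict`. The class `LruTtlStore` is hypothetical, it uses exact string keys rather than embeddings, and it is only meant to illustrate the bookkeeping, not how SemanticCache does it internally.

```python
import time
from collections import OrderedDict
from typing import Optional


class LruTtlStore:
    """Hypothetical store: evicts by capacity (LRU) and by age (TTL)."""

    def __init__(self, max_entries: int, default_ttl: float) -> None:
        self.max_entries = max_entries
        self.default_ttl = default_ttl
        self._data = OrderedDict()  # key -> (expires_at, value), oldest use first

    def put(self, key: str, value: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._data[key] = (now + self.default_ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the least recently used entry

    def get(self, key: str, now: Optional[float] = None) -> Optional[str]:
        now = time.monotonic() if now is None else now
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if now >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        self._data.move_to_end(key)  # a read counts as a use
        return value


store = LruTtlStore(max_entries=2, default_ttl=3600)
store.put("a", "A", now=0)
store.put("b", "B", now=1)
store.get("a", now=2)       # touching "a" makes "b" the least recently used
store.put("c", "C", now=3)  # over capacity, so "b" is evicted
```

The `now` parameter exists only so the example is deterministic; in normal use the monotonic clock is read on each call.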
Similarity Threshold:
| Threshold | Behavior | Use Case |
|---|---|---|
| 0.95-1.0 | Very strict, near-exact | Deterministic outputs |
| 0.85-0.95 | Balanced, similar intent | General purpose (default) |
| 0.70-0.85 | Loose, broad matching | Exploratory queries |
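As a worked example of the table above, the snippet below scores one query against one cached entry and checks that score against several thresholds. The two-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and the 28° angle is an arbitrary choice that gives a similarity of about 0.883.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


query = [1.0, 0.0]
angle = math.radians(28)
cached = [math.cos(angle), math.sin(angle)]  # unit vector 28 degrees from the query

score = cosine(query, cached)

# True means the cached entry would be returned at that threshold.
decisions = {t: score >= t for t in (0.95, 0.90, 0.85, 0.70)}
```

The same pair is a miss at 0.95 and 0.90 but a hit at 0.85 and 0.70, which is the trade-off the threshold controls: higher values reduce wrong hits at the cost of fewer hits overall.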
Tool-Level Caching¶
Use @tool(cache=True) to cache expensive tool calls:
from cogent import Agent, tool
@tool(cache=True)
async def search_products(query: str) -> str:
    """Search products in the catalog."""
    # product_api stands in for your own catalog client
    return await product_api.search(query)

agent = Agent(
    model="gpt-5.4-mini",
    tools=[search_products],
    cache=True,  # Required: tools use the agent's cache
)
# First call executes the tool
await agent.run("Find running shoes")
# Similar query hits cache
await agent.run("Show me running sneakers") # Cache hit!
See tool-building.md for more details.
When to Use¶
| Use Semantic Cache When | Don't Use When |
|---|---|
| User queries with variation | Need exact-match guarantees |
Episodic Memory (Graph-Backed)¶
Episodic memory gives agents structured recall across turns and sessions. An episode captures a sequence of observations (user messages, tool results, reflections) as graph entities linked by temporal edges. After each turn an LLM extracts entities and relationships from the conversation, building a persistent semantic knowledge graph. Recall uses graph traversal (BFS + Personalized PageRank) to return structured facts — not raw conversation dumps.
Inspired by AriGraph and Graphiti/Zep.
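The description above can be pictured with a small data model: observations are nodes, and each new observation is linked to the previous one by a temporal edge. This is a sketch under assumptions, not the library's schema. The `Episode` and `Observation` dataclasses here, the ID format, and the `first` and `next` relation names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Observation:
    id: str
    content: str
    kind: str


@dataclass
class Episode:
    id: str
    observations: List[Observation] = field(default_factory=list)
    # Edges are (source_id, relation, target_id) triples.
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add(self, content: str, kind: str) -> Observation:
        obs = Observation(
            id=f"{self.id}:obs-{len(self.observations)}",
            content=content,
            kind=kind,
        )
        if self.observations:
            # Temporal edge from the previous observation to this one.
            self.edges.append((self.observations[-1].id, "next", obs.id))
        else:
            # The first observation hangs off the episode node itself.
            self.edges.append((self.id, "first", obs.id))
        self.observations.append(obs)
        return obs


ep = Episode(id="ep-1")
ep.add("User asked about Python async.", kind="user")
ep.add("Explained asyncio event loop.", kind="assistant")
```

Walking the `next` edges from the episode node reproduces the conversation order, which is what lets a graph store both sequence and semantic links in one structure.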
Basic Usage¶
from cogent.graph import Graph
from cogent.memory import EpisodicMemory
graph = Graph()
episodic = EpisodicMemory(graph) # model set by Agent automatically
# Start an episode for a conversation
ep = await episodic.start_episode(agent="assistant", thread_id="t-1")
# Record observations as the conversation progresses
await episodic.add_observation(ep.id, "User asked about Python async.", kind="user")
await episodic.add_observation(ep.id, "Explained asyncio event loop.", kind="assistant")
# Close when the interaction ends
await episodic.close_episode(ep.id, outcome="success")
Recording Turns with Entity Extraction¶
record_turn records a full user→assistant exchange and, when a model
is available, automatically extracts entities and relationships into
the semantic graph:
ep = await episodic.start_episode(agent="assistant", thread_id="t-1")
# Entities like "Python", "asyncio" and relationships like
# "Python --has_library--> asyncio" are auto-extracted
await episodic.record_turn(
    ep.id,
    user_message="How does Python's asyncio work?",
    assistant_message="asyncio provides an event loop for async I/O.",
    model=my_model,  # or set on EpisodicMemory / via Agent
)
Semantic Graph¶
After recording, the graph contains semantic entities and their relationships alongside the episodic observation nodes:
# See what entities were extracted
entities = await episodic.get_entities()
for e in entities:
    print(f"{e['name']} ({e['type']}), mentioned {e['mention_count']}x")
# See all relationship facts
facts = await episodic.get_facts()
for f in facts:
    print(f"{f.subject} --[{f.relation}]--> {f.object}")
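The `mention_count` field suggests that repeated extractions are merged rather than duplicated. One plausible way to do that is sketched below: triples are deduplicated in a set while a counter tracks how often each entity appears. The `merge_extraction` helper is hypothetical, and the triples are hard-coded where the library would use LLM output.

```python
from collections import Counter

mention_count = Counter()  # entity name -> number of mentions
facts = set()              # unique (subject, relation, object) triples


def merge_extraction(triples):
    """Merge triples from one turn into the running graph state."""
    for subject, relation, obj in triples:
        mention_count[subject] += 1
        mention_count[obj] += 1
        facts.add((subject, relation, obj))


# Turn 1: one fact extracted.
merge_extraction([("Python", "has_library", "asyncio")])

# Turn 2: a new fact plus a repeat of the first one.
merge_extraction([
    ("asyncio", "provides", "event loop"),
    ("Python", "has_library", "asyncio"),
])
```

After both turns there are two distinct facts, while the counts record that "asyncio" came up more often than "Python" or "event loop".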
Semantic Linking (Manual)¶
You can also manually link observations to entities in the graph:
await graph.add_entity("python", "Language", name="Python")
obs = await episodic.add_observation(ep.id, "Discussed Python async.")
await episodic.link_entity(obs.id, "python")
Recall¶
Recall extracts query entities via an LLM, resolves them to graph nodes,
and uses breadth-first search (BFS) plus Personalized PageRank (PPR) to find
connected facts. Results are structured RecalledFact triples, not raw
conversation text:
facts = await episodic.recall("async event loops", top_k=5)
for fact in facts:
    print(f"{fact.subject} {fact.relation} {fact.object}")
# Filter by agent, thread, or user
facts = await episodic.recall("async", agent="assistant", thread_id="t-1")
When no model is available or no semantic entities exist, recall falls back to keyword-based observation matching (backwards compatible).
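The graph-traversal step of recall can be sketched on a toy entity graph. The source does not say how the library combines the two algorithms, so this example assumes one reasonable arrangement: BFS bounds a neighborhood around the seed entities, and Personalized PageRank then ranks the nodes inside it. The graph, the function names, and the parameter values (`max_hops`, `alpha`, `iterations`) are all illustrative.

```python
from collections import deque

# Undirected toy graph: entity -> neighbours.
graph = {
    "Python": ["asyncio", "Django"],
    "asyncio": ["Python", "event loop"],
    "event loop": ["asyncio"],
    "Django": ["Python"],
    "Rust": ["tokio"],
    "tokio": ["Rust"],
}


def bfs_subgraph(seeds, max_hops):
    """Return the nodes reachable from any seed within max_hops edges."""
    depth = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue
        for nb in graph[node]:
            if nb not in depth:
                depth[nb] = depth[node] + 1
                queue.append(nb)
    return set(depth)


def personalized_pagerank(nodes, seeds, alpha=0.15, iterations=50):
    """Power iteration; with probability alpha the walk restarts at a seed."""
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            nbs = [m for m in graph[n] if m in nodes]
            for m in nbs:
                # Spread this node's rank evenly over its in-subgraph neighbours.
                nxt[m] += (1 - alpha) * rank[n] / len(nbs)
        rank = nxt
    return rank


seeds = ["asyncio"]
nodes = bfs_subgraph(seeds, max_hops=2)
scores = personalized_pagerank(nodes, seeds)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because the restart mass always returns to the seed, entities close to "asyncio" score highest, and the disconnected "Rust" and "tokio" nodes never enter the candidate set.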
Reflection¶
reflect() reads recent observations, asks an LLM to identify patterns and
lessons, and stores each lesson as an Observation(kind="reflection") in the
episodic graph. Reflections are then discoverable via recall() alongside
regular observations.
# Reflect on a specific episode
reflections = await episodic.reflect(model, episode_id=ep.id)
# Reflect across recent episodes for an agent
reflections = await episodic.reflect(model, agent="assistant", max_observations=20)
for r in reflections:
    print(r.content)  # "User is building Python knowledge progressively"
Inspired by Generative Agents (Park et al., 2023) and CoALA's learning actions (Sumers et al., 2024).
Querying Episodes¶
# List episodes (most recent first)
episodes = await episodic.list_episodes(agent="assistant", limit=10)
# Get observations for a specific episode
observations = await episodic.get_observations(ep.id)
When to Use¶
| Use Episodic Memory When | Don't Use When |
|---|---|
| Long-running agents across sessions | Single-turn Q&A |
| Need structured recall of past interactions | Simple key-value facts (use knowledge memory) |
| Want cross-episode entity linking | Conversation history suffices |
| Multi-agent shared experience graph | Bounded context is enough (ACC) |
ACC (Agentic Context Compression)¶
ACC provides bounded memory for long conversations, preventing context drift and memory poisoning.
Basic Usage¶
Enable ACC with acc=True on Agent:
from cogent import Agent
# Enable on Agent (bool or ACC instance)
agent = Agent(name="Assistant", model="gpt-5.4", acc=True)
Custom ACC Bounds¶
For fine-grained control, pass custom bounds directly:
from cogent import Agent
from cogent.memory.acc import AgentCognitiveCompressor
# Create ACC with custom bounds
acc = AgentCognitiveCompressor(
    max_constraints=10,  # Rules, guidelines
    max_entities=30,     # Facts, knowledge
    max_actions=20,      # Past actions
    max_context=15,      # Relevant context
)
# Pass directly to Agent
agent = Agent(name="Assistant", model="gpt-5.4", acc=acc)
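To illustrate what a bound such as `max_entities` implies, the sketch below keeps four fixed-capacity buckets and shows that memory stays constant no matter how many items are added. The `BoundedState` class is hypothetical, and its oldest-first eviction is a deliberate simplification: the real compressor presumably decides what to keep in a smarter way, which this example does not attempt to reproduce.

```python
from collections import deque


class BoundedState:
    """Hypothetical stand-in: fixed-capacity buckets with oldest-first eviction."""

    def __init__(self, max_constraints=10, max_entities=30,
                 max_actions=20, max_context=15):
        # deque(maxlen=n) silently drops the oldest item once it holds n items.
        self.buckets = {
            "constraints": deque(maxlen=max_constraints),
            "entities": deque(maxlen=max_entities),
            "actions": deque(maxlen=max_actions),
            "context": deque(maxlen=max_context),
        }

    def add(self, bucket, item):
        self.buckets[bucket].append(item)

    def size(self):
        return sum(len(b) for b in self.buckets.values())


state = BoundedState(max_entities=3)
for i in range(100):
    state.add("entities", f"fact-{i}")

remaining = list(state.buckets["entities"])
```

After 100 additions only the three most recent facts remain, so the state handed to the model has a hard upper size regardless of conversation length.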
Thread ID for Context Persistence¶
ACC requires thread_id to persist context across multiple run() calls:
# Same thread_id = context persists
await agent.run("My name is Alice", thread_id="session-1")
await agent.run("What's my name?", thread_id="session-1") # Remembers!
# Different thread_id = fresh context
await agent.run("What's my name?", thread_id="session-2") # Doesn't know
When to Use ACC¶
| Use ACC When | Don't Use When |
|---|---|
| Long conversations (>10 turns) | Short, one-off queries |
| Need to prevent context drift | Stateless operations |
| Bounded memory is critical | Need full conversation replay |
| Multi-turn workflows | Simple Q&A |
See acc.md for detailed ACC documentation.
| Use Semantic Cache When | Don't Use When |
|---|---|
| Similar questions rephrased | Outputs must be deterministic |
| Intent-based matching | Query structure matters |
| High query volume | Low query volume |