Cache, Episodic & ACC¶

Advanced memory systems: semantic caching, graph-backed episodic memory, and agentic context compression.

See Memory Overview for the full memory architecture.

Semantic Cache¶

SemanticCache provides embedding-based caching with a configurable similarity threshold. When a query is "close enough" to a cached entry, the cache returns the stored result instead of making an expensive LLM or API call.

Key Benefits:

- 80%+ hit rates: cache similar queries, not just exact matches
- 7-10× speedup: cached responses return instantly
- Cost reduction: fewer API calls = lower costs
- Automatic eviction: LRU policy and TTL expiration
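
To make the mechanism concrete, here is a minimal, self-contained sketch of a similarity-threshold cache with LRU eviction and TTL expiry. It is not SemanticCache's implementation: `ToySemanticCache`, `toy_embed`, and the bag-of-words "embedding" are illustrative assumptions, and a real embedding model would score paraphrases far more reliably than word overlap does.

```python
import math
import time
from collections import Counter, OrderedDict


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (not a real model)."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToySemanticCache:
    """Hypothetical cache: nearest-entry lookup, LRU eviction, TTL expiry."""

    def __init__(self, threshold=0.85, max_entries=2, ttl=3600.0):
        self.threshold, self.max_entries, self.ttl = threshold, max_entries, ttl
        self._entries = OrderedDict()  # query -> (vector, value, expires_at)

    def get(self, query, now=None):
        now = time.monotonic() if now is None else now
        vec = toy_embed(query)
        best_key, best_sim = None, 0.0
        for key, (cached_vec, _, expires_at) in list(self._entries.items()):
            if expires_at <= now:
                del self._entries[key]  # TTL expiration
                continue
            sim = cosine(vec, cached_vec)
            if sim > best_sim:
                best_key, best_sim = key, sim
        if best_key is None or best_sim < self.threshold:
            return None  # miss: caller should run the expensive call
        self._entries.move_to_end(best_key)  # mark as recently used
        return self._entries[best_key][1]

    def put(self, query, value, now=None):
        now = time.monotonic() if now is None else now
        self._entries[query] = (toy_embed(query), value, now + self.ttl)
        self._entries.move_to_end(query)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used


cache = ToySemanticCache(threshold=0.8)
cache.put("What are the best Python frameworks?", "Django, FastAPI, Flask", now=0.0)
hit = cache.get("What are the top Python frameworks?", now=1.0)
miss = cache.get("How do I bake bread?", now=1.0)
expired = cache.get("What are the best Python frameworks?", now=4000.0)
```

The toy threshold is set to 0.8 because the two framework questions share five of six words, which gives a bag-of-words cosine of about 0.83. Timestamps are passed explicitly so the TTL behaviour is easy to follow.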

Quick Start¶

Enable caching with cache=True:

from cogent import Agent

agent = Agent(
    model="gpt-5.4-mini",
    cache=True,  # Enable semantic cache with defaults
)

# First query
await agent.run("What are the best Python frameworks?")

# Similar query hits cache (instant!)
await agent.run("What are the top Python frameworks?")

Custom Configuration¶

Pass a SemanticCache instance for custom settings:

from cogent import Agent
from cogent.memory import SemanticCache
from cogent.models import create_embedding

# Create embedding model
embed = create_embedding("openai", "text-embedding-3-small")

agent = Agent(
    model="gpt-5.4-mini",
    cache=SemanticCache(
        embedding=embed,            # Embedding model (required for custom)
        similarity_threshold=0.90,  # Stricter matching (default: 0.85)
        max_entries=5000,           # Cache size (default: 10000)
        default_ttl=3600,           # 1 hour TTL (default: 86400)
    ),
)

Similarity Threshold:

Threshold	Behavior	Use Case
0.95-1.0	Very strict, near-exact	Deterministic outputs
0.85-0.95	Balanced, similar intent	General purpose (default)
0.70-0.85	Loose, broad matching	Exploratory queries
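
The effect of the threshold can be illustrated with hand-picked 2-D vectors whose cosine similarities are easy to verify. The vectors and labels below are made up for illustration and do not come from a real embedding model; only the three threshold values are taken from the table above.

```python
import math


def cosine(a, b):
    """Cosine similarity of two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


query = (3.0, 4.0)
candidates = {
    "near-duplicate": (4.0, 3.0),   # cosine 0.96 with the query
    "related": (0.0, 1.0),          # cosine 0.80
    "loosely related": (1.0, 0.0),  # cosine 0.60
}
thresholds = {"strict": 0.95, "balanced": 0.85, "loose": 0.70}

# For each threshold, list the cached entries that would count as a hit.
hits = {
    name: [label for label, vec in candidates.items() if cosine(query, vec) >= t]
    for name, t in thresholds.items()
}
```

With these vectors, only the near-duplicate clears the strict and balanced thresholds, while the loose threshold also accepts the entry at 0.80.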

Tool-Level Caching¶

Use @tool(cache=True) to cache expensive tool calls:

from cogent import Agent, tool

@tool(cache=True)
async def search_products(query: str) -> str:
    """Search products in the catalog."""
    # product_api is a placeholder for your own catalog client (not defined here)
    return await product_api.search(query)

agent = Agent(
    model="gpt-5.4-mini",
    tools=[search_products],
    cache=True,  # Required — tools use agent's cache
)

# First call executes the tool
await agent.run("Find running shoes")

# Similar query hits cache
await agent.run("Show me running sneakers")  # Cache hit!

See tool-building.md for more details.
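
As a rough mental model of a tool consulting a shared cache before it runs, the sketch below memoizes an async tool on a normalized query string. `cached_tool` and `shared_store` are hypothetical names, and the exact-match key is a deliberate simplification: the library's tool cache matches on embedding similarity, which this sketch does not attempt.

```python
import asyncio
import functools


def cached_tool(store: dict):
    """Hypothetical decorator: memoize an async tool on its normalized query."""

    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(query: str) -> str:
            # Normalize case and whitespace so trivially different queries share a key.
            key = (fn.__name__, " ".join(query.lower().split()))
            if key not in store:
                store[key] = await fn(query)  # only executed on a miss
            return store[key]

        return wrapper

    return decorator


calls = []          # records every real execution of the tool
shared_store = {}   # stands in for the agent-level cache


@cached_tool(shared_store)
async def search_products(query: str) -> str:
    calls.append(query)
    return f"results for {query!r}"


async def main():
    first = await search_products("Running shoes")
    second = await search_products("  running   SHOES ")
    return first, second


first, second = asyncio.run(main())
```

The second call is served from `shared_store`, so `calls` records a single execution.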

When to Use¶

Use Semantic Cache When	Don't Use When
User queries with variation	Need exact-match guarantees

Episodic Memory (Graph-Backed)¶

Episodic memory gives agents structured recall across turns and sessions. An episode captures a sequence of observations (user messages, tool results, reflections) as graph entities linked by temporal edges. After each turn, an LLM extracts entities and relationships from the conversation, building a persistent semantic knowledge graph. Recall uses graph traversal (breadth-first search plus Personalized PageRank, abbreviated BFS + PPR) to return structured facts rather than raw conversation dumps.

Inspired by AriGraph and Graphiti/Zep.
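
The episode structure described above can be pictured as a small graph: one node per observation, an edge from the episode to each observation, and a temporal edge between consecutive observations. The sketch below uses plain dataclasses. `ToyEpisode`, `Obs`, and the edge labels `has_observation` and `next` are assumptions for illustration, not the library's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Obs:
    id: str
    content: str
    kind: str


@dataclass
class ToyEpisode:
    """Hypothetical episode: observations chained by temporal edges."""

    id: str
    observations: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (source_id, relation, target_id)

    def add(self, content: str, kind: str) -> Obs:
        obs = Obs(f"{self.id}:obs{len(self.observations)}", content, kind)
        self.edges.append((self.id, "has_observation", obs.id))
        if self.observations:
            # Temporal edge from the previous observation to this one.
            self.edges.append((self.observations[-1].id, "next", obs.id))
        self.observations.append(obs)
        return obs


ep = ToyEpisode("ep1")
ep.add("User asked about Python async.", kind="user")
ep.add("Explained asyncio event loop.", kind="assistant")
```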

Basic Usage¶

from cogent.graph import Graph
from cogent.memory import EpisodicMemory

graph = Graph()
episodic = EpisodicMemory(graph)  # model set by Agent automatically

# Start an episode for a conversation
ep = await episodic.start_episode(agent="assistant", thread_id="t-1")

# Record observations as the conversation progresses
await episodic.add_observation(ep.id, "User asked about Python async.", kind="user")
await episodic.add_observation(ep.id, "Explained asyncio event loop.", kind="assistant")

# Close when the interaction ends
await episodic.close_episode(ep.id, outcome="success")

Recording Turns with Entity Extraction¶

record_turn records a full user→assistant exchange and, when a model is available, automatically extracts entities and relationships into the semantic graph:

ep = await episodic.start_episode(agent="assistant", thread_id="t-1")

# Entities like "Python", "asyncio" and relationships like
# "Python --has_library--> asyncio" are auto-extracted
await episodic.record_turn(
    ep.id,
    user_message="How does Python's asyncio work?",
    assistant_message="asyncio provides an event loop for async I/O.",
    model=my_model,  # or set on EpisodicMemory / via Agent
)

Semantic Graph¶

After recording, the graph contains semantic entities and their relationships alongside the episodic observation nodes:

# See what entities were extracted
entities = await episodic.get_entities()
for e in entities:
    print(f"{e['name']} ({e['type']}) — mentioned {e['mention_count']}x")

# See all relationship facts
facts = await episodic.get_facts()
for f in facts:
    print(f"{f.subject} --[{f.relation}]--> {f.object}")
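
One way to picture how extracted triples become entities with mention counts and de-duplicated facts is the sketch below. The LLM extraction step is replaced by a hard-coded stub, and `ToySemanticGraph`, `fake_extract`, and the lowercase-key normalization are assumptions rather than the library's internals.

```python
from collections import defaultdict


def fake_extract(user_message: str, assistant_message: str):
    """Stand-in for the LLM extraction step; returns hard-coded triples."""
    return [
        ("Python", "has_library", "asyncio"),
        ("asyncio", "provides", "event loop"),
    ]


class ToySemanticGraph:
    """Hypothetical store for entities (with mention counts) and facts."""

    def __init__(self):
        self.mentions = defaultdict(int)  # entity key -> mention count
        self.names = {}                   # entity key -> display name
        self.facts = set()                # (subject_key, relation, object_key)

    def ingest(self, triples):
        for subj, rel, obj in triples:
            for name in (subj, obj):
                key = name.lower()  # naive entity resolution by lowercase name
                self.names.setdefault(key, name)
                self.mentions[key] += 1
            self.facts.add((subj.lower(), rel, obj.lower()))


graph = ToySemanticGraph()
graph.ingest(
    fake_extract(
        "How does Python's asyncio work?",
        "asyncio provides an event loop for async I/O.",
    )
)
graph.ingest([("asyncio", "part_of", "Python")])  # triples from a later turn
```

Because entities are keyed by a normalized name, the second turn raises the mention counts of the existing "Python" and "asyncio" nodes instead of creating duplicates.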

Semantic Linking (Manual)¶

You can also manually link observations to entities in the graph:

await graph.add_entity("python", "Language", name="Python")
obs = await episodic.add_observation(ep.id, "Discussed Python async.")
await episodic.link_entity(obs.id, "python")

Recall¶

Recall extracts query entities via LLM, resolves them to graph nodes, and uses BFS + PPR to find connected facts. Results are structured RecalledFact triples, not raw conversation text:

facts = await episodic.recall("async event loops", top_k=5)
for fact in facts:
    print(f"{fact.subject} {fact.relation} {fact.object}")

# Filter by agent, thread, or user
facts = await episodic.recall("async", agent="assistant", thread_id="t-1")

When no model is available or no semantic entities exist, recall falls back to keyword-based observation matching (backwards compatible).
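
The graph-based recall path can be sketched in three steps: collect a neighbourhood with breadth-first search from the query entities, score its nodes with Personalized PageRank restarted at those entities, then rank facts by the scores of their endpoints. Everything below is an illustrative assumption, including the fact list, the depth limit of 2, the restart probability of 0.15, and the endpoint-sum ranking. The library's actual parameters and scoring are not documented here.

```python
from collections import deque

FACTS = [
    ("python", "has_library", "asyncio"),
    ("asyncio", "provides", "event loop"),
    ("event loop", "schedules", "coroutines"),
    ("python", "has_library", "numpy"),
    ("numpy", "provides", "arrays"),
]


def neighbors(facts):
    """Undirected adjacency built from (subject, relation, object) triples."""
    adj = {}
    for s, _, o in facts:
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
    return adj


def bfs(adj, seeds, max_depth=2):
    """Nodes reachable from the seeds within max_depth hops."""
    depth = {s: 0 for s in seeds if s in adj}
    queue = deque(depth)
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue
        for nxt in adj[node]:
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return set(depth)


def personalized_pagerank(adj, nodes, seeds, alpha=0.15, iters=50):
    """Power iteration with restart probability alpha at the seed nodes."""
    seeds = [s for s in seeds if s in nodes]
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            outs = [m for m in adj[n] if m in nodes]
            for m in outs:
                nxt[m] += (1 - alpha) * score[n] / len(outs)
        score = nxt
    return score


def recall(query_entities, top_k=3):
    adj = neighbors(FACTS)
    nodes = bfs(adj, query_entities)
    score = personalized_pagerank(adj, nodes, query_entities)
    candidates = [f for f in FACTS if f[0] in nodes and f[2] in nodes]
    candidates.sort(key=lambda f: score[f[0]] + score[f[2]], reverse=True)
    return candidates[:top_k]


top = recall(["asyncio"], top_k=2)
```

For the seed "asyncio", the two facts that touch it directly rank highest, and the fact about "arrays" is never considered because it lies three hops away.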

Reflection¶

reflect() reads recent observations, asks an LLM to identify patterns and lessons, and stores each lesson as an Observation(kind="reflection") in the episodic graph. Reflections are then discoverable via recall() alongside regular observations.

# Reflect on a specific episode
reflections = await episodic.reflect(model, episode_id=ep.id)

# Reflect across recent episodes for an agent
reflections = await episodic.reflect(model, agent="assistant", max_observations=20)

for r in reflections:
    print(r.content)  # "User is building Python knowledge progressively"

Inspired by Generative Agents (Park et al., 2023) and CoALA's learning actions (Sumers et al., 2024).
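
The reflect loop can be sketched as: take the most recent observations, ask a model for lessons, and store each lesson as a new observation of kind "reflection". In the sketch below the model is a stub, and the prompt wording, the one-lesson-per-line output format, and the choice to exclude earlier reflections from the input are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    content: str
    kind: str = "note"


def fake_llm(prompt: str) -> str:
    """Stand-in for a model call; returns one lesson per line."""
    return (
        "User is building Python knowledge progressively\n"
        "User prefers short answers"
    )


def reflect(store, llm, max_observations=20):
    """Summarize recent observations into stored reflection observations."""
    recent = [o for o in store if o.kind != "reflection"][-max_observations:]
    prompt = "Identify patterns and lessons:\n" + "\n".join(o.content for o in recent)
    lessons = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    reflections = [Observation(content=text, kind="reflection") for text in lessons]
    store.extend(reflections)  # reflections live next to regular observations
    return reflections


store = [
    Observation("User asked about Python async.", kind="user"),
    Observation("Explained asyncio event loop.", kind="assistant"),
]
reflections = reflect(store, fake_llm)
```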

Querying Episodes¶

# List episodes (most recent first)
episodes = await episodic.list_episodes(agent="assistant", limit=10)

# Get observations for a specific episode
observations = await episodic.get_observations(ep.id)

When to Use¶

Use Episodic Memory When	Don't Use When
Long-running agents across sessions	Single-turn Q&A
Need structured recall of past interactions	Simple key-value facts (use knowledge memory)
Want cross-episode entity linking	Conversation history suffices
Multi-agent shared experience graph	Bounded context is enough (ACC)

ACC (Agentic Context Compression)¶

ACC provides bounded memory for long conversations, preventing context drift and memory poisoning.

Basic Usage¶

Enable ACC with acc=True on Agent:

from cogent import Agent

# Enable on Agent (bool or ACC instance)
agent = Agent(name="Assistant", model="gpt-5.4", acc=True)

Custom ACC Bounds¶

For fine-grained control, pass custom bounds directly:

from cogent import Agent
from cogent.memory.acc import AgentCognitiveCompressor

# Create ACC with custom bounds
acc = AgentCognitiveCompressor(
    max_constraints=10,  # Rules, guidelines
    max_entities=30,     # Facts, knowledge
    max_actions=20,      # Past actions
    max_context=15,      # Relevant context
)

# Pass directly to Agent
agent = Agent(name="Assistant", model="gpt-5.4", acc=acc)
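
To show what a per-category bound means in practice, the sketch below caps each category with a fixed-size deque, so adding an item beyond the cap drops the oldest one. `ToyBoundedState` is hypothetical, and oldest-first eviction is a simplification: how ACC actually decides what to keep or compress is not described here.

```python
from collections import deque


class ToyBoundedState:
    """Hypothetical bounded memory: each category has a hard size cap."""

    def __init__(self, max_constraints=10, max_entities=30, max_actions=20, max_context=15):
        self.slots = {
            "constraints": deque(maxlen=max_constraints),
            "entities": deque(maxlen=max_entities),
            "actions": deque(maxlen=max_actions),
            "context": deque(maxlen=max_context),
        }

    def remember(self, slot: str, item: str) -> None:
        items = self.slots[slot]
        if item in items:
            items.remove(item)  # re-adding an item refreshes its position
        items.append(item)      # a full deque silently drops its oldest item

    def size(self) -> int:
        return sum(len(items) for items in self.slots.values())


state = ToyBoundedState(max_actions=2)
for action in ["searched docs", "ran tests", "opened PR"]:
    state.remember("actions", action)
kept = list(state.slots["actions"])
```

However long the conversation runs, `size()` can never exceed the sum of the four caps, which is the property that keeps the prompt bounded.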

Thread ID for Context Persistence¶

ACC requires thread_id to persist context across multiple run() calls:

# Same thread_id = context persists
await agent.run("My name is Alice", thread_id="session-1")
await agent.run("What's my name?", thread_id="session-1")  # Remembers!

# Different thread_id = fresh context
await agent.run("What's my name?", thread_id="session-2")  # Doesn't know
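
Conceptually, thread_id acts as a key into per-conversation state. The sketch below shows that idea with a plain dictionary. `ToyThreadStore` is a hypothetical stand-in and says nothing about how or where the library persists ACC state.

```python
class ToyThreadStore:
    """Hypothetical per-thread state, keyed by thread_id."""

    def __init__(self):
        self._threads = {}

    def state_for(self, thread_id: str) -> dict:
        # The same thread_id returns the same dict; a new id starts empty.
        return self._threads.setdefault(thread_id, {})


store = ToyThreadStore()
store.state_for("session-1")["name"] = "Alice"
same_thread = store.state_for("session-1").get("name")
other_thread = store.state_for("session-2").get("name")
```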

When to Use ACC¶

Use ACC When	Don't Use When
Long conversations (>10 turns)	Short, one-off queries
Need to prevent context drift	Stateless operations
Bounded memory is critical	Need full conversation replay
Multi-turn workflows	Simple Q&A