Cache, Episodic & ACC

Advanced memory systems: semantic caching, graph-backed episodic memory, and agentic context compression.

See Memory Overview for the full memory architecture.

Semantic Cache

SemanticCache provides embedding-based caching with a configurable similarity threshold. When a query is close enough to a cached entry, the agent returns the cached result instead of making an expensive LLM or API call.

Key Benefits:

- 80%+ hit rates: cache similar queries, not just exact matches
- 7-10× speedup: cached responses return instantly
- Cost reduction: fewer API calls mean lower costs
- Automatic eviction: LRU policy and TTL expiration
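To make the mechanism concrete, here is a minimal standalone sketch of a similarity-threshold cache with LRU eviction and TTL expiry. It is not cogent's implementation: `ToySemanticCache`, `embed`, and `cosine` are hypothetical names, the bag-of-words embedding stands in for a real embedding model, and cosine similarity is an assumption about the metric.

```python
import math
import time
from collections import OrderedDict


def embed(text):
    """Toy bag-of-words embedding (a stand-in for a real embedding model)."""
    vec = {}
    for word in text.lower().split():
        word = word.strip("?.!,")
        if word:
            vec[word] = vec.get(word, 0.0) + 1.0
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class ToySemanticCache:
    """Hypothetical cache: similarity lookup, LRU eviction, TTL expiry."""

    def __init__(self, similarity_threshold=0.85, max_entries=10000, default_ttl=86400.0):
        self.similarity_threshold = similarity_threshold
        self.max_entries = max_entries
        self.default_ttl = default_ttl
        self._entries = OrderedDict()  # query -> (vector, value, expires_at)

    def get(self, query, now=None):
        now = time.monotonic() if now is None else now
        qvec = embed(query)
        best_key, best_score = None, 0.0
        for key, (vec, _value, expires_at) in list(self._entries.items()):
            if expires_at <= now:
                del self._entries[key]  # TTL expiration
                continue
            score = cosine(qvec, vec)
            if score > best_score:
                best_key, best_score = key, score
        if best_key is None or best_score < self.similarity_threshold:
            return None  # miss: the caller falls through to the expensive call
        self._entries.move_to_end(best_key)  # mark as most recently used
        return self._entries[best_key][1]

    def put(self, query, value, now=None):
        now = time.monotonic() if now is None else now
        self._entries[query] = (embed(query), value, now + self.default_ttl)
        self._entries.move_to_end(query)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry


cache = ToySemanticCache(similarity_threshold=0.8)
cache.put("What are the best Python frameworks?", "cached answer")
print(cache.get("What are the top Python frameworks?"))  # prints: cached answer
```

The two queries share five of six words, so the toy embedding scores them at about 0.83. That clears the 0.8 threshold used here but would miss at the documented default of 0.85; a real embedding model would score paraphrases differently.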

Quick Start

Enable caching with cache=True:

from cogent import Agent

agent = Agent(
    model="gpt-5.4-mini",
    cache=True,  # Enable semantic cache with defaults
)

# First query
await agent.run("What are the best Python frameworks?")

# Similar query hits cache (instant!)
await agent.run("What are the top Python frameworks?")

Custom Configuration

Pass a SemanticCache instance for custom settings:

from cogent import Agent
from cogent.memory import SemanticCache
from cogent.models import create_embedding

# Create embedding model
embed = create_embedding("openai", "text-embedding-3-small")

agent = Agent(
    model="gpt-5.4-mini",
    cache=SemanticCache(
        embedding=embed,            # Embedding model (required for custom)
        similarity_threshold=0.90,  # Stricter matching (default: 0.85)
        max_entries=5000,           # Cache size (default: 10000)
        default_ttl=3600,           # 1 hour TTL (default: 86400)
    ),
)

Similarity Threshold:

| Threshold | Behavior | Use Case |
|-----------|----------|----------|
| 0.95-1.0 | Very strict, near-exact | Deterministic outputs |
| 0.85-0.95 | Balanced, similar intent | General purpose (default) |
| 0.70-0.85 | Loose, broad matching | Exploratory queries |
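As a rough illustration of how the threshold bands behave, the sketch below checks one pair of vectors against three thresholds. It assumes the cache compares embeddings by cosine similarity, which the text above does not state; `cosine_similarity` and `is_cache_hit` are hypothetical helpers, and the vectors are made up for the arithmetic.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_cache_hit(query_vec, cached_vec, threshold):
    """A lookup counts as a hit when similarity reaches the threshold."""
    return cosine_similarity(query_vec, cached_vec) >= threshold


cached = [1.0, 2.0, 2.0]
query = [2.0, 1.0, 2.0]  # similarity is 8/9, roughly 0.89

for threshold in (0.70, 0.85, 0.95):
    print(threshold, is_cache_hit(query, cached, threshold))
```

A pair scoring about 0.89 is served from cache under the loose and balanced settings but triggers a fresh call under the strict one.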

Tool-Level Caching

Use @tool(cache=True) to cache expensive tool calls:

from cogent import Agent, tool

@tool(cache=True)
async def search_products(query: str) -> str:
    """Search products in the catalog."""
    # product_api is a placeholder for your own async catalog client
    return await product_api.search(query)

agent = Agent(
    model="gpt-5.4-mini",
    tools=[search_products],
    cache=True,  # Required — tools use agent's cache
)

# First call executes the tool
await agent.run("Find running shoes")

# Similar query hits cache
await agent.run("Show me running sneakers")  # Cache hit!

See tool-building.md for more details.

When to Use

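| Use Semantic Cache When | Don't Use When |
|-------------------------|----------------|
| User queries with variation | Need exact-match guarantees |
| Similar questions rephrased | Outputs must be deterministic |
| Intent-based matching | Query structure matters |
| High query volume | Low query volume |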

Episodic Memory (Graph-Backed)

Episodic memory gives agents structured recall across turns and sessions. An episode captures a sequence of observations (user messages, tool results, reflections) as graph entities linked by temporal edges. After each turn an LLM extracts entities and relationships from the conversation, building a persistent semantic knowledge graph. Recall uses graph traversal (BFS + Personalized PageRank) to return structured facts — not raw conversation dumps.

Inspired by AriGraph and Graphiti/Zep.
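The episode structure described above can be pictured with a small standalone sketch: each observation becomes a node attached to its episode, and consecutive observations are joined by a temporal edge. This is only an illustration; `ToyEpisodeGraph` and the edge labels `HAS_OBSERVATION` and `NEXT` are hypothetical and do not reflect cogent's internal schema.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional


@dataclass
class Observation:
    id: str
    content: str
    kind: str


@dataclass
class Episode:
    id: str
    agent: str
    thread_id: str
    observations: List[Observation] = field(default_factory=list)
    outcome: Optional[str] = None


class ToyEpisodeGraph:
    """Hypothetical in-memory graph of episodes, observations, and edges."""

    def __init__(self):
        self.episodes = {}
        self.edges = []  # (source_id, relation, target_id)
        self._ids = count(1)

    def start_episode(self, agent, thread_id):
        ep = Episode(id=f"ep-{next(self._ids)}", agent=agent, thread_id=thread_id)
        self.episodes[ep.id] = ep
        return ep

    def add_observation(self, episode_id, content, kind="note"):
        ep = self.episodes[episode_id]
        obs = Observation(id=f"obs-{next(self._ids)}", content=content, kind=kind)
        self.edges.append((ep.id, "HAS_OBSERVATION", obs.id))
        if ep.observations:
            # Temporal edge from the previous observation to this one
            self.edges.append((ep.observations[-1].id, "NEXT", obs.id))
        ep.observations.append(obs)
        return obs

    def close_episode(self, episode_id, outcome):
        self.episodes[episode_id].outcome = outcome


graph = ToyEpisodeGraph()
ep = graph.start_episode(agent="assistant", thread_id="t-1")
first = graph.add_observation(ep.id, "User asked about Python async.", kind="user")
second = graph.add_observation(ep.id, "Explained asyncio event loop.", kind="assistant")
graph.close_episode(ep.id, outcome="success")
for edge in graph.edges:
    print(edge)
```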

Basic Usage

from cogent.graph import Graph
from cogent.memory import EpisodicMemory

graph = Graph()
episodic = EpisodicMemory(graph)  # model set by Agent automatically

# Start an episode for a conversation
ep = await episodic.start_episode(agent="assistant", thread_id="t-1")

# Record observations as the conversation progresses
await episodic.add_observation(ep.id, "User asked about Python async.", kind="user")
await episodic.add_observation(ep.id, "Explained asyncio event loop.", kind="assistant")

# Close when the interaction ends
await episodic.close_episode(ep.id, outcome="success")

Recording Turns with Entity Extraction

record_turn records a full user→assistant exchange and, when a model is available, automatically extracts entities and relationships into the semantic graph:

ep = await episodic.start_episode(agent="assistant", thread_id="t-1")

# Entities like "Python", "asyncio" and relationships like
# "Python --has_library--> asyncio" are auto-extracted
await episodic.record_turn(
    ep.id,
    user_message="How does Python's asyncio work?",
    assistant_message="asyncio provides an event loop for async I/O.",
    model=my_model,  # or set on EpisodicMemory / via Agent
)

Semantic Graph

After recording, the graph contains semantic entities and their relationships alongside the episodic observation nodes:

# See what entities were extracted
entities = await episodic.get_entities()
for e in entities:
    print(f"{e['name']} ({e['type']}) — mentioned {e['mention_count']}x")

# See all relationship facts
facts = await episodic.get_facts()
for f in facts:
    print(f"{f.subject} --[{f.relation}]--> {f.object}")

Semantic Linking (Manual)

You can also manually link observations to entities in the graph:

await graph.add_entity("python", "Language", name="Python")
obs = await episodic.add_observation(ep.id, "Discussed Python async.")
await episodic.link_entity(obs.id, "python")

Recall

Recall extracts query entities via LLM, resolves them to graph nodes, and uses BFS + PPR to find connected facts. Results are structured RecalledFact triples, not raw conversation text:

facts = await episodic.recall("async event loops", top_k=5)
for fact in facts:
    print(f"{fact.subject} {fact.relation} {fact.object}")

# Filter by agent, thread, or user
facts = await episodic.recall("async", agent="assistant", thread_id="t-1")

When no model is available or no semantic entities exist, recall falls back to keyword-based observation matching (backwards compatible).
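For readers unfamiliar with the graph step, the following standalone sketch shows one way BFS and Personalized PageRank can combine to rank facts around seed entities. It is a simplified illustration, not cogent's algorithm: the seed entities are passed in directly rather than extracted by an LLM, the graph is treated as undirected, and the function names, hop limit, restart probability `alpha`, and iteration count are all assumptions.

```python
from collections import defaultdict, deque


def build_adjacency(triples):
    """Undirected adjacency map built from (subject, relation, object) triples."""
    adj = defaultdict(set)
    for subject, _relation, obj in triples:
        adj[subject].add(obj)
        adj[obj].add(subject)
    return adj


def bfs_neighborhood(adj, seeds, max_hops=2):
    """Collect every node within max_hops of any known seed."""
    seen = {s for s in seeds if s in adj}
    frontier = deque((s, 0) for s in seen)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


def personalized_pagerank(adj, nodes, seeds, alpha=0.15, iterations=50):
    """Power iteration where the random walk restarts at the seed nodes."""
    seeds = [s for s in seeds if s in nodes]
    if not seeds:
        return {}
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            neighbors = [m for m in adj[n] if m in nodes]
            if not neighbors:
                # Node with no neighbors in the subgraph: hand its mass back to the seeds
                for s in seeds:
                    nxt[s] += (1 - alpha) * rank[n] / len(seeds)
                continue
            share = (1 - alpha) * rank[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        rank = nxt
    return rank


def recall(triples, seeds, top_k=5, max_hops=2):
    """Return the top_k triples whose endpoints rank highest around the seeds."""
    adj = build_adjacency(triples)
    nodes = bfs_neighborhood(adj, seeds, max_hops)
    rank = personalized_pagerank(adj, nodes, seeds)
    candidates = [t for t in triples if t[0] in nodes and t[2] in nodes]
    candidates.sort(key=lambda t: rank[t[0]] + rank[t[2]], reverse=True)
    return candidates[:top_k]


facts = [
    ("Python", "has_library", "asyncio"),
    ("asyncio", "provides", "event loop"),
    ("Rust", "has_library", "tokio"),
]
for subject, relation, obj in recall(facts, seeds=["asyncio"]):
    print(subject, relation, obj)
```

BFS bounds the candidate set to the neighborhood of the query entities, and the PageRank scores then order facts by how strongly they connect back to those entities. The unrelated Rust fact never enters the candidate set.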

Reflection

reflect() reads recent observations, asks an LLM to identify patterns and lessons, and stores each lesson as an Observation(kind="reflection") in the episodic graph. Reflections are then discoverable via recall() alongside regular observations.

# Reflect on a specific episode
reflections = await episodic.reflect(model, episode_id=ep.id)

# Reflect across recent episodes for an agent
reflections = await episodic.reflect(model, agent="assistant", max_observations=20)

for r in reflections:
    print(r.content)  # "User is building Python knowledge progressively"

Inspired by Generative Agents (Park et al., 2023) and CoALA's learning actions (Sumers et al., 2024).
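The reflection loop can be sketched without any real model by passing in a callable that plays the LLM's role. Everything here is hypothetical: the standalone `reflect` function, the prompt wording, and the one-lesson-per-line output format are assumptions made for illustration, not the library's behavior.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    content: str
    kind: str = "note"


def reflect(observations, llm, max_observations=20):
    """Ask `llm` for lessons about recent observations and store them as reflections."""
    recent = observations[-max_observations:]
    prompt = "Identify patterns and lessons, one per line:\n" + "\n".join(
        f"- {o.content}" for o in recent if o.kind != "reflection"
    )
    lessons = [line.strip().lstrip("-").strip() for line in llm(prompt).splitlines()]
    reflections = [Observation(content=text, kind="reflection") for text in lessons if text]
    observations.extend(reflections)  # reflections live alongside regular observations
    return reflections


def fake_llm(prompt):
    """Stand-in for a model call; returns a canned lesson."""
    return "- User is building Python knowledge progressively\n"


history = [
    Observation("User asked about Python async.", kind="user"),
    Observation("Explained asyncio event loop.", kind="assistant"),
]
for r in reflect(history, fake_llm):
    print(r.kind, r.content)
```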

Querying Episodes

# List episodes (most recent first)
episodes = await episodic.list_episodes(agent="assistant", limit=10)

# Get observations for a specific episode
observations = await episodic.get_observations(ep.id)

When to Use

| Use Episodic Memory When | Don't Use When |
|--------------------------|----------------|
| Long-running agents across sessions | Single-turn Q&A |
| Need structured recall of past interactions | Simple key-value facts (use knowledge memory) |
| Want cross-episode entity linking | Conversation history suffices |
| Multi-agent shared experience graph | Bounded context is enough (ACC) |

ACC (Agentic Context Compression)

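ACC provides bounded memory for long conversations. Instead of replaying the full history, it keeps a capped set of constraints, entities, actions, and context items, which helps prevent context drift and memory poisoning.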

Basic Usage

Enable ACC with acc=True on Agent:

from cogent import Agent

# Enable on Agent (bool or ACC instance)
agent = Agent(name="Assistant", model="gpt-5.4", acc=True)

Custom ACC Bounds

For fine-grained control, pass custom bounds directly:

from cogent import Agent
from cogent.memory.acc import AgentCognitiveCompressor

# Create ACC with custom bounds
acc = AgentCognitiveCompressor(
    max_constraints=10,  # Rules, guidelines
    max_entities=30,     # Facts, knowledge
    max_actions=20,      # Past actions
    max_context=15,      # Relevant context
)

# Pass directly to Agent
agent = Agent(name="Assistant", model="gpt-5.4", acc=acc)
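To show what a bound means in practice, here is a standalone sketch of four capped buckets that mirror the parameters above. `ToyBoundedState` is hypothetical, and dropping the oldest item when a bucket is full is an assumption made for simplicity; this page does not describe how ACC decides what to keep.

```python
from collections import deque


class ToyBoundedState:
    """Hypothetical state holder with one capped bucket per ACC bound."""

    def __init__(self, max_constraints=10, max_entities=30, max_actions=20, max_context=15):
        self.buckets = {
            "constraints": deque(maxlen=max_constraints),
            "entities": deque(maxlen=max_entities),
            "actions": deque(maxlen=max_actions),
            "context": deque(maxlen=max_context),
        }

    def add(self, bucket, item):
        self.buckets[bucket].append(item)  # a full deque drops its oldest item

    def size(self):
        return sum(len(b) for b in self.buckets.values())


state = ToyBoundedState(max_actions=2)
for step in ["searched docs", "called API", "wrote summary"]:
    state.add("actions", step)
print(list(state.buckets["actions"]))  # ['called API', 'wrote summary']
```

However long the conversation runs, the total number of retained items never exceeds the sum of the four bounds.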

Thread ID for Context Persistence

ACC requires thread_id to persist context across multiple run() calls:

# Same thread_id = context persists
await agent.run("My name is Alice", thread_id="session-1")
await agent.run("What's my name?", thread_id="session-1")  # Remembers!

# Different thread_id = fresh context
await agent.run("What's my name?", thread_id="session-2")  # Doesn't know
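The behavior above amounts to keying stored context by `thread_id`. A minimal standalone sketch, with a hypothetical `ToyThreadMemory` class standing in for whatever storage the library actually uses:

```python
from collections import defaultdict


class ToyThreadMemory:
    """Hypothetical per-thread store: each thread_id gets its own context."""

    def __init__(self):
        self._facts = defaultdict(dict)

    def remember(self, thread_id, key, value):
        self._facts[thread_id][key] = value

    def lookup(self, thread_id, key):
        # .get avoids creating an empty context for threads never written to
        return self._facts.get(thread_id, {}).get(key)


memory = ToyThreadMemory()
memory.remember("session-1", "name", "Alice")
print(memory.lookup("session-1", "name"))  # Alice
print(memory.lookup("session-2", "name"))  # None
```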

When to Use ACC

| Use ACC When | Don't Use When |
|--------------|----------------|
| Long conversations (>10 turns) | Short, one-off queries |
| Need to prevent context drift | Stateless operations |
| Bounded memory is critical | Need full conversation replay |
| Multi-turn workflows | Simple Q&A |

See acc.md for detailed ACC documentation.