Memory Module

The cogent.memory module provides a memory-first architecture where memory is a first-class citizen that can be wired to any agent.

Overview

Memory enables agents to:

  • Persist knowledge across conversations
  • Share state between agents
  • Perform semantic search over memories
  • Scope memories by user, team, or conversation
  • ACC (Agentic Context Compression): bounded context for long conversations

from cogent import Agent
from cogent.memory import Memory

# Basic in-memory storage
memory = Memory()
await memory.remember("user_preference", "dark mode")
value = await memory.recall("user_preference")

# Wire to an agent
agent = Agent(name="assistant", model=model, memory=memory)

# ACC enabled (prevents drift in long conversations)
agent = Agent(name="assistant", model=model, acc=True)

Memory Architecture

Cogent provides five distinct memory systems that work together:

| System | Parameter | Mechanism | When to Use |
| --- | --- | --- | --- |
| Conversation | conversation=True | Automatic message concatenation | Short sessions, full context needed |
| ACC | acc=True | Agentic Context Compression | Long conversations, prevent drift |
| Knowledge | memory=True or Memory(...) | Agentic tools and/or auto-retrieval | Persistent knowledge, RAG, semantic search |
| Episodic | episodic=True | Graph-backed temporal recall | Cross-session experience, patterns |
| Cache | cache=True | Semantic tool output cache | Expensive/slow tool calls |

Conversation History (Automatic)

Raw message concatenation. All previous messages in the thread are sent to the LLM automatically:

agent = Agent(name="Assistant", model="gpt4")  # conversation=True by default

await agent.run("Hi, I'm Alice", thread_id="session1")
await agent.run("What's my name?", thread_id="session1")

# Internally sends to LLM:
# [
#   {"role": "user", "content": "Hi, I'm Alice"},
#   {"role": "assistant", "content": "Hello Alice!"},
#   {"role": "user", "content": "What's my name?"}  # <-- Full history
# ]

Characteristics:

  • ✅ Automatic - No tools needed, no LLM decision required
  • ✅ Works immediately - LLM sees full context
  • ✅ Perfect recall - Nothing lost from conversation
  • ❌ Grows unbounded - Context window fills up over time
  • ❌ No semantic search - Just chronological concatenation
  • ❌ Session-bound - Lost when thread ends

When to use: Short sessions where full context fits in window.
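
To make the unbounded-growth trade-off concrete, here is a minimal, library-independent sketch of per-thread replay. `ThreadHistory` and its methods are hypothetical names for illustration, not part of cogent; only the message dict shape follows the example above.

```python
from collections import defaultdict


class ThreadHistory:
    """Toy per-thread transcript store (hypothetical, not cogent's implementation)."""

    def __init__(self) -> None:
        self._threads = defaultdict(list)  # thread_id -> list of message dicts

    def append(self, thread_id: str, role: str, content: str) -> None:
        self._threads[thread_id].append({"role": role, "content": content})

    def build_prompt(self, thread_id: str, user_message: str) -> list:
        """Return every prior message in the thread plus the new user message."""
        return [*self._threads[thread_id], {"role": "user", "content": user_message}]


history = ThreadHistory()
history.append("session1", "user", "Hi, I'm Alice")
history.append("session1", "assistant", "Hello Alice!")

# Same thread: the full transcript is replayed, so the prompt grows every turn.
prompt = history.build_prompt("session1", "What's my name?")
# Different thread: nothing carries over (session-bound).
other = history.build_prompt("session2", "What's my name?")
```

Because each call replays the whole list, prompt size grows linearly with the number of turns in a thread, which is the limit that ACC addresses.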

ACC (Agentic Context Compression)

Compresses growing conversation history into structured constraints and entities:

agent = Agent(name="Assistant", model="gpt4", acc=True)

# After many messages, ACC compresses into:
# Constraints: ["User prefers dark mode", "User timezone is EST", "Project deadline: March 1"]
# Entities: ["Alice (user)", "Project Alpha (active)", "Bob (team lead)"]
# Only compressed context sent to LLM, not full 50-message history

Characteristics:

  • ✅ Bounded context - Prevents window overflow
  • ✅ Automatic - No LLM tool calls needed
  • ✅ Prevents drift - Maintains key facts across long sessions
  • ✅ Structured - Constraints + Entities format
  • ❌ Lossy - Some details discarded during compression

When to use: Long conversations that exceed context window.
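
The idea of a bounded, structured state can be sketched without the library. The class below is a hypothetical illustration: `BoundedContext`, `max_items`, and the oldest-first eviction rule are all assumptions, and cogent's actual compression logic is not described on this page. Only the constraints/entities split and the `# Memory Context` heading come from this document.

```python
from collections import OrderedDict


class BoundedContext:
    """Toy bounded state holding at most `max_items` constraints and entities each."""

    def __init__(self, max_items: int = 3) -> None:
        self.max_items = max_items
        self.constraints = OrderedDict()
        self.entities = OrderedDict()

    def _add(self, bucket: OrderedDict, item: str) -> None:
        bucket.pop(item, None)  # re-adding an item refreshes its position
        bucket[item] = None
        while len(bucket) > self.max_items:
            bucket.popitem(last=False)  # drop the oldest item (the lossy part)

    def add_constraint(self, text: str) -> None:
        self._add(self.constraints, text)

    def add_entity(self, text: str) -> None:
        self._add(self.entities, text)

    def render(self) -> str:
        """Render the state as the body of one system message."""
        lines = ["# Memory Context", "Constraints:"]
        lines += [f"- {c}" for c in self.constraints]
        lines.append("Entities:")
        lines += [f"- {e}" for e in self.entities]
        return "\n".join(lines)


state = BoundedContext(max_items=3)
for fact in ["prefers dark mode", "timezone is EST", "deadline: March 1", "uses Python 3.12"]:
    state.add_constraint(fact)
state.add_entity("Alice (user)")
```

However many facts are added, the rendered message never holds more than `max_items` entries per section, at the cost of discarding older details.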

Knowledge Memory (Agentic or Non-Agentic)

Memory supports two access patterns for persistent knowledge, controlled by the tools and retriever parameters on Memory:

Agentic (default) — LLM decides when to use memory

agent = Agent(name="Assistant", model="gpt4", memory=True)

# Agent gets memory tools automatically
# LLM decides when to use them:

await agent.run("Remember that I prefer dark mode")
# LLM calls: remember(key="user_preference", value="dark mode")

await agent.run("What's my UI preference?")
# LLM calls: recall(query="user preference")
# Returns: "dark mode"

Non-agentic — automatic retrieval into context

from cogent.memory import Memory
from cogent.retrieval import DenseRetriever
from cogent.vectorstore import VectorStore

vs = VectorStore()
retriever = DenseRetriever(vs, score_threshold=0.7)

# Non-agentic only — no tools, auto-retrieval
agent = Agent(
    name="Assistant",
    model="gpt4",
    memory=Memory(retriever=retriever, tools=False),
)

# At each turn, the retriever runs against the user message
# and injects results as a SystemMessage ("# Relevant Knowledge")
await agent.run("Tell me about Python async")

Both — auto-retrieval plus tools

# Retriever injects relevant knowledge AND agent can use tools
agent = Agent(
    name="Assistant",
    model="gpt4",
    memory=Memory(retriever=retriever),  # tools=True by default
)

| retriever= | tools= | Behavior |
| --- | --- | --- |
| None | True (default) | Agentic only (memory=True, backward compatible) |
| set | False | Non-agentic only: auto-inject, no tools |
| set | True | Both: auto-inject + tools |

Characteristics (agentic):

  • ✅ Semantic search - Finds relevant memories by meaning
  • ✅ Persistent - Survives across sessions/threads
  • ✅ Selective - LLM stores only important info
  • ❌ Requires LLM decision - LLM must choose to call tools
  • ❌ Tool call overhead - Adds latency when used

Characteristics (non-agentic):

  • ✅ Automatic - Retrieved and injected every turn, no LLM decision
  • ✅ Backend-agnostic - Works with any BaseRetriever (vector, graph, hybrid)
  • ✅ Configurable - top_k, min_score, reranking on the retriever
  • ❌ Always runs - Retrieval cost every turn even if not needed

When to use: Agentic for selective recall. Non-agentic for knowledge bases, documents, and RAG-style workflows where context should always be available.
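
The retriever/tools combinations described in this section reduce to a small decision function. This is a sketch for reasoning about configuration, not cogent code: `memory_mode` is a hypothetical helper, and the fourth combination (no retriever, tools disabled) is not covered by this page, so its label here is an assumption.

```python
def memory_mode(retriever_set: bool, tools: bool = True) -> str:
    """Map the two Memory settings to the access pattern they select."""
    if retriever_set and tools:
        return "both"          # auto-inject + tools
    if retriever_set:
        return "non-agentic"   # auto-inject, no tools
    if tools:
        return "agentic"       # tools only; equivalent to memory=True
    return "disabled"          # assumption: not documented on this page


modes = {
    "memory=True": memory_mode(retriever_set=False),
    "Memory(retriever=r, tools=False)": memory_mode(retriever_set=True, tools=False),
    "Memory(retriever=r)": memory_mode(retriever_set=True),
}
```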

Episodic Memory (Cross-Session Recall)

Graph-backed temporal memory that automatically records each conversation turn, extracts entities and relationships via LLM, and recalls structured facts from the semantic graph:

from cogent.memory import EpisodicMemory

agent = Agent(name="Assistant", model="gpt4", episodic=True)

# Session 1 — turns are recorded, entities are extracted into a graph
await agent.run("How does asyncio work?", thread_id="session-1", user_id="alice")
await agent.run("What about gather()?", thread_id="session-1", user_id="alice")

# Session 2 — structured facts are recalled via graph traversal
await agent.run("Does FastAPI use asyncio?", thread_id="session-2", user_id="alice")
# The agent sees a "Past Experiences" system message with
# entity-relationship facts from the semantic graph

User Identity and Recall Scope

Pass user_id to run() to identify who is using the agent. One user can have multiple threads. Episodic memory uses user_id to scope what past experiences are recalled:

# Default: scope="user" — each user sees only their own history
agent = Agent(name="Assistant", model="gpt4", episodic=True)

await agent.run("Rome tips?", thread_id="alice-1", user_id="alice")
await agent.run("French cuisine?", thread_id="bob-1", user_id="bob")

# Alice only recalls her own Rome history, not Bob's cuisine thread
await agent.run("Visit Naples?", thread_id="alice-2", user_id="alice")

The scope parameter on EpisodicMemory controls the recall boundary:

| Scope | Recalls | Use case |
| --- | --- | --- |
| "user" (default) | This user's episodes across all agents | Multi-user agents, shared episodic stores, privacy isolation |
| "agent" | All episodes from this agent regardless of user | Single-user, team/shared memory, knowledge transfer |
| "global" | All episodes across all agents and users | Cross-agent cross-user learning |

# Shared team memory: all users benefit from each other's episodes
team_agent = Agent(
    name="TeamAssistant",
    model="gpt4",
    episodic=EpisodicMemory(scope="agent"),
)
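
The three scopes amount to different filters over the same episode store. The sketch below shows that filtering on plain data; `Episode` and `visible_episodes` are hypothetical names, and the real store is graph-backed rather than a flat list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    agent: str
    user_id: str
    text: str


def visible_episodes(episodes, scope: str, agent: str, user_id: str) -> list:
    """Return the episodes a recall may draw from under the given scope."""
    if scope == "user":
        # This user's episodes across all agents.
        return [e for e in episodes if e.user_id == user_id]
    if scope == "agent":
        # Everything this agent recorded, regardless of user.
        return [e for e in episodes if e.agent == agent]
    if scope == "global":
        return list(episodes)
    raise ValueError(f"unknown scope: {scope!r}")


store = [
    Episode("Assistant", "alice", "Rome tips"),
    Episode("Assistant", "bob", "French cuisine"),
    Episode("Coder", "alice", "asyncio questions"),
]

alice_view = visible_episodes(store, "user", agent="Assistant", user_id="alice")
team_view = visible_episodes(store, "agent", agent="Assistant", user_id="alice")
```
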

user_id is also set on the RunContext metadata, making it accessible to tools and interceptors via ctx.metadata["user_id"].

Reflection (Episode → Lessons)

Pass reflection=True to EpisodicMemory to synthesize observations into higher-level lessons after each turn. Lessons are stored as kind="reflection" observations in the episodic graph and written to semantic memory (when available):

agent = Agent(
    name="Assistant",
    model="gpt4",
    # reflection=True reflects after every turn;
    # pass an int instead (e.g. reflection=3) to reflect every 3 turns
    episodic=EpisodicMemory(reflection=True),
)

Reflections answer "what patterns emerge?" rather than "what happened?" — closing the loop between raw experience and reusable knowledge.
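
The reflection cadence (True for every turn, an int N for every N turns) can be expressed as a small predicate. This is a sketch under assumptions: `should_reflect` is a hypothetical helper, and counting turns from 1 is a guess about how the cadence is measured.

```python
from typing import Union


def should_reflect(turn: int, reflection: Union[bool, int]) -> bool:
    """Decide whether to run reflection after a 1-indexed `turn`."""
    if isinstance(reflection, bool):  # check bool first: bool is a subclass of int
        return reflection
    if reflection <= 0:
        raise ValueError("reflection must be True, False, or a positive int")
    return turn % reflection == 0


# With reflection=3, reflection runs after turns 3, 6, and 9.
cadence = [t for t in range(1, 10) if should_reflect(t, 3)]
```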

Pass a shared EpisodicMemory instance to let multiple agents share the same experience graph:

episodic = EpisodicMemory()

coder = Agent(name="Coder", model="gpt4", episodic=episodic)
reviewer = Agent(name="Reviewer", model="gpt4", episodic=episodic)

Characteristics:

  • Automatic: Records every turn and recalls relevant experiences without LLM decisions
  • Cross-session: Observations from any prior thread are searchable
  • User-scoped: user_id isolates recall boundaries; configurable via scope
  • Graph-backed: Uses Personalized PageRank for relevance ranking
  • Shareable: Multiple agents can share one episodic store
  • Reflective: Optional LLM synthesis distills lessons from raw observations
  • Keyword-seeded: Recall quality depends on term overlap with past content

When to use: Multi-session agents, learning from past interactions, agents that should improve over time.
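
For readers unfamiliar with Personalized PageRank, the following is a generic pure-Python sketch of the algorithm on a tiny entity graph. It is not cogent's implementation: the graph, the function name, and the choice of seeding with a single query keyword are assumptions made for illustration.

```python
def personalized_pagerank(graph: dict, seeds: set, damping: float = 0.85,
                          iterations: int = 50) -> dict:
    """Power iteration where random restarts always land on the seed nodes."""
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for node in nodes:
            neighbors = graph[node]
            if neighbors:
                share = damping * rank[node] / len(neighbors)
                for nb in neighbors:
                    nxt[nb] += share
            else:
                # Dangling node: hand its mass back to the seeds.
                for n in nodes:
                    nxt[n] += damping * rank[node] * restart[n]
        rank = nxt
    return rank


# Hypothetical entity graph built from two unrelated sessions.
graph = {
    "asyncio": ["gather", "FastAPI"],
    "gather": ["asyncio"],
    "FastAPI": ["asyncio", "Starlette"],
    "Starlette": ["FastAPI"],
    "Rome": ["Naples"],
    "Naples": ["Rome"],
}

# Seed with the entity that matches a keyword in the query.
scores = personalized_pagerank(graph, seeds={"asyncio"})
ranked = sorted(scores, key=scores.get, reverse=True)
```

Nodes unreachable from the seed receive no score, which illustrates why recall depends on the query sharing terms with past content: without a matching seed there is nothing to rank from.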

Semantic Cache (Tool Outputs)

Caches tool results by semantic similarity to avoid redundant calls:

agent = Agent(name="Assistant", model="gpt4", cache=True)

# First call
await agent.run("Search for Python tutorials")  
# Calls search_tool(), caches result

# Similar query (different wording)
await agent.run("Find Python learning resources")  
# Cache hit! Returns previous result without calling search_tool()

Characteristics:

  • ✅ Speeds up repeated queries - Avoids slow/expensive tool calls
  • ✅ Semantic matching - Recognizes similar queries
  • ✅ Transparent - LLM doesn't know cache is used
  • ❌ Can return stale data - Cached results may be outdated
  • ❌ Storage overhead - Caches all tool outputs

When to use: Expensive API calls, slow database queries, rate-limited services.
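
Similarity-based lookup can be sketched as a cosine comparison against stored query embeddings. Everything below is illustrative: `SemanticCache`, the 0.9 threshold, and the hand-written three-dimensional vectors are assumptions, and a real deployment would use an embedding model instead of a lookup table.

```python
import math
from typing import Callable, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Toy cache keyed by embedding similarity instead of exact string match."""

    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.9):
        self.embed = embed
        self.threshold = threshold
        self._entries = []  # list of (embedding, cached result)

    def put(self, query: str, result: str) -> None:
        self._entries.append((self.embed(query), result))

    def get(self, query: str) -> Optional[str]:
        vec = self.embed(query)
        best_score, best_result = 0.0, None
        for stored_vec, result in self._entries:
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best_result = score, result
        return best_result if best_score >= self.threshold else None


# Hand-written stand-in for an embedding model (hypothetical values).
TOY_VECTORS = {
    "Search for Python tutorials": [0.9, 0.1, 0.0],
    "Find Python learning resources": [0.85, 0.2, 0.0],
    "Weather in Rome": [0.0, 0.1, 0.95],
}

cache = SemanticCache(embed=TOY_VECTORS.__getitem__)
cache.put("Search for Python tutorials", "tutorial results")

hit = cache.get("Find Python learning resources")
miss = cache.get("Weather in Rome")
```

The threshold is the main tuning knob: set it too low and unrelated queries receive stale results, too high and paraphrases miss the cache.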


Automatic vs Explicit

The key distinction is who decides to use memory:

# Conversation history (AUTOMATIC)
agent = Agent(name="Assistant", model="gpt4")  # conversation=True default
await agent.run("I'm Alice", thread_id="s1")
await agent.run("My name?", thread_id="s1")
# ✅ Works! History automatically sent to LLM
# No tool calls, no LLM decision needed

# Knowledge memory tools (EXPLICIT - LLM decides)
agent = Agent(name="Assistant", model="gpt4", memory=True)
await agent.run("Remember I'm Alice")
# ⚠️ LLM may or may not call remember() - it decides

await agent.run("My name?")
# ⚠️ LLM may or may not call recall() - it decides
# If LLM doesn't call the tool, memory isn't used!

# Non-agentic retrieval (AUTOMATIC)
agent = Agent(
    name="Assistant", model="gpt4",
    memory=Memory(retriever=retriever, tools=False),
)
await agent.run("Tell me about X")
# ✅ Relevant knowledge auto-injected as SystemMessage
# No LLM decision needed — retriever runs every turn

Recommendation: Use conversation memory for short-term context, agentic knowledge memory for selective recall, and non-agentic knowledge memory for knowledge bases and RAG workflows.


How Memory Injects Into Context

Each memory system reaches the LLM through a different mechanism. Understanding the injection pattern matters for debugging, prompt engineering, and deciding which systems to combine.

Injection Patterns

| System | Injection | Who Decides | Timing |
| --- | --- | --- | --- |
| Conversation | Messages replayed into history | Automatic | Before first LLM call |
| ACC | Single SystemMessage (# Memory Context) | Automatic | Before first LLM call (replaces transcript) |
| Knowledge (agentic) | ToolMessage results from recall()/search_memories() | LLM decides | During agentic loop |
| Knowledge (non-agentic) | Single SystemMessage (# Relevant Knowledge) | Automatic | Before first LLM call |
| Episodic | Single SystemMessage (# Past Experiences) | Automatic | Before first LLM call |
| Cache | Cached ToolMessage (substituted transparently) | Automatic | During tool execution |

What the LLM Sees

When multiple layers are active, the message array sent to the LLM follows this order:

┌─────────────────────────────────────────────────────┐
│ SystemMessage   — agent instructions                │  always
├─────────────────────────────────────────────────────┤
│ SystemMessage   — # Memory Context                  │  ACC (if enabled)
│                   compressed constraints + entities  │
├─────────────────────────────────────────────────────┤
│ SystemMessage   — # Relevant Knowledge              │  Knowledge non-agentic (if retriever set)
│                   auto-retrieved documents           │
├─────────────────────────────────────────────────────┤
│ SystemMessage   — # Past Experiences                │  Episodic (if enabled)
│                   recalled observations from graph  │
├─────────────────────────────────────────────────────┤
│ HumanMessage    — prior user messages               │  Conversation only (skipped when ACC active)
│ AIMessage       — prior assistant replies            │
│ ...             — full transcript                    │
├─────────────────────────────────────────────────────┤
│ HumanMessage    — current user task                 │  always
├─────────────────────────────────────────────────────┤
│ AIMessage       — tool calls (if any)               │  agentic loop
│ ToolMessage     — tool results / recall / cache hit │
│ AIMessage       — final response                    │
└─────────────────────────────────────────────────────┘
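
The ordering in the diagram can be restated as a small assembly function. This is a sketch of the documented order only: `build_messages` and the `(role, content)` tuples are hypothetical stand-ins for the framework's message classes, and the part of the agentic loop that appends tool calls afterwards is omitted.

```python
def build_messages(instructions, task, *, acc_context=None, knowledge=None,
                   episodic=None, transcript=()):
    """Assemble the pre-loop message list in the documented order."""
    messages = [("system", instructions)]
    if acc_context is not None:
        messages.append(("system", "# Memory Context\n" + acc_context))
    if knowledge is not None:
        messages.append(("system", "# Relevant Knowledge\n" + knowledge))
    if episodic is not None:
        messages.append(("system", "# Past Experiences\n" + episodic))
    if acc_context is None:
        # The raw transcript is replayed only when ACC is not active.
        messages.extend(transcript)
    messages.append(("human", task))
    return messages


transcript = [("human", "Hi, I'm Alice"), ("ai", "Hello Alice!")]

without_acc = build_messages("Be helpful.", "What's my name?", transcript=transcript)
with_acc = build_messages(
    "Be helpful.",
    "What's my name?",
    acc_context="Entities: Alice (user)",
    episodic="Alice asked about asyncio",
    transcript=transcript,
)
```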

Key Interactions

Conversation + ACC: Conversation always records the raw transcript. When ACC is enabled, the LLM sees a compressed SystemMessage instead of replayed history messages. Conversation still stores every exchange — ACC just controls what the model sees.

Knowledge memory has two modes: With tools=True (agentic), the LLM must decide to call recall() or search_memories() — nothing is injected automatically, so the agent can miss facts if it does not think to look. With retriever= (non-agentic), relevant knowledge is auto-retrieved and injected as a SystemMessage before the LLM runs. Both modes can be active simultaneously.

Episodic is automatic: Past experiences matching the current query are retrieved and injected before the LLM runs. The agent never calls a tool — it simply receives relevant observations as background context.

Cache is invisible: The LLM never knows caching happened. When a tool call matches a cached entry, the cached result is returned as the tool's output. The agent sees the same ToolMessage it would have seen without cache — just faster.

Context Budget and Composition

When multiple memory systems are active, each one injects content into the context window independently. There is no shared budget or coordination between them. Understanding what ACC compresses — and what it does not — is critical for avoiding context overflow in production.

What ACC Compresses

ACC replaces the conversation transcript only. After a turn completes, update_from_turn() receives exactly two inputs:

  • user_message — the current user task
  • assistant_message — the final model response

It does not receive or compress:

  • The # Relevant Knowledge SystemMessage (non-agentic retrieval)
  • The # Past Experiences SystemMessage (episodic recall)
  • Tool call results from the agentic loop (including cached results)

This means ACC bounds the conversation history portion of the context, but episodic and knowledge injections are re-fetched fresh every turn and added on top.

Independent Injection Pipelines

Each system is a separate pipe into the context window:

Context Window
├── Instructions (fixed)
├── # Memory Context          ← ACC-bounded (constraints + entities)
├── # Relevant Knowledge      ← unbounded (retriever controls volume)
├── # Past Experiences        ← unbounded (top_k observations)
├── Current task (fixed)
└── Agentic loop (tool calls + cache hits)

ACC keeps its own section bounded, but cannot shrink what the other systems inject. If the retriever returns 20 paragraphs or episodic returns 50 observations, that context is added every turn regardless of ACC.

Practical Implications

| Scenario | Risk | Mitigation |
| --- | --- | --- |
| ACC + large knowledge base | Retriever results fill window despite ACC bounds | Set top_k and score_threshold on the retriever |
| ACC + many episodic sessions | Observations accumulate across sessions | Episodic top_k=5 (default) limits per-turn injection |
| ACC + non-agentic + episodic | Three SystemMessages compete for space | Monitor total injected tokens via observer events |
| All systems active | No single system sees the total budget | Use observer to track combined context size |

Recommendations

  • Tune retriever limits. Set top_k and score_threshold on DenseRetriever to cap how much knowledge is injected per turn. You can also pass retrieval_k on Agent to override the retriever's default at the agent level.
  • Monitor with observer. Enable memory_events=True to see how much each system injects. Watch [memory-retrieved] result counts and [episodic-recalled] observation counts.
  • Cap episodic recall. The default top_k=5 is conservative. Increase only when the agent genuinely benefits from more cross-session context.
  • Set a memory budget. Use memory_budget on Agent to cap each injected SystemMessage to a maximum character count. Content exceeding the limit is truncated with a [truncated] marker.
  • Agentic knowledge is self-regulating. When tools=True, the LLM decides whether to call recall() or search_memories(). Only results the LLM requests enter the context. Non-agentic mode injects every turn.
  • Cache does not affect context size. It substitutes tool results transparently — the injected ToolMessage is the same size whether cached or fresh.
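
The character cap described for memory_budget can be pictured as a simple truncation step applied to each injected message. The helper below is a hypothetical sketch: `apply_budget` is not a cogent function, and whether the real marker counts toward the budget is not stated here, so appending it after the cut is an assumption.

```python
from typing import Optional

TRUNCATION_MARKER = "[truncated]"


def apply_budget(content: str, budget: Optional[int]) -> str:
    """Cap `content` at `budget` characters; None means unlimited."""
    if budget is None or len(content) <= budget:
        return content
    # Assumption: the marker is appended after cutting to the budget.
    return content[:budget] + TRUNCATION_MARKER


knowledge = "# Relevant Knowledge\n" + "paragraph " * 2000
capped = apply_budget(knowledge, 8000)
```

Because the cap applies per message, three active systems with a budget of 8000 can still inject roughly 24000 characters in total.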

Agent Parameters for Context Control

from cogent import Agent
from cogent.memory import EpisodicMemory, Memory

# `retriever` is assumed to be configured as in the non-agentic example above
agent = Agent(
    name="Assistant",
    model="gpt-5.4",
    acc=True,
    episodic=EpisodicMemory(top_k=3, scope="user"),
    memory=Memory(retriever=retriever, tools=False),
    # Context budget controls
    memory_budget=8000,     # Max chars per injected SystemMessage
    retrieval_k=2,          # Retrieve 2 documents instead of retriever default
)

Agent-level parameters:

| Parameter | Default | Controls |
| --- | --- | --- |
| memory_budget | None (unlimited) | Max chars per injected SystemMessage (ACC, episodic, knowledge) |
| retrieval_k | None (retriever default) | Number of retriever results per turn |

EpisodicMemory parameters:

| Parameter | Default | Controls |
| --- | --- | --- |
| top_k | 5 | Number of facts recalled per turn |
| scope | "user" | Recall boundary: "user", "agent", or "global" |
| reflection | False | Reflection cadence: True (every turn) or int N (every N turns) |
| model | None | Chat model for entity extraction (set automatically by Agent) |

Observability

The memory stack emits lifecycle events so memory behavior is not a black box. Enable with Observer(memory_events=True) or level="debug". The agent's observer auto-propagates to Memory — no explicit Memory(observer=...) wiring is needed.

Each memory system gets a distinct label prefix in the console so you can tell at a glance which system emitted an event:

| System | Console Label | Events | What You See |
| --- | --- | --- | --- |
| Conversation | [conversation-*] | loaded, saved | Thread history resume/save, message counts |
| ACC | [acc-*] | loaded, saved, context, updated | Bounded state size, item breakdown |
| Knowledge (agentic) | [tool-*] | called, result | Agent decisions to remember/recall/search |
| Knowledge (non-agentic) | [memory-*] | retrieved | Retriever name, query, scores, duration |
| Episodic | [episodic-*] | recalled, recorded, reflected | Observation counts, source episodes, query, lesson count |
| Cache | [cache-*] | hit, miss, write | Tool name, cache key, similarity score |

Notes:

  • Agentic knowledge memory uses [tool-*] events from the ToolFormatter; non-agentic uses [memory-retrieved].
  • [tool-*] events remain the canonical signal that an agent selected and used a tool.
  • [cache-*] is emitted during cached tool invocation paths when tool-level caching is enabled.
  • Low-level storage events (memory.read, memory.write) are emitted internally but suppressed from console output because tool events already surface them.


Further Reading

  • Usage & API — Core classes, storage backends, tools, key search, patterns
  • Cache, Episodic & ACC — Semantic cache, graph-backed episodic memory, context compression