Memory Module¶
The cogent.memory module provides a memory-first architecture where memory is a first-class citizen that can be wired to any agent.
Overview¶
Memory enables agents to:
- Persist knowledge across conversations
- Share state between agents
- Perform semantic search over memories
- Scope memories by user, team, or conversation
- ACC (Agentic Context Compression): bounded context for long conversations
from cogent import Agent
from cogent.memory import Memory
# Basic in-memory storage
memory = Memory()
await memory.remember("user_preference", "dark mode")
value = await memory.recall("user_preference")
# Wire to an agent
agent = Agent(name="assistant", model="gpt4", memory=memory)
# ACC enabled (prevents drift in long conversations)
agent = Agent(name="assistant", model="gpt4", acc=True)
Memory Architecture¶
Cogent provides five distinct memory systems that work together:
| System | Parameter | Mechanism | When to Use |
|---|---|---|---|
| Conversation | conversation=True | Automatic message concatenation | Short sessions, full context needed |
| ACC | acc=True | Agentic Context Compression | Long conversations, prevent drift |
| Knowledge | memory=True or Memory(...) | Agentic tools and/or auto-retrieval | Persistent knowledge, RAG, semantic search |
| Episodic | episodic=True | Graph-backed temporal recall | Cross-session experience, patterns |
| Cache | cache=True | Semantic tool output cache | Expensive/slow tool calls |
Conversation History (Automatic)¶
Raw message concatenation - all previous messages automatically sent to LLM:
agent = Agent(name="Assistant", model="gpt4") # conversation=True by default
await agent.run("Hi, I'm Alice", thread_id="session1")
await agent.run("What's my name?", thread_id="session1")
# Internally sends to LLM:
# [
# {"role": "user", "content": "Hi, I'm Alice"},
# {"role": "assistant", "content": "Hello Alice!"},
# {"role": "user", "content": "What's my name?"} # <-- Full history
# ]
Characteristics:
- ✅ Automatic - No tools needed, no LLM decision required
- ✅ Works immediately - LLM sees full context
- ✅ Perfect recall - Nothing lost from conversation
- ❌ Grows unbounded - Context window fills up over time
- ❌ No semantic search - Just chronological concatenation
- ❌ Session-bound - Lost when thread ends
When to use: Short sessions where full context fits in window.
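The replay mechanism can be sketched in plain Python. This is an illustration only, not Cogent's implementation: the `ThreadHistory` class and its method names are hypothetical. It shows why the request grows with every turn on the same thread.

```python
from collections import defaultdict


class ThreadHistory:
    """Toy per-thread transcript store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._threads = defaultdict(list)

    def build_request(self, thread_id: str, user_text: str) -> list:
        """Append the new user message and return the full thread transcript."""
        self._threads[thread_id].append({"role": "user", "content": user_text})
        return list(self._threads[thread_id])

    def record_reply(self, thread_id: str, reply: str) -> None:
        """Store the assistant reply so the next request replays it."""
        self._threads[thread_id].append({"role": "assistant", "content": reply})


history = ThreadHistory()
history.build_request("session1", "Hi, I'm Alice")
history.record_reply("session1", "Hello Alice!")
request = history.build_request("session1", "What's my name?")
print(len(request))  # 3 messages: the whole thread is replayed
```

A different `thread_id` starts from an empty list, which mirrors the session-bound limitation listed above.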
ACC (Agentic Context Compression)¶
Compresses growing conversation history into structured constraints and entities:
agent = Agent(name="Assistant", model="gpt4", acc=True)
# After many messages, ACC compresses into:
# Constraints: ["User prefers dark mode", "User timezone is EST", "Project deadline: March 1"]
# Entities: ["Alice (user)", "Project Alpha (active)", "Bob (team lead)"]
# Only compressed context sent to LLM, not full 50-message history
Characteristics:
- ✅ Bounded context - Prevents window overflow
- ✅ Automatic - No LLM tool calls needed
- ✅ Prevents drift - Maintains key facts across long sessions
- ✅ Structured - Constraints + Entities format
- ❌ Lossy - Some details discarded during compression
When to use: Long conversations that exceed context window.
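The bounding property, and why it is lossy, can be illustrated with a small sketch. It does not reproduce how ACC chooses what to keep: the `BoundedState` class, its `max_items` cap, and the oldest-first eviction rule are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class BoundedState:
    """Toy stand-in for a compressed context with a hard item cap (hypothetical)."""

    max_items: int = 3
    constraints: list = field(default_factory=list)
    entities: list = field(default_factory=list)

    def _add(self, bucket: list, item: str) -> None:
        if item in bucket:
            return  # Skip duplicates
        bucket.append(item)
        # Assumption: evict oldest items first once the cap is exceeded
        del bucket[: max(0, len(bucket) - self.max_items)]

    def add_constraint(self, text: str) -> None:
        self._add(self.constraints, text)

    def add_entity(self, text: str) -> None:
        self._add(self.entities, text)

    def render(self) -> str:
        """Render the state as the body of a single system message."""
        lines = ["# Memory Context", "Constraints:"]
        lines += [f"- {c}" for c in self.constraints]
        lines += ["Entities:"] + [f"- {e}" for e in self.entities]
        return "\n".join(lines)


state = BoundedState(max_items=2)
for fact in ["User prefers dark mode", "User timezone is EST", "Project deadline: March 1"]:
    state.add_constraint(fact)
state.add_entity("Alice (user)")
print(state.render())
```

However many turns are processed, the rendered block never holds more than `max_items` entries per section; the first constraint has already been dropped here.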
Knowledge Memory (Agentic or Non-Agentic)¶
Memory supports two access patterns for persistent knowledge, controlled by
the tools and retriever parameters on Memory:
Agentic (default) — LLM decides when to use memory¶
agent = Agent(name="Assistant", model="gpt4", memory=True)
# Agent gets memory tools automatically
# LLM decides when to use them:
await agent.run("Remember that I prefer dark mode")
# LLM calls: remember(key="user_preference", value="dark mode")
await agent.run("What's my UI preference?")
# LLM calls: recall(query="user preference")
# Returns: "dark mode"
Non-agentic — automatic retrieval into context¶
from cogent.memory import Memory
from cogent.retrieval import DenseRetriever
from cogent.vectorstore import VectorStore
vs = VectorStore()
retriever = DenseRetriever(vs, score_threshold=0.7)
# Non-agentic only — no tools, auto-retrieval
agent = Agent(
name="Assistant",
model="gpt4",
memory=Memory(retriever=retriever, tools=False),
)
# At each turn, the retriever runs against the user message
# and injects results as a SystemMessage ("# Relevant Knowledge")
await agent.run("Tell me about Python async")
Both — auto-retrieval plus tools¶
# Retriever injects relevant knowledge AND agent can use tools
agent = Agent(
name="Assistant",
model="gpt4",
memory=Memory(retriever=retriever), # tools=True by default
)
| retriever= | tools= | Behavior |
|---|---|---|
| None | True (default) | Agentic only (backward compatible with memory=True) |
| set | False | Non-agentic only: auto-inject, no tools |
| set | True | Both: auto-inject + tools |
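The table can be read as a small decision function. The sketch below is a hypothetical helper, not part of Cogent; in particular, the `"disabled"` result for `retriever=None, tools=False` is an assumption, since the table does not cover that combination.

```python
from typing import Optional


def memory_mode(retriever: Optional[object], tools: bool = True) -> str:
    """Map the retriever/tools combination to the access pattern it enables."""
    if retriever is not None and tools:
        return "both"
    if retriever is not None:
        return "non-agentic"
    if tools:
        return "agentic"
    return "disabled"  # Assumption: combination not listed in the table


print(memory_mode(None))                 # agentic
print(memory_mode(object(), False))      # non-agentic
print(memory_mode(object()))             # both
```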
Characteristics (agentic):
- ✅ Semantic search - Finds relevant memories by meaning
- ✅ Persistent - Survives across sessions/threads
- ✅ Selective - LLM stores only important info
- ❌ Requires LLM decision - LLM must choose to call tools
- ❌ Tool call overhead - Adds latency when used
Characteristics (non-agentic):
- ✅ Automatic - Retrieved and injected every turn, no LLM decision
- ✅ Backend-agnostic - Works with any BaseRetriever (vector, graph, hybrid)
- ✅ Configurable - top_k, min_score, reranking on the retriever
- ❌ Always runs - Retrieval cost every turn even if not needed
When to use: Agentic for selective recall. Non-agentic for knowledge bases, documents, and RAG-style workflows where context should always be available.
Episodic Memory (Cross-Session Recall)¶
Graph-backed temporal memory that automatically records each conversation turn, extracts entities and relationships via LLM, and recalls structured facts from the semantic graph:
from cogent.memory import EpisodicMemory
agent = Agent(name="Assistant", model="gpt4", episodic=True)
# Session 1 — turns are recorded, entities are extracted into a graph
await agent.run("How does asyncio work?", thread_id="session-1", user_id="alice")
await agent.run("What about gather()?", thread_id="session-1", user_id="alice")
# Session 2 — structured facts are recalled via graph traversal
await agent.run("Does FastAPI use asyncio?", thread_id="session-2", user_id="alice")
# The agent sees a "Past Experiences" system message with
# entity-relationship facts from the semantic graph
User Identity and Recall Scope¶
Pass user_id to run() to identify who is using the agent. One user can have multiple threads. Episodic memory uses user_id to scope what past experiences are recalled:
# Default: scope="user" — each user sees only their own history
agent = Agent(name="Assistant", model="gpt4", episodic=True)
await agent.run("Rome tips?", thread_id="alice-1", user_id="alice")
await agent.run("French cuisine?", thread_id="bob-1", user_id="bob")
# Alice only recalls her own Rome history, not Bob's cuisine thread
await agent.run("Visit Naples?", thread_id="alice-2", user_id="alice")
The scope parameter on EpisodicMemory controls the recall boundary:
| Scope | Recalls | Use case |
|---|---|---|
| "user" (default) | This user's episodes across all agents | Multi-user agents, shared episodic stores, privacy isolation |
| "agent" | All episodes from this agent regardless of user | Single-user, team/shared memory, knowledge transfer |
| "global" | All episodes across all agents and users | Cross-agent cross-user learning |
# Shared team memory — all users benefit from each other's episodes
team_agent = Agent(
name="TeamAssistant",
model="gpt4",
episodic=EpisodicMemory(scope="agent"),
)
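The three recall boundaries amount to different filters over the same episode store. The following sketch illustrates that idea under assumed names: the `Episode` record and `visible_episodes` function are hypothetical and say nothing about how Cogent stores or queries its graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """Hypothetical episode record: who produced it and for whom."""

    agent: str
    user_id: str
    text: str


def visible_episodes(episodes, scope: str, agent: str, user_id: str) -> list:
    """Return the episodes a recall with the given scope may see."""
    if scope == "user":
        # This user's episodes, across all agents
        return [e for e in episodes if e.user_id == user_id]
    if scope == "agent":
        # Everything this agent recorded, regardless of user
        return [e for e in episodes if e.agent == agent]
    if scope == "global":
        return list(episodes)
    raise ValueError(f"unknown scope: {scope!r}")


store = [
    Episode("Assistant", "alice", "Rome tips"),
    Episode("Assistant", "bob", "French cuisine"),
    Episode("Planner", "alice", "Naples itinerary"),
]
alice_view = visible_episodes(store, "user", agent="Assistant", user_id="alice")
print([e.text for e in alice_view])  # ['Rome tips', 'Naples itinerary']
```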
user_id is also set on the RunContext metadata, making it accessible to tools and interceptors via ctx.metadata["user_id"].
Reflection (Episode → Lessons)¶
Pass reflection=True to EpisodicMemory to synthesize observations into higher-level lessons after each turn. Lessons are stored as kind="reflection" observations in the episodic graph and written to semantic memory (when available):
agent = Agent(
name="Assistant",
model="gpt4",
episodic=EpisodicMemory(reflection=True),
# Use EpisodicMemory(reflection=3) instead to reflect every 3 turns
)
Reflections answer "what patterns emerge?" rather than "what happened?" — closing the loop between raw experience and reusable knowledge.
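The two accepted forms of `reflection` (a boolean or an integer cadence) can be illustrated with a short helper. This is a sketch, not Cogent's code: the function name is hypothetical, and it assumes turns are counted from 1 and that an integer `N` triggers on turns `N`, `2N`, and so on.

```python
from typing import Union


def should_reflect(turn: int, reflection: Union[bool, int]) -> bool:
    """Decide whether to reflect after the given 1-based turn."""
    if isinstance(reflection, bool):  # bool is a subclass of int, so test it first
        return reflection
    if reflection < 1:
        raise ValueError("reflection must be True, False, or a positive int")
    return turn % reflection == 0


print([t for t in range(1, 8) if should_reflect(t, 3)])  # [3, 6]
```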
Pass a shared EpisodicMemory instance to let multiple agents share the same experience graph:
episodic = EpisodicMemory()
coder = Agent(name="Coder", model="gpt4", episodic=episodic)
reviewer = Agent(name="Reviewer", model="gpt4", episodic=episodic)
Characteristics:
- ✅ Automatic — Records every turn and recalls relevant experiences without LLM decisions
- ✅ Cross-session — Observations from any prior thread are searchable
- ✅ User-scoped - user_id isolates recall boundaries; configurable via scope
- ✅ Graph-backed - Uses Personalized PageRank for relevance ranking
- ✅ Shareable — Multiple agents can share one episodic store
- ✅ Reflective — Optional LLM synthesis distills lessons from raw observations
- ❌ Keyword-seeded — Recall quality depends on term overlap with past content
When to use: Multi-session agents, learning from past interactions, agents that should improve over time.
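For readers unfamiliar with Personalized PageRank, the sketch below shows the general technique on a tiny entity graph using power iteration. It is a textbook-style illustration, not Cogent's implementation: the graph contents, the damping factor, and the choice of seed nodes (standing in for query keywords) are assumptions, and every seed and neighbor is expected to be a key of the graph.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Rank nodes by random-walk proximity to the seed nodes."""
    nodes = list(graph)
    # Restart distribution: all restart mass goes to the seeds
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for node, neighbors in graph.items():
            if neighbors:
                share = damping * rank[node] / len(neighbors)
                for nb in neighbors:
                    nxt[nb] += share
            else:
                # Dangling node: send its mass back to the seeds
                for n in nodes:
                    nxt[n] += damping * rank[node] * restart[n]
        rank = nxt
    return rank


graph = {
    "asyncio": ["gather", "fastapi"],
    "gather": ["asyncio"],
    "fastapi": ["asyncio"],
    "rome": ["naples"],
    "naples": ["rome"],
}
scores = personalized_pagerank(graph, seeds={"asyncio"})
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:3])
```

Nodes unreachable from the seeds ("rome", "naples") receive a score of zero, which is consistent with the keyword-seeded limitation above: if the query shares no terms with past content, there is nothing to seed the walk.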
Semantic Cache (Tool Outputs)¶
Caches tool results by semantic similarity to avoid redundant calls:
agent = Agent(name="Assistant", model="gpt4", cache=True)
# First call
await agent.run("Search for Python tutorials")
# Calls search_tool(), caches result
# Similar query (different wording)
await agent.run("Find Python learning resources")
# Cache hit! Returns previous result without calling search_tool()
Characteristics:
- ✅ Speeds up repeated queries - Avoids slow/expensive tool calls
- ✅ Semantic matching - Recognizes similar queries
- ✅ Transparent - LLM doesn't know cache is used
- ❌ Can return stale data - Cached results may be outdated
- ❌ Storage overhead - Caches all tool outputs
When to use: Expensive API calls, slow database queries, rate-limited services.
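The idea of matching by similarity rather than exact key can be sketched as follows. This toy uses bag-of-words cosine similarity as a stand-in for a real embedding model, so it only matches queries that share words; the `ToySemanticCache` class, its `threshold` default, and the `embed` function are hypothetical and unrelated to Cogent's cache internals.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: word counts (a real cache would use a vector model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToySemanticCache:
    """Return a stored result when a new query is similar enough to an old one."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self._entries = []  # list of (vector, result) pairs

    def put(self, query: str, result: str) -> None:
        self._entries.append((embed(query), result))

    def get(self, query: str):
        vec = embed(query)
        best = max(self._entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]  # Cache hit
        return None  # Cache miss


cache = ToySemanticCache()
cache.put("python tutorials for beginners", "result-1")
hit = cache.get("beginners python tutorials")
miss = cache.get("weather in paris")
```

The threshold is the main trade-off: a lower value produces more hits but raises the risk of returning a stale or mismatched result.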
Automatic vs Explicit¶
The key distinction is who decides to use memory:
# Conversation history (AUTOMATIC)
agent = Agent(name="Assistant", model="gpt4") # conversation=True default
await agent.run("I'm Alice", thread_id="s1")
await agent.run("My name?", thread_id="s1")
# ✅ Works! History automatically sent to LLM
# No tool calls, no LLM decision needed
# Knowledge memory tools (EXPLICIT - LLM decides)
agent = Agent(name="Assistant", model="gpt4", memory=True)
await agent.run("Remember I'm Alice")
# ⚠️ LLM may or may not call remember() - it decides
await agent.run("My name?")
# ⚠️ LLM may or may not call recall() - it decides
# If LLM doesn't call the tool, memory isn't used!
# Non-agentic retrieval (AUTOMATIC)
agent = Agent(
name="Assistant", model="gpt4",
memory=Memory(retriever=retriever, tools=False),
)
await agent.run("Tell me about X")
# ✅ Relevant knowledge auto-injected as SystemMessage
# No LLM decision needed — retriever runs every turn
Recommendation: Use conversation memory for short-term context, agentic knowledge memory for selective recall, non-agentic knowledge memory for knowledge bases and RAG workflows.
How Memory Injects Into Context¶
Each memory system reaches the LLM through a different mechanism. Understanding the injection pattern matters for debugging, prompt engineering, and deciding which systems to combine.
Injection Patterns¶
| System | Injection | Who Decides | Timing |
|---|---|---|---|
| Conversation | Messages replayed into history | Automatic | Before first LLM call |
| ACC | Single SystemMessage (# Memory Context) | Automatic | Before first LLM call (replaces transcript) |
| Knowledge (agentic) | ToolMessage results from recall()/search() | LLM decides | During agentic loop |
| Knowledge (non-agentic) | Single SystemMessage (# Relevant Knowledge) | Automatic | Before first LLM call |
| Episodic | Single SystemMessage (# Past Experiences) | Automatic | Before first LLM call |
| Cache | Cached ToolMessage (substituted transparently) | Automatic | During tool execution |
What the LLM Sees¶
When multiple layers are active, the message array sent to the LLM follows this order:
┌─────────────────────────────────────────────────────┐
│ SystemMessage — agent instructions │ always
├─────────────────────────────────────────────────────┤
│ SystemMessage — # Memory Context │ ACC (if enabled)
│ compressed constraints + entities │
├─────────────────────────────────────────────────────┤
│ SystemMessage — # Relevant Knowledge │ Knowledge non-agentic (if retriever set)
│ auto-retrieved documents │
├─────────────────────────────────────────────────────┤
│ SystemMessage — # Past Experiences │ Episodic (if enabled)
│ recalled observations from graph │
├─────────────────────────────────────────────────────┤
│ HumanMessage — prior user messages │ Conversation only (skipped when ACC active)
│ AIMessage — prior assistant replies │
│ ... — full transcript │
├─────────────────────────────────────────────────────┤
│ HumanMessage — current user task │ always
├─────────────────────────────────────────────────────┤
│ AIMessage — tool calls (if any) │ agentic loop
│ ToolMessage — tool results / recall / cache hit │
│ AIMessage — final response │
└─────────────────────────────────────────────────────┘
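The ordering in the diagram can be expressed as a small assembly function. This is a sketch for reasoning about the layout, not Cogent's internal code: `assemble_context` and the `(role, content)` tuple representation are assumptions made for the example.

```python
from typing import Optional, Sequence


def assemble_context(
    instructions: str,
    task: str,
    *,
    acc: Optional[str] = None,
    knowledge: Optional[str] = None,
    episodic: Optional[str] = None,
    transcript: Sequence = (),
) -> list:
    """Build the pre-loop message list as (role, content) tuples."""
    messages = [("system", instructions)]
    for header, body in (
        ("# Memory Context", acc),
        ("# Relevant Knowledge", knowledge),
        ("# Past Experiences", episodic),
    ):
        if body is not None:
            messages.append(("system", f"{header}\n{body}"))
    if acc is None:
        # The raw transcript is replayed only when ACC is not active
        messages.extend(transcript)
    messages.append(("user", task))
    return messages


context = assemble_context(
    "You are helpful.",
    "Visit Naples?",
    acc="Constraints: user likes Italy",
    episodic="Alice asked about Rome",
    transcript=[("user", "Rome tips?"), ("assistant", "See the Forum.")],
)
print([role for role, _ in context])  # ['system', 'system', 'system', 'user']
```

Tool calls and tool results are appended after this list during the agentic loop, so they are not part of the initial assembly.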
Key Interactions¶
Conversation + ACC:
Conversation always records the raw transcript. When ACC is enabled, the LLM sees
a compressed SystemMessage instead of replayed history messages. Conversation still
stores every exchange — ACC just controls what the model sees.
Knowledge memory has two modes:
With tools=True (agentic), the LLM must decide to call recall() or
search_memories() — nothing is injected automatically, so the agent can
miss facts if it does not think to look. With retriever= (non-agentic),
relevant knowledge is auto-retrieved and injected as a SystemMessage
before the LLM runs. Both modes can be active simultaneously.
Episodic is automatic: Past experiences matching the current query are retrieved and injected before the LLM runs. The agent never calls a tool — it simply receives relevant observations as background context.
Cache is invisible:
The LLM never knows caching happened. When a tool call matches a cached
entry, the cached result is returned as the tool's output. The agent sees
the same ToolMessage it would have seen without cache — just faster.
Context Budget and Composition¶
When multiple memory systems are active, each one injects content into the context window independently. There is no shared budget or coordination between them. Understanding what ACC compresses — and what it does not — is critical for avoiding context overflow in production.
What ACC Compresses¶
ACC replaces the conversation transcript only. After a turn completes,
update_from_turn() receives exactly two inputs:
- user_message: the current user task
- assistant_message: the final model response
It does not receive or compress:
- The # Relevant Knowledge SystemMessage (non-agentic retrieval)
- The # Past Experiences SystemMessage (episodic recall)
- Tool call results from the agentic loop (including cached results)
This means ACC bounds the conversation history portion of the context, but episodic and knowledge injections are re-fetched fresh every turn and added on top.
Independent Injection Pipelines¶
Each system is a separate pipe into the context window:
Context Window
├── Instructions (fixed)
├── # Memory Context ← ACC-bounded (constraints + entities)
├── # Relevant Knowledge ← unbounded (retriever controls volume)
├── # Past Experiences ← unbounded (top_k observations)
├── Current task (fixed)
└── Agentic loop (tool calls + cache hits)
ACC keeps its own section bounded, but cannot shrink what the other systems inject. If the retriever returns 20 paragraphs or episodic returns 50 observations, that context is added every turn regardless of ACC.
Practical Implications¶
| Scenario | Risk | Mitigation |
|---|---|---|
| ACC + large knowledge base | Retriever results fill window despite ACC bounds | Set top_k and score_threshold on the retriever |
| ACC + many episodic sessions | Observations accumulate across sessions | Episodic top_k=5 (default) limits per-turn injection |
| ACC + non-agentic + episodic | Three SystemMessages compete for space | Monitor total injected tokens via observer events |
| All systems active | No single system sees the total budget | Use observer to track combined context size |
Recommendations¶
- Tune retriever limits. Set top_k and score_threshold on DenseRetriever to cap how much knowledge is injected per turn. You can also pass retrieval_k on Agent to override the retriever's default at the agent level.
- Monitor with observer. Enable memory_events=True to see how much each system injects. Watch [memory-retrieved] result counts and [episodic-recalled] observation counts.
- Cap episodic recall. The default top_k=5 is conservative. Increase only when the agent genuinely benefits from more cross-session context.
- Set a memory budget. Use memory_budget on Agent to cap each injected SystemMessage to a maximum character count. Content exceeding the limit is truncated with a [truncated] marker.
- Agentic knowledge is self-regulating. When tools=True, the LLM decides whether to call recall() or search_memories(). Only results the LLM requests enter the context. Non-agentic mode injects every turn.
- Cache does not affect context size. It substitutes tool results transparently: the injected ToolMessage is the same size whether cached or fresh.
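To get a feel for how a per-message character cap interacts with several injected sections, consider the sketch below. It is not Cogent's truncation code: the `apply_budget` helper is hypothetical, and it assumes the marker is appended after cutting the content to the budget, which the text above does not specify.

```python
from typing import Optional

MARKER = "[truncated]"


def apply_budget(content: str, budget: Optional[int]) -> str:
    """Cap one injected section; None means unlimited."""
    if budget is None or len(content) <= budget:
        return content
    # Assumption: cut to the budget, then append the marker
    return content[:budget] + MARKER


# Placeholder section bodies of different sizes
sections = {
    "# Memory Context": "x" * 40,
    "# Relevant Knowledge": "y" * 500,
    "# Past Experiences": "z" * 120,
}
capped = {name: apply_budget(text, 100) for name, text in sections.items()}
total_chars = sum(len(text) for text in capped.values())
print(total_chars)
```

Because the cap applies to each section separately, the combined size still scales with the number of active systems, which is why tracking the total through observer events remains useful.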
Agent Parameters for Context Control¶
from cogent import Agent
from cogent.memory import EpisodicMemory, Memory
agent = Agent(
name="Assistant",
model="gpt-5.4",
acc=True,
episodic=EpisodicMemory(top_k=3, scope="user"),
memory=Memory(retriever=retriever, tools=False),
# Context budget controls
memory_budget=8000, # Max chars per injected SystemMessage
retrieval_k=2, # Retrieve 2 documents instead of retriever default
)
Agent-level parameters:
| Parameter | Default | Controls |
|---|---|---|
| memory_budget | None (unlimited) | Max chars per injected SystemMessage (ACC, episodic, knowledge) |
| retrieval_k | None (retriever default) | Number of retriever results per turn |
EpisodicMemory parameters:
| Parameter | Default | Controls |
|---|---|---|
| top_k | 5 | Number of facts recalled per turn |
| scope | "user" | Recall boundary: "user", "agent", or "global" |
| reflection | False | Reflection cadence: True (every turn) or int N (every N turns) |
| model | None | Chat model for entity extraction (set automatically by Agent) |
Observability¶
The memory stack emits lifecycle events so memory behavior is not a black box.
Enable with Observer(memory_events=True) or level="debug". The agent's observer
auto-propagates to Memory — no explicit Memory(observer=...) wiring is needed.
Each memory system gets a distinct label prefix in the console so you can tell at a glance which system emitted an event:
| System | Console Label | Events | What You See |
|---|---|---|---|
| Conversation | [conversation-*] | loaded, saved | Thread history resume/save, message counts |
| ACC | [acc-*] | loaded, saved, context, updated | Bounded state size, item breakdown |
| Knowledge (agentic) | [tool-*] | called, result | Agent decisions to remember/recall/search |
| Knowledge (non-agentic) | [memory-*] | retrieved | Retriever name, query, scores, duration |
| Episodic | [episodic-*] | recalled, recorded, reflected | Observation counts, source episodes, query, lesson count |
| Cache | [cache-*] | hit, miss, write | Tool name, cache key, similarity score |
Notes:
- Agentic knowledge memory uses [tool-*] events from the ToolFormatter; non-agentic uses [memory-retrieved].
- [tool-*] events remain the canonical signal that an agent selected and used a tool.
- [cache-*] is emitted during cached tool invocation paths when tool-level caching is enabled.
- Low-level storage events (memory.read, memory.write) are emitted internally but suppressed from console output because tool events already surface them.
Further Reading¶
- Usage & API — Core classes, storage backends, tools, key search, patterns
- Cache, Episodic & ACC — Semantic cache, graph-backed episodic memory, context compression