Memory Module¶

The cogent.memory module provides a memory-first architecture where memory is a first-class citizen that can be wired to any agent.

Overview¶

Memory enables agents to:

- Persist knowledge across conversations
- Share state between agents
- Perform semantic search over memories
- Scope memories by user, team, or conversation
- Keep context bounded in long conversations with ACC (Agentic Context Compression)

from cogent import Agent
from cogent.memory import Memory

# Basic in-memory storage
memory = Memory()
await memory.remember("user_preference", "dark mode")
value = await memory.recall("user_preference")

# Wire to an agent
agent = Agent(name="assistant", model="gpt4", memory=memory)

# ACC enabled (prevents drift in long conversations)
agent = Agent(name="assistant", model="gpt4", acc=True)

Memory Architecture¶

Cogent provides five distinct memory systems that work together:

System	Parameter	Mechanism	When to Use
Conversation	`conversation=True`	Automatic message concatenation	Short sessions, full context needed
ACC	`acc=True`	Agentic Context Compression	Long conversations, prevent drift
Knowledge	`memory=True` or `Memory(...)`	Agentic tools and/or auto-retrieval	Persistent knowledge, RAG, semantic search
Episodic	`episodic=True`	Graph-backed temporal recall	Cross-session experience, patterns
Cache	`cache=True`	Semantic tool output cache	Expensive/slow tool calls

Conversation History (Automatic)¶

Raw message concatenation. Every previous message in the thread is sent to the LLM automatically:

agent = Agent(name="Assistant", model="gpt4")  # conversation=True by default

await agent.run("Hi, I'm Alice", thread_id="session1")
await agent.run("What's my name?", thread_id="session1")

# Internally sends to LLM:
# [
#   {"role": "user", "content": "Hi, I'm Alice"},
#   {"role": "assistant", "content": "Hello Alice!"},
#   {"role": "user", "content": "What's my name?"}  # <-- Full history
# ]

Characteristics:

- ✅ Automatic - No tools needed, no LLM decision required
- ✅ Works immediately - LLM sees full context
- ✅ Perfect recall - Nothing lost from conversation
- ❌ Grows unbounded - Context window fills up over time
- ❌ No semantic search - Just chronological concatenation
- ❌ Session-bound - Lost when thread ends

When to use: Short sessions where full context fits in window.
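The unbounded growth noted above follows directly from how replay works. Here is a minimal sketch, assuming a per-thread list of messages that is replayed in full on every call. `ThreadHistory` is a hypothetical stand-in for illustration, not part of cogent's API.

```python
from collections import defaultdict


class ThreadHistory:
    """Hypothetical per-thread transcript store (illustration only)."""

    def __init__(self) -> None:
        self._threads = defaultdict(list)  # thread_id -> list of messages

    def append(self, thread_id: str, role: str, content: str) -> None:
        self._threads[thread_id].append({"role": role, "content": content})

    def build_prompt(self, thread_id: str, user_message: str) -> list:
        # Replay every stored message, then add the new user message.
        return [*self._threads[thread_id], {"role": "user", "content": user_message}]


history = ThreadHistory()
history.append("session1", "user", "Hi, I'm Alice")
history.append("session1", "assistant", "Hello Alice!")

prompt = history.build_prompt("session1", "What's my name?")
other = history.build_prompt("session2", "What's my name?")
print(len(prompt), len(other))
```

The prompt for `session1` carries all prior messages plus the new one, while a fresh `thread_id` starts with only the current message. Every stored exchange adds to the next prompt, which is why this mode suits short sessions.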

ACC (Agentic Context Compression)¶

Compresses growing conversation history into structured constraints and entities:

agent = Agent(name="Assistant", model="gpt4", acc=True)

# After many messages, ACC compresses into:
# Constraints: ["User prefers dark mode", "User timezone is EST", "Project deadline: March 1"]
# Entities: ["Alice (user)", "Project Alpha (active)", "Bob (team lead)"]
# Only compressed context sent to LLM, not full 50-message history

Characteristics:

- ✅ Bounded context - Prevents window overflow
- ✅ Automatic - No LLM tool calls needed
- ✅ Prevents drift - Maintains key facts across long sessions
- ✅ Structured - Constraints + Entities format
- ❌ Lossy - Some details discarded during compression

When to use: Long conversations that would otherwise exceed the context window.
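To make "bounded" and "lossy" concrete, here is a minimal sketch of a capped constraints-and-entities state. `BoundedContext`, its `max_items` cap, and the oldest-first eviction rule are assumptions made for illustration. In cogent the facts are extracted by the model, and the actual eviction policy may differ.

```python
from dataclasses import dataclass, field


@dataclass
class BoundedContext:
    """Hypothetical compressed state: capped lists of constraints and entities."""

    max_items: int = 3
    constraints: list = field(default_factory=list)
    entities: list = field(default_factory=list)

    def _add(self, bucket: list, item: str) -> None:
        if item in bucket:
            bucket.remove(item)  # re-mentioning a fact moves it to the newest slot
        bucket.append(item)
        # Evict the oldest items once the cap is exceeded (this is the lossy part).
        del bucket[: max(0, len(bucket) - self.max_items)]

    def add_constraint(self, text: str) -> None:
        self._add(self.constraints, text)

    def add_entity(self, text: str) -> None:
        self._add(self.entities, text)

    def render(self) -> str:
        lines = ["# Memory Context", "Constraints:"]
        lines += [f"- {c}" for c in self.constraints]
        lines += ["Entities:"] + [f"- {e}" for e in self.entities]
        return "\n".join(lines)


ctx = BoundedContext(max_items=2)
for fact in ["User prefers dark mode", "User timezone is EST", "Project deadline: March 1"]:
    ctx.add_constraint(fact)
ctx.add_entity("Alice (user)")
print(ctx.render())
```

With a cap of two, the earliest constraint is dropped when the third arrives. The rendered block stays the same size however long the conversation runs, at the cost of forgetting older details.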

Knowledge Memory (Agentic or Non-Agentic)¶

Memory supports two access patterns for persistent knowledge, controlled by the tools and retriever parameters on Memory:

Agentic (default) — LLM decides when to use memory¶

agent = Agent(name="Assistant", model="gpt4", memory=True)

# Agent gets memory tools automatically
# LLM decides when to use them:

await agent.run("Remember that I prefer dark mode")
# LLM calls: remember(key="user_preference", value="dark mode")

await agent.run("What's my UI preference?")
# LLM calls: recall(query="user preference")
# Returns: "dark mode"

Non-agentic — automatic retrieval into context¶

from cogent import Agent
from cogent.memory import Memory
from cogent.retrieval import DenseRetriever
from cogent.vectorstore import VectorStore

vs = VectorStore()
retriever = DenseRetriever(vs, score_threshold=0.7)

# Non-agentic only — no tools, auto-retrieval
agent = Agent(
    name="Assistant",
    model="gpt4",
    memory=Memory(retriever=retriever, tools=False),
)

# At each turn, the retriever runs against the user message
# and injects results as a SystemMessage ("# Relevant Knowledge")
await agent.run("Tell me about Python async")

Both — auto-retrieval plus tools¶

# Retriever injects relevant knowledge AND agent can use tools
agent = Agent(
    name="Assistant",
    model="gpt4",
    memory=Memory(retriever=retriever),  # tools=True by default
)

`retriever=`	`tools=`	Behavior
None	True (default)	Agentic only — `memory=True` backward compatible
set	False	Non-agentic only — auto-inject, no tools
set	True	Both — auto-inject + tools
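The table above can be read as a small decision function. This is a sketch only: `memory_mode` is a hypothetical helper, and the `"disabled"` result for the combination the table does not list is an assumption.

```python
from typing import Optional


def memory_mode(retriever: Optional[object], tools: bool = True) -> str:
    """Map a (retriever, tools) combination to the behavior named in the table."""
    if retriever is not None and tools:
        return "both"
    if retriever is not None:
        return "non-agentic"
    if tools:
        return "agentic"
    return "disabled"  # assumption: the table does not cover this combination


print(memory_mode(None))                   # agentic
print(memory_mode(object(), tools=False))  # non-agentic
print(memory_mode(object()))               # both
```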

Characteristics (agentic):

- ✅ Semantic search - Finds relevant memories by meaning
- ✅ Persistent - Survives across sessions/threads
- ✅ Selective - LLM stores only important info
- ❌ Requires LLM decision - LLM must choose to call tools
- ❌ Tool call overhead - Adds latency when used

Characteristics (non-agentic):

- ✅ Automatic - Retrieved and injected every turn, no LLM decision
- ✅ Backend-agnostic - Works with any BaseRetriever (vector, graph, hybrid)
- ✅ Configurable - top_k, score_threshold, reranking on the retriever
- ❌ Always runs - Retrieval cost every turn even if not needed

When to use: Agentic for selective recall. Non-agentic for knowledge bases, documents, and RAG-style workflows where context should always be available.
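For the non-agentic path, the per-turn step amounts to "score, filter, render". The sketch below assumes hits arrive as text and score pairs and that nothing is injected when no hit passes the threshold. `Hit` and `build_knowledge_message` are hypothetical names, and only the `# Relevant Knowledge` heading is taken from this page.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    text: str
    score: float


def build_knowledge_message(hits, top_k: int = 3, score_threshold: float = 0.7) -> Optional[dict]:
    """Filter scored hits and render the survivors as one system message."""
    kept = sorted(
        (h for h in hits if h.score >= score_threshold),
        key=lambda h: h.score,
        reverse=True,
    )[:top_k]
    if not kept:
        return None  # assumption: no message is injected when nothing qualifies
    body = "\n".join(f"- {h.text}" for h in kept)
    return {"role": "system", "content": f"# Relevant Knowledge\n{body}"}


hits = [
    Hit("asyncio runs coroutines on an event loop", 0.91),
    Hit("Threads share memory within a process", 0.55),
    Hit("await suspends a coroutine until a result is ready", 0.82),
]
message = build_knowledge_message(hits, top_k=2)
print(message["content"])
```

Because this step runs on every turn, `top_k` and `score_threshold` are the two levers that decide how much text lands in the prompt.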

Episodic Memory (Cross-Session Recall)¶

Graph-backed temporal memory that automatically records each conversation turn, extracts entities and relationships via LLM, and recalls structured facts from the semantic graph:

from cogent.memory import EpisodicMemory

agent = Agent(name="Assistant", model="gpt4", episodic=True)

# Session 1 — turns are recorded, entities are extracted into a graph
await agent.run("How does asyncio work?", thread_id="session-1", user_id="alice")
await agent.run("What about gather()?", thread_id="session-1", user_id="alice")

# Session 2 — structured facts are recalled via graph traversal
await agent.run("Does FastAPI use asyncio?", thread_id="session-2", user_id="alice")
# The agent sees a "Past Experiences" system message with
# entity-relationship facts from the semantic graph

User Identity and Recall Scope¶

Pass user_id to run() to identify who is using the agent. One user can have multiple threads. Episodic memory uses user_id to scope what past experiences are recalled:

# Default: scope="user" — each user sees only their own history
agent = Agent(name="Assistant", model="gpt4", episodic=True)

await agent.run("Rome tips?", thread_id="alice-1", user_id="alice")
await agent.run("French cuisine?", thread_id="bob-1", user_id="bob")

# Alice only recalls her own Rome history, not Bob's cuisine thread
await agent.run("Visit Naples?", thread_id="alice-2", user_id="alice")

The scope parameter on EpisodicMemory controls the recall boundary:

Scope	Recalls	Use case
`"user"` (default)	This user's episodes across all agents	Multi-user agents, shared episodic stores, privacy isolation
`"agent"`	All episodes from this agent regardless of user	Single-user, team/shared memory, knowledge transfer
`"global"`	All episodes across all agents and users	Cross-agent cross-user learning

# Shared team memory — all users benefit from each other's episodes
team_agent = Agent(
    name="TeamAssistant",
    model="gpt4",
    episodic=EpisodicMemory(scope="agent"),
)

user_id is also set on the RunContext metadata, making it accessible to tools and interceptors via ctx.metadata["user_id"].
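The three scopes can be pictured as filters over a shared episode store. This is a sketch under the assumption that each episode records the agent and user that produced it. `Episode` and `visible_episodes` are hypothetical and do not reflect cogent's internal storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    agent: str
    user_id: str
    text: str


def visible_episodes(episodes, scope: str, agent: str, user_id: str) -> list:
    """Return the episodes a recall may draw from under the given scope."""
    if scope == "user":
        return [e for e in episodes if e.user_id == user_id]
    if scope == "agent":
        return [e for e in episodes if e.agent == agent]
    if scope == "global":
        return list(episodes)
    raise ValueError(f"unknown scope: {scope!r}")


store = [
    Episode("Assistant", "alice", "Rome tips"),
    Episode("Assistant", "bob", "French cuisine"),
    Episode("Coder", "alice", "asyncio questions"),
]

for scope in ("user", "agent", "global"):
    found = visible_episodes(store, scope, agent="Assistant", user_id="alice")
    print(scope, [e.text for e in found])
```

Under `"user"`, Alice sees her own episodes from both agents but nothing from Bob. Under `"agent"`, she sees everything the `Assistant` agent recorded, including Bob's thread.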

Reflection (Episode → Lessons)¶

Pass reflection=True to EpisodicMemory to synthesize observations into higher-level lessons after each turn. Lessons are stored as kind="reflection" observations in the episodic graph and written to semantic memory (when available):

agent = Agent(
    name="Assistant",
    model="gpt4",
    # reflection=True reflects after every turn;
    # EpisodicMemory(reflection=3) would reflect every 3 turns instead
    episodic=EpisodicMemory(reflection=True),
)

Reflections answer "what patterns emerge?" rather than "what happened?" — closing the loop between raw experience and reusable knowledge.
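The cadence rule (`True` for every turn, an integer N for every N turns) can be sketched as a small predicate. `should_reflect` is a hypothetical helper. Counting turns from 1 and firing on multiples of N are assumptions about how the cadence is applied.

```python
from typing import Union


def should_reflect(reflection: Union[bool, int], turn: int) -> bool:
    """Decide whether to reflect after the given 1-based turn number."""
    if isinstance(reflection, bool):  # check bool first, since True is also an int
        return reflection
    if reflection <= 0:
        return False  # assumption: non-positive cadences disable reflection
    return turn % reflection == 0


print([t for t in range(1, 8) if should_reflect(3, t)])     # [3, 6]
print([t for t in range(1, 4) if should_reflect(True, t)])  # [1, 2, 3]
```

The `bool` check has to come first because Python treats `True` as the integer 1. Either order would fire every turn for `True`, but the explicit check keeps `False` from reaching the modulo and raising `ZeroDivisionError`.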

Pass a shared EpisodicMemory instance to let multiple agents share the same experience graph:

episodic = EpisodicMemory()

coder = Agent(name="Coder", model="gpt4", episodic=episodic)
reviewer = Agent(name="Reviewer", model="gpt4", episodic=episodic)

Characteristics:

- ✅ Automatic - Records every turn and recalls relevant experiences without LLM decisions
- ✅ Cross-session - Observations from any prior thread are searchable
- ✅ User-scoped - user_id isolates recall boundaries; configurable via scope
- ✅ Graph-backed - Uses Personalized PageRank for relevance ranking
- ✅ Shareable - Multiple agents can share one episodic store
- ✅ Reflective - Optional LLM synthesis distills lessons from raw observations
- ❌ Keyword-seeded - Recall quality depends on term overlap with past content

When to use: Multi-session agents, learning from past interactions, agents that should improve over time.
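To show why recall is "keyword-seeded", here is a generic Personalized PageRank sketch in plain Python. It is a textbook power iteration, not cogent's implementation. The toy graph, the substring-based seeding, and the function name are assumptions for illustration.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration where teleport mass returns only to the seed nodes."""
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for node, neighbors in graph.items():
            if neighbors:
                share = damping * rank[node] / len(neighbors)
                for nb in neighbors:
                    nxt[nb] += share
            else:
                # Dangling node: return its mass to the seeds.
                for n in nodes:
                    nxt[n] += damping * rank[node] * restart[n]
        rank = nxt
    return rank


# Toy entity graph; every neighbor must also appear as a key.
graph = {
    "asyncio": ["gather", "event loop"],
    "gather": ["asyncio"],
    "event loop": ["asyncio"],
    "FastAPI": ["asyncio"],
    "Rome": ["Naples"],
    "Naples": ["Rome"],
}

query = "Does FastAPI use asyncio?"
# Keyword seeding: entities whose names appear in the query become seeds.
seeds = {n for n in graph if n.lower() in query.lower()}
scores = personalized_pagerank(graph, seeds)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {score:.3f}")
```

Entities connected to the seeds receive rank, while the unrelated Rome and Naples cluster receives none. If the query shares no terms with stored entities there are no seeds to start from, which is the weakness the last characteristic describes.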

Semantic Cache (Tool Outputs)¶

Caches tool results by semantic similarity to avoid redundant calls:

agent = Agent(name="Assistant", model="gpt4", cache=True)

# First call
await agent.run("Search for Python tutorials")  
# Calls search_tool(), caches result

# Similar query (different wording)
await agent.run("Find Python learning resources")  
# Cache hit! Returns previous result without calling search_tool()

Characteristics:

- ✅ Speeds up repeated queries - Avoids slow/expensive tool calls
- ✅ Semantic matching - Recognizes similar queries
- ✅ Transparent - LLM doesn't know cache is used
- ❌ Can return stale data - Cached results may be outdated
- ❌ Storage overhead - Caches all tool outputs

When to use: Expensive API calls, slow database queries, rate-limited services.
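Semantic matching can be sketched as a nearest-neighbor lookup with a similarity threshold. `SemanticCache`, the 0.9 threshold, and the hand-written vectors are assumptions for illustration. A real setup would embed the tool arguments with an embedding model.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical cache keyed by embedding similarity rather than exact text."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self._entries = []  # list of (embedding, result) pairs

    def get(self, embedding):
        best = max(self._entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) >= self.threshold:
            return best[1]
        return None

    def put(self, embedding, result):
        self._entries.append((embedding, result))


cache = SemanticCache(threshold=0.9)
# Hand-written vectors stand in for real query embeddings.
cache.put([0.9, 0.1, 0.0], "results for: Python tutorials")
hit = cache.get([0.85, 0.2, 0.0])   # close to the stored vector
miss = cache.get([0.0, 0.1, 0.9])   # points in a different direction
print(hit, miss)
```

The threshold is the trade-off knob. Lowering it produces more hits but raises the chance of returning a result for a query that only looks similar, which compounds the stale-data risk listed above.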

Automatic vs Explicit¶

The key distinction is who decides to use memory:

# Conversation history (AUTOMATIC)
agent = Agent(name="Assistant", model="gpt4")  # conversation=True default
await agent.run("I'm Alice", thread_id="s1")
await agent.run("My name?", thread_id="s1")
# ✅ Works! History automatically sent to LLM
# No tool calls, no LLM decision needed

# Knowledge memory tools (EXPLICIT - LLM decides)
agent = Agent(name="Assistant", model="gpt4", memory=True)
await agent.run("Remember I'm Alice")
# ⚠️ LLM may or may not call remember() - it decides

await agent.run("My name?")
# ⚠️ LLM may or may not call recall() - it decides
# If LLM doesn't call the tool, memory isn't used!

# Non-agentic retrieval (AUTOMATIC)
agent = Agent(
    name="Assistant", model="gpt4",
    memory=Memory(retriever=retriever, tools=False),
)
await agent.run("Tell me about X")
# ✅ Relevant knowledge auto-injected as SystemMessage
# No LLM decision needed — retriever runs every turn

Recommendation: Use conversation memory for short-term context, agentic knowledge memory for selective recall, non-agentic knowledge memory for knowledge bases and RAG workflows.

How Memory Injects Into Context¶

Each memory system reaches the LLM through a different mechanism. Understanding the injection pattern matters for debugging, prompt engineering, and deciding which systems to combine.

Injection Patterns¶

System	Injection	Who Decides	Timing
Conversation	Messages replayed into history	Automatic	Before first LLM call
ACC	Single `SystemMessage` (`# Memory Context`)	Automatic	Before first LLM call (replaces transcript)
Knowledge (agentic)	`ToolMessage` results from `recall()`/`search()`	LLM decides	During agentic loop
Knowledge (non-agentic)	Single `SystemMessage` (`# Relevant Knowledge`)	Automatic	Before first LLM call
Episodic	Single `SystemMessage` (`# Past Experiences`)	Automatic	Before first LLM call
Cache	Cached `ToolMessage` (substituted transparently)	Automatic	During tool execution

What the LLM Sees¶

When multiple layers are active, the message array sent to the LLM follows this order:

┌─────────────────────────────────────────────────────┐
│ SystemMessage   — agent instructions                │  always
├─────────────────────────────────────────────────────┤
│ SystemMessage   — # Memory Context                  │  ACC (if enabled)
│                   compressed constraints + entities  │
├─────────────────────────────────────────────────────┤
│ SystemMessage   — # Relevant Knowledge              │  Knowledge non-agentic (if retriever set)
│                   auto-retrieved documents           │
├─────────────────────────────────────────────────────┤
│ SystemMessage   — # Past Experiences                │  Episodic (if enabled)
│                   recalled observations from graph  │
├─────────────────────────────────────────────────────┤
│ HumanMessage    — prior user messages               │  Conversation only (skipped when ACC active)
│ AIMessage       — prior assistant replies            │
│ ...             — full transcript                    │
├─────────────────────────────────────────────────────┤
│ HumanMessage    — current user task                 │  always
├─────────────────────────────────────────────────────┤
│ AIMessage       — tool calls (if any)               │  agentic loop
│ ToolMessage     — tool results / recall / cache hit │
│ AIMessage       — final response                    │
└─────────────────────────────────────────────────────┘
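The ordering in the diagram can be sketched as a small assembly function. `assemble_messages` and its parameters are hypothetical. The sketch covers only the messages built before the first LLM call and leaves out the agentic loop at the bottom of the diagram.

```python
def assemble_messages(instructions, task, transcript=(), acc_context=None,
                      knowledge=None, experiences=None):
    """Order context layers the way the diagram above lists them."""
    messages = [{"role": "system", "content": instructions}]
    for heading, body in (
        ("# Memory Context", acc_context),
        ("# Relevant Knowledge", knowledge),
        ("# Past Experiences", experiences),
    ):
        if body is not None:
            messages.append({"role": "system", "content": f"{heading}\n{body}"})
    if acc_context is None:
        # The raw transcript is replayed only when ACC is not supplying context.
        messages.extend(transcript)
    messages.append({"role": "user", "content": task})
    return messages


transcript = [
    {"role": "user", "content": "Hi, I'm Alice"},
    {"role": "assistant", "content": "Hello Alice!"},
]
plain = assemble_messages("Be helpful.", "What's my name?", transcript=transcript)
compressed = assemble_messages(
    "Be helpful.", "What's my name?", transcript=transcript,
    acc_context="Entities: Alice (user)", experiences="Alice asked about Rome",
)
print([m["role"] for m in plain])
print([m["content"].splitlines()[0] for m in compressed])
```

With ACC off, the transcript sits between the instructions and the current task. With ACC on, the same transcript is still passed in but is not replayed.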

Key Interactions¶

Conversation + ACC: Conversation always records the raw transcript. When ACC is enabled, the LLM sees a compressed SystemMessage instead of replayed history messages. Conversation still stores every exchange — ACC just controls what the model sees.

Knowledge memory has two modes: With tools=True (agentic), the LLM must decide to call recall() or search_memories() — nothing is injected automatically, so the agent can miss facts if it does not think to look. With retriever= (non-agentic), relevant knowledge is auto-retrieved and injected as a SystemMessage before the LLM runs. Both modes can be active simultaneously.

Episodic is automatic: Past experiences matching the current query are retrieved and injected before the LLM runs. The agent never calls a tool — it simply receives relevant observations as background context.

Cache is invisible: The LLM never knows caching happened. When a tool call matches a cached entry, the cached result is returned as the tool's output. The agent sees the same ToolMessage it would have seen without cache — just faster.

Context Budget and Composition¶

When multiple memory systems are active, each one injects content into the context window independently. There is no shared budget or coordination between them. Understanding what ACC compresses — and what it does not — is critical for avoiding context overflow in production.

What ACC Compresses¶

ACC replaces the conversation transcript only. After a turn completes, update_from_turn() receives exactly two inputs:

- user_message: the current user task
- assistant_message: the final model response

It does not receive or compress:

- The # Relevant Knowledge SystemMessage (non-agentic retrieval)
- The # Past Experiences SystemMessage (episodic recall)
- Tool call results from the agentic loop (including cached results)

This means ACC bounds the conversation history portion of the context, but episodic and knowledge injections are re-fetched fresh every turn and added on top.

Independent Injection Pipelines¶

Each system is a separate pipe into the context window:

Context Window
├── Instructions (fixed)
├── # Memory Context          ← ACC-bounded (constraints + entities)
├── # Relevant Knowledge      ← unbounded (retriever controls volume)
├── # Past Experiences        ← unbounded (top_k observations)
├── Current task (fixed)
└── Agentic loop (tool calls + cache hits)

ACC keeps its own section bounded, but cannot shrink what the other systems inject. If the retriever returns 20 paragraphs or episodic returns 50 observations, that context is added every turn regardless of ACC.
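A rough back-of-the-envelope model makes the point numerically. All sizes below are made-up character counts, and modeling ACC as a hard cap on the history portion is a simplification. `context_size` is a hypothetical helper.

```python
def context_size(turn, acc, per_turn_transcript=400, acc_cap=1200,
                 knowledge=3000, experiences=1500):
    """Rough per-turn context size in characters under made-up section sizes."""
    transcript = turn * per_turn_transcript
    history = min(transcript, acc_cap) if acc else transcript
    # Knowledge and episodic injections are added on top every turn.
    return history + knowledge + experiences


for turn in (1, 10, 50):
    print(turn, context_size(turn, acc=False), context_size(turn, acc=True))
```

Under these assumed sizes, ACC flattens the growth curve, but the context never drops below the 4500 characters contributed by the other two pipelines. Reducing that floor takes retriever and episodic limits, not ACC.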

Practical Implications¶

Scenario	Risk	Mitigation
ACC + large knowledge base	Retriever results fill window despite ACC bounds	Set `top_k` and `score_threshold` on the retriever
ACC + many episodic sessions	Observations accumulate across sessions	Episodic `top_k=5` (default) limits per-turn injection
ACC + non-agentic + episodic	Three SystemMessages compete for space	Monitor total injected tokens via observer events
All systems active	No single system sees the total budget	Use observer to track combined context size

Recommendations¶

- Tune retriever limits. Set top_k and score_threshold on DenseRetriever to cap how much knowledge is injected per turn. You can also pass retrieval_k on Agent to override the retriever's default at the agent level.
- Monitor with observer. Enable memory_events=True to see how much each system injects. Watch [memory-retrieved] result counts and [episodic-recalled] observation counts.
- Cap episodic recall. The default top_k=5 is conservative. Increase only when the agent genuinely benefits from more cross-session context.
- Set a memory budget. Use memory_budget on Agent to cap each injected SystemMessage to a maximum character count. Content exceeding the limit is truncated with a [truncated] marker.
- Agentic knowledge is self-regulating. When tools=True, the LLM decides whether to call recall() or search_memories(). Only results the LLM requests enter the context. Non-agentic mode injects every turn.
- Cache does not affect context size. It substitutes tool results transparently, so the injected ToolMessage is the same size whether cached or fresh.
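The memory budget recommendation describes character-level truncation with a marker. Below is a minimal sketch of that behavior. `apply_budget` is a hypothetical helper, and counting the marker against the budget is an assumption, since the exact placement and accounting are not specified here.

```python
def apply_budget(content, memory_budget=None, marker="[truncated]"):
    """Cap content to memory_budget characters, ending with a marker when cut."""
    if memory_budget is None or len(content) <= memory_budget:
        return content  # None means unlimited
    # Assumption: the marker itself counts toward the budget.
    keep = max(0, memory_budget - len(marker))
    return content[:keep] + marker


section = "# Past Experiences\n" + "observation " * 20
capped = apply_budget(section, memory_budget=60)
print(len(section), len(capped))
print(capped)
```

Truncation is blunt because it cuts at a character offset, not at an item boundary. Lowering `top_k` on the source usually gives cleaner results than relying on the budget alone.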

Agent Parameters for Context Control¶

from cogent import Agent
from cogent.memory import EpisodicMemory, Memory

agent = Agent(
    name="Assistant",
    model="gpt-5.4",
    acc=True,
    episodic=EpisodicMemory(top_k=3, scope="user"),
    memory=Memory(retriever=retriever, tools=False),
    # Context budget controls
    memory_budget=8000,     # Max chars per injected SystemMessage
    retrieval_k=2,          # Retrieve 2 documents instead of retriever default
)

Agent-level parameters:

Parameter	Default	Controls
`memory_budget`	`None` (unlimited)	Max chars per injected SystemMessage (ACC, episodic, knowledge)
`retrieval_k`	`None` (retriever default)	Number of retriever results per turn

EpisodicMemory parameters:

Parameter	Default	Controls
`top_k`	`5`	Number of facts recalled per turn
`scope`	`"user"`	Recall boundary: `"user"`, `"agent"`, or `"global"`
`reflection`	`False`	Reflection cadence: `True` (every turn) or `int` N (every N turns)
`model`	`None`	Chat model for entity extraction (set automatically by Agent)

Observability¶

The memory stack emits lifecycle events so memory behavior is not a black box. Enable with Observer(memory_events=True) or level="debug". The agent's observer auto-propagates to Memory — no explicit Memory(observer=...) wiring is needed.

Each memory system gets a distinct label prefix in the console so you can tell at a glance which system emitted an event:

System	Console Label	Events	What You See
Conversation	`[conversation-*]`	`loaded`, `saved`	Thread history resume/save, message counts
ACC	`[acc-*]`	`loaded`, `saved`, `context`, `updated`	Bounded state size, item breakdown
Knowledge (agentic)	`[tool-*]`	`called`, `result`	Agent decisions to remember/recall/search
Knowledge (non-agentic)	`[memory-*]`	`retrieved`	Retriever name, query, scores, duration
Episodic	`[episodic-*]`	`recalled`, `recorded`, `reflected`	Observation counts, source episodes, query, lesson count
Cache	`[cache-*]`	`hit`, `miss`, `write`	Tool name, cache key, similarity score

Notes:

- Agentic knowledge memory uses [tool-*] events from the ToolFormatter; non-agentic uses [memory-retrieved].
- [tool-*] events remain the canonical signal that an agent selected and used a tool.
- [cache-*] is emitted during cached tool invocation paths when tool-level caching is enabled.
- Low-level storage events (memory.read, memory.write) are emitted internally but suppressed from console output because tool events already surface them.
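When reading captured console output, the label prefixes in the table are enough to tally activity per system. The sketch below assumes each line contains a bracketed `[prefix-event]` label and nothing more about the line format. `count_events` is a hypothetical helper, and the sample lines are bare labels rather than real log output.

```python
import re
from collections import Counter

# Prefix-to-system mapping taken from the table above.
PREFIX_TO_SYSTEM = {
    "conversation": "Conversation",
    "acc": "ACC",
    "tool": "Knowledge (agentic)",
    "memory": "Knowledge (non-agentic)",
    "episodic": "Episodic",
    "cache": "Cache",
}

LABEL = re.compile(r"\[([a-z]+)-([a-z]+)\]")


def count_events(lines):
    """Count events per memory system from bracketed console labels."""
    counts = Counter()
    for line in lines:
        match = LABEL.search(line)
        if match and match.group(1) in PREFIX_TO_SYSTEM:
            counts[PREFIX_TO_SYSTEM[match.group(1)]] += 1
    return counts


sample = [
    "[conversation-loaded]",
    "[memory-retrieved]",
    "[episodic-recalled]",
    "[memory-retrieved]",
    "[cache-hit]",
]
counts = count_events(sample)
print(dict(counts))
```

Note that `[tool-*]` covers every tool call, not only memory tools. A tally under "Knowledge (agentic)" is therefore an upper bound unless the lines are also filtered by tool name.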