Module 0 · The AI Substrate

Memory Architecture Stack

LLMs are stateless by default; every call starts from zero. Everything an agent "remembers" is the result of deliberate engineering: information explicitly placed into the context window at inference time.
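A minimal sketch of what statelessness means in practice. The call_model() stub below is a stand-in for any chat-style completion endpoint; the only facts available on a given call are the ones re-sent in that call's messages.

```python
def call_model(messages: list[dict]) -> str:
    """Stand-in for any chat-completion endpoint (hosted or local); illustrative only."""
    raise NotImplementedError

turn_1 = [{"role": "user", "content": "My client is risk-averse."}]
turn_2 = [{"role": "user", "content": "Recommend a portfolio mix."}]

# The two calls are independent. Nothing from turn_1 reaches the second call
# unless the application re-sends it:
#   call_model(turn_2)            # the model has never seen "risk-averse"
#   call_model(turn_1 + turn_2)   # "memory" is literally re-sent text
```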

Layered Architecture
Four Layers of Agent Memory
Each layer serves a different purpose, operates at a different time scale, and competes for space in the same finite context window.
Context Window · 128K tokens
System Prompt: identity, guidelines, tools
Conversation History: turn-by-turn session continuity
Retrieved Knowledge: query-relevant RAG documents
Persistent Memory: cross-session entity knowledge
System Prompt
Static
"Who am I and what are my rules?"
Foundation layer. Contains agent identity, behavioral guidelines, tool descriptions, and standing instructions. Changes rarely, only when the agent is reconfigured.
Conversation History
Dynamic
"What did we just discuss?"
Most dynamic layer. Grows with each turn and may need summarization or truncation as it approaches the window limit. Provides turn-by-turn continuity within the current session.
Retrieved Knowledge (RAG)
Query-dependent
"What does the organization know about this?"
Domain-specific information drawn from knowledge bases. Populated dynamically based on the current query. Changes with each turn as the topic shifts.
Persistent Memory
Slow-moving
"What do we know about this customer?"
Cross-session context about entities, relationships, and learned preferences. Updates only when the agent encounters information worth retaining.
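A sketch of how the four layers might be assembled into a single request at inference time. The function name and message layout here are illustrative, not part of the module:

```python
def build_context(system_prompt: str,
                  history: list[dict],
                  retrieved_docs: list[str],
                  memory_facts: list[str]) -> list[dict]:
    """Combine the four memory layers into one request's message list.

    All four layers compete for the same finite window, so the caller is
    assumed to have already trimmed or summarized each one to budget.
    """
    messages = [{"role": "system", "content": system_prompt}]         # layer 1: static identity and rules
    if memory_facts:                                                  # layer 4: cross-session knowledge
        messages.append({"role": "system",
                         "content": "Known about this customer:\n" + "\n".join(memory_facts)})
    if retrieved_docs:                                                # layer 3: query-dependent RAG results
        messages.append({"role": "system",
                         "content": "Relevant documents:\n\n" + "\n\n".join(retrieved_docs)})
    messages.extend(history)                                          # layer 2: the current session's turns
    return messages
```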
Fundamental Property
Every Call Starts from Zero
The model's weights do not change between calls. No internal register updates. The illusion of memory is entirely constructed by the application layer.
API Call #1: processes the claim, learns the client is risk-averse (context loaded).
State reset.
API Call #2: starts fresh and knows nothing from Call #1 (blank slate).
With a memory system: the external store re-injects "risk-averse" into the context (context restored).

The model never learns in real time

Its weights do not update. Its capabilities do not improve through use. Every perceived adaptation is the product of external systems feeding better information into the same stateless engine.
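A sketch of the sequence above. The in-memory dictionary is a stand-in for whatever external store the application uses; the model itself never holds the fact.

```python
memory_store: dict[str, list[str]] = {}          # stand-in for an external store, keyed by client ID

def remember(client_id: str, fact: str) -> None:
    memory_store.setdefault(client_id, []).append(fact)

def recall(client_id: str) -> list[str]:
    return memory_store.get(client_id, [])

# API Call #1: the application, not the model, decides the fact is worth keeping.
remember("client-42", "risk-averse")

# State reset: the model retains nothing between calls.

# API Call #2: the store re-injects the fact into the fresh context.
note = "Known client facts: " + "; ".join(recall("client-42"))
# `note` is then prepended to the prompt for the second call.
```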

Approaches
Three Memory Strategies
Production agents combine these approaches into a layered architecture, allocating context window budget dynamically based on the current task.
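A sketch of dynamic budget allocation across the layers. The proportions and task labels are invented for illustration; the point is only that the split shifts with the task rather than being fixed.

```python
def allocate_budget(task: str, window_tokens: int = 128_000) -> dict[str, int]:
    """Split the context window across the four layers based on the task at hand."""
    if task == "knowledge_lookup":      # retrieval-heavy: most of the window goes to RAG
        shares = {"system": 0.05, "history": 0.15, "retrieved": 0.60, "memory": 0.20}
    else:                               # long dialogue: conversation history dominates
        shares = {"system": 0.05, "history": 0.65, "retrieved": 0.20, "memory": 0.10}
    return {layer: int(window_tokens * share) for layer, share in shares.items()}
```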

Conversation History

Prepend prior exchanges to each request. Works well for short, contained sessions. Breaks down as conversations grow longer.

Scope: within session
Scale: summarize at limits
Tradeoff: continuity vs. granularity
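A sketch of the summarize-at-limits behavior. The token counter and summarizer are placeholders; in practice you would use the target model's tokenizer and an LLM call for the summary.

```python
def trim_history(history: list[dict], max_tokens: int = 8_000) -> list[dict]:
    """Keep recent turns verbatim; fold older turns into a single summary message."""
    def count_tokens(msgs: list[dict]) -> int:
        return sum(len(m["content"]) // 4 for m in msgs)     # rough 4-chars-per-token estimate

    def summarize(msgs: list[dict]) -> str:                   # placeholder for an LLM summarization call
        return "Earlier in this session: " + " | ".join(m["content"] for m in msgs)[:1000]

    if count_tokens(history) <= max_tokens:
        return history                                        # full granularity, full continuity
    recent = history[-6:]                                     # the tradeoff: older turns lose granularity
    summary = {"role": "system", "content": summarize(history[:-6])}
    return [summary] + recent
```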

RAG

Retrieve relevant chunks from external knowledge bases and inject them into the context. Connects agents to organizational knowledge they were never trained on.

Scope: organizational knowledge
Scale: scales with embeddings
Tradeoff: precision vs. noise
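A sketch of the retrieve-and-inject step, assuming the knowledge-base chunks have already been embedded and the query vector comes from the same embedding model:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec: list[float],
             chunks: list[tuple[str, list[float]]],
             k: int = 3) -> list[str]:
    """Rank pre-embedded chunks against the query and return the top-k texts."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# The top-k chunks are then injected into the context, e.g. as an extra system
# message: "Relevant documents:\n\n" + "\n\n".join(retrieve(q_vec, knowledge_base))
```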

External Memory

Persist agent-learned facts to structured stores. Cross-session knowledge about entities, preferences, and learned behaviors.

Scope: cross-session
Scale: selective persistence
Tradeoff: retention vs. noise
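A sketch of selective persistence to a structured store. The JSON file and the crude salience flag stand in for whatever database and retention policy a real system would use.

```python
import json
from pathlib import Path

MEMORY_PATH = Path("agent_memory.json")        # illustrative; production systems use a real datastore

def load_memory() -> dict[str, list[str]]:
    return json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else {}

def save_fact(entity: str, fact: str, salient: bool) -> None:
    """Persist a fact only if it was judged worth retaining (retention vs. noise)."""
    if not salient:                            # placeholder for a real salience/novelty check
        return
    memory = load_memory()
    facts = memory.setdefault(entity, [])
    if fact not in facts:
        facts.append(fact)
    MEMORY_PATH.write_text(json.dumps(memory, indent=2))

# Next session, load_memory().get("client-42", []) is re-injected into the agent's context.
```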