Module 0 · The AI Substrate

Token Budget & Context Window

Tokens are the atomic unit of cost, latency, and attention in every LLM interaction. The context window is the finite working memory that must accommodate everything the agent needs to know, reason about, and produce.
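
A quick way to make tokens concrete is to count them. A minimal sketch using the tiktoken library; the cl100k_base encoding is an assumption, so match it to whatever model you actually call:

```python
# Count tokens with tiktoken. "cl100k_base" is an assumed encoding;
# pick the one that matches the model you actually call.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

text = "Tokens are the atomic unit of cost, latency, and attention."
tokens = encoding.encode(text)

print(f"{len(tokens)} tokens for {len(text)} characters")
# Rule of thumb for English text: roughly 4 characters per token.
```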

Context Window Allocation
Where Do the Tokens Go?
A 128K-token context window seems enormous — until you account for everything an agent needs to carry.
Of the 128,000-token context window, 35,000 tokens are consumed before the user's message even arrives:

System Prompt (identity, guidelines, constraints): 2,000 tokens · 1.6%
Tool Definitions (names, descriptions, schemas): 3,000 tokens · 2.3%
Conversation History (prior turns in the session): 10,000 tokens · 7.8%
Retrieved Context (RAG documents, knowledge): 20,000 tokens · 15.6%
Model Response (generated output tokens): ~4,000 tokens · 3.1%
Available Capacity (remaining headroom): 89,000 tokens · 69.5%
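
The arithmetic behind this breakdown is worth automating, since every component you add eats headroom. A minimal sketch using the illustrative figures above (they are examples, not universal constants):

```python
# Context-budget accounting. All figures are the illustrative numbers
# from the table above, not universal constants.
CONTEXT_WINDOW = 128_000

budget = {
    "system_prompt": 2_000,
    "tool_definitions": 3_000,
    "conversation_history": 10_000,
    "retrieved_context": 20_000,
    "model_response": 4_000,
}

available = CONTEXT_WINDOW - sum(budget.values())

for name, tokens in budget.items():
    print(f"{name:21s} {tokens:>7,} tokens  {tokens / CONTEXT_WINDOW:5.1%}")
print(f"{'available_capacity':21s} {available:>7,} tokens  {available / CONTEXT_WINDOW:5.1%}")
```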
Three Dimensions

Cost

Every token has a price. Output tokens cost 3–5× as much as input tokens, and costs compound across turns because the full conversation history is re-sent as input on every call.

Input: $3–30/M tokens
Output: $15–100/M tokens
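
A minimal sketch of the compounding, with hypothetical prices from the low end of the ranges above and assumed per-turn message sizes:

```python
# Multi-turn cost estimate. Prices are hypothetical, taken from the
# low end of the ranges above: $3/M input, $15/M output.
INPUT_PRICE = 3 / 1_000_000    # dollars per input token
OUTPUT_PRICE = 15 / 1_000_000  # dollars per output token

def conversation_cost(turns: int, message_tokens: int = 2_000,
                      reply_tokens: int = 500) -> float:
    """Each turn re-sends the growing history as input, so input volume
    rises linearly per turn and total cost grows roughly quadratically."""
    history = 0
    total = 0.0
    for _ in range(turns):
        input_tokens = history + message_tokens
        total += input_tokens * INPUT_PRICE + reply_tokens * OUTPUT_PRICE
        history = input_tokens + reply_tokens  # history carries forward
    return total

print(f" 5 turns: ${conversation_cost(5):.4f}")
print(f"20 turns: ${conversation_cost(20):.4f}")
```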

Latency

Output tokens are generated sequentially, so more tokens means a longer wait: a 2,000-token response takes roughly 10× longer to decode than a 200-token one.

First token: 200–500ms
Full response: 2–15 seconds
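
A back-of-envelope model of those numbers: total latency is time-to-first-token plus sequential decoding. The 150 tokens/second decode rate below is a hypothetical figure; measure your own.

```python
# Back-of-envelope latency: time to first token plus sequential decode.
# The 150 tokens/second rate is a hypothetical figure, not a benchmark.
FIRST_TOKEN_SECONDS = 0.35   # midpoint of the 200-500 ms range above
TOKENS_PER_SECOND = 150.0

def response_latency(output_tokens: int) -> float:
    return FIRST_TOKEN_SECONDS + output_tokens / TOKENS_PER_SECOND

print(f"  200 tokens: {response_latency(200):4.1f} s")
print(f"2,000 tokens: {response_latency(2_000):4.1f} s")  # decode time alone is 10x
```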

Attention

Tokens are the unit of attention. Reasoning quality degrades as token count grows. More tokens ≠ better results.

Quality peaks: first & last
Degraded: middle positions
Attention Pattern
The "Lost in the Middle" Effect
Models attend most strongly to information at the beginning and end of the context window. Information buried in the middle receives significantly less attention — even when it's the most relevant.
[Figure: attention across context positions, from beginning to end of context. Attention is high at both edges and low in the middle.]

Architectural Implication

Place critical information — system instructions, key facts, decision criteria — at the beginning or end of the context. Use retrieval strategies that surface only the most relevant chunks rather than flooding the context with marginally related content.
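
A minimal sketch of that placement strategy; the section names and ordering are illustrative, not a prescribed API:

```python
# Assemble a prompt so critical content sits at the edges, where
# attention is strongest; bulky retrieved material goes in the middle.
# Section names and ordering are illustrative, not a prescribed API.

def assemble_prompt(system: str, retrieved_chunks: list[str],
                    key_criteria: str, question: str) -> str:
    parts = [
        system,                          # beginning: identity and rules
        "\n\n".join(retrieved_chunks),   # middle: bulk context
        key_criteria,                    # end: restate what matters most
        question,                        # end: the actual task
    ]
    return "\n\n---\n\n".join(parts)

prompt = assemble_prompt(
    system="You are a contracts analyst. Answer only from the documents.",
    retrieved_chunks=["<doc 1>", "<doc 2>", "<doc 3>"],  # top-k only, not everything
    key_criteria="Decision criterion: flag any clause with unlimited liability.",
    question="Which clauses expose us to unlimited liability?",
)
print(prompt)
```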

Multi-Agent Impact
Context Handoff: Naive vs. Compressed
Every agent-to-agent handoff serializes context into tokens. The handoff design determines whether costs compound or stay controlled.
Naive handoff (each agent forwards its full context):
Retrieval Agent: 50K tokens
Analysis Agent: 50K inherited + 30K new = 80K tokens
Summary Agent: 80K inherited + 20K new = 100K tokens
Total processed: 230K tokens

Compressed handoff (each context is summarized before forwarding):
Retrieval Agent: 50K tokens
Analysis Agent: 10K summary + 30K new = 40K tokens
Summary Agent: 8K summary + 20K new = 28K tokens
Total processed: 118K tokens
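
A minimal sketch reproducing both totals; summarize() is a hypothetical stand-in for whatever compression step (an LLM summarization call, structured extraction) the handoff actually uses:

```python
# Reproduce both totals from the comparison above. summarize() is a
# hypothetical stand-in for the real compression step; here it simply
# caps the carried token count at a target size.

NEW_WORK = [("retrieval", 50_000), ("analysis", 30_000), ("summary", 20_000)]

def summarize(context_tokens: int, target: int) -> int:
    return min(context_tokens, target)

def pipeline_total(compress_to: dict[str, int] | None = None) -> int:
    carried, total = 0, 0
    for agent, new_tokens in NEW_WORK:
        context = carried + new_tokens  # what this agent must process
        total += context
        if compress_to and agent in compress_to:
            carried = summarize(context, compress_to[agent])
        else:
            carried = context           # naive: forward everything
    return total

print(f"naive:      {pipeline_total():,} tokens")   # 230,000
print(f"compressed: {pipeline_total({'retrieval': 10_000, 'analysis': 8_000}):,} tokens")  # 118,000
```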