Every architectural decision you make when building AI agents traces back to the same set of constraints: how the underlying language model actually processes information. Not how it seems to process information, not the abstraction your SDK presents, but the raw mechanics of tokens flowing through a neural network with finite capacity. Teams that ignore these mechanics build agents that are slow, expensive, unreliable, or all three. Teams that understand them make better trade-offs at every level of the stack, from prompt design to memory architecture to multi-agent orchestration.

The gap between “I know LLMs use tokens” and “I understand how token economics shape my agent’s cost structure” is the gap between a demo and a production system. This is the operational reality that underpins everything else in agent design.

What Tokens Actually Are

A token is the fundamental unit of text that a large language model reads and generates. It is not a word, not a character, and not a sentence. It is a chunk of text, typically three to four characters in English, determined by a tokenization algorithm that the model learned during training. The word “understanding” might be two tokens (“under” and “standing”), while a common short word like “the” is a single token. Punctuation, whitespace, code syntax, and special characters all consume tokens. A number like “2026” might be one token or two, depending on the model’s vocabulary.
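As a rough illustration, a tokenizer library such as tiktoken can show how text splits into tokens. The counts it reports are illustrative only and will differ across model vocabularies.

```python
# A minimal sketch using the tiktoken library to inspect tokenization.
# Exact token counts vary by model vocabulary; treat the output as illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one common encoding; others split differently

for text in ["the", "understanding", "2026", "def parse(x):"]:
    tokens = enc.encode(text)
    print(f"{text!r} -> {len(tokens)} token(s): {tokens}")
```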

This matters for three reasons that directly affect agent design. First, every token has a cost. When you call a model API, you pay per input token and per output token, with output tokens typically costing three to five times more than input tokens. A customer support agent that includes the entire conversation history in every call is multiplying its cost with each turn. A claims processing agent that dumps an entire policy document into the prompt when it only needs a coverage summary is spending money on tokens that add no value. Second, every token has a latency cost. Models generate output tokens sequentially, one at a time. The more tokens your agent requests in a response, the longer the user waits. An agent that produces a 2,000-token response when a 200-token response would suffice is not just wasteful, it is slow. Third, tokens are the unit of attention. The model processes relationships between tokens, and its ability to reason about information degrades as the number of tokens grows. More tokens do not always mean better results.
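To see how the first two costs interact, here is a back-of-the-envelope cost sketch. The per-million-token prices are assumptions standing in for whatever your provider currently charges; only the arithmetic is the point.

```python
# Rough cost model for a single agent call.
# Prices below are hypothetical placeholders; substitute your provider's rates.
INPUT_PRICE_PER_M = 3.00    # USD per 1M input tokens (assumed)
OUTPUT_PRICE_PER_M = 15.00  # USD per 1M output tokens (assumed, ~5x input)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# An agent that resends 20K tokens of history and produces a 2,000-token reply...
print(f"${call_cost(20_000, 2_000):.4f} per turn")
# ...versus one that trims history and keeps the response short.
print(f"${call_cost(4_000, 200):.4f} per turn")
```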

Enterprise teams often discover these realities the hard way. A financial analysis agent that seemed affordable in testing becomes prohibitively expensive when processing hundreds of earnings reports per day. A document review agent that worked well on five-page contracts begins producing incoherent summaries on fifty-page agreements. The root cause in both cases is a failure to account for the operational characteristics of tokens: their cost, their latency, and the cognitive load they place on the model.

The Context Window: A Hard Boundary

The context window is the total number of tokens a model can process in a single call, encompassing both the input you provide and the output the model generates. Modern frontier models offer context windows ranging from 128,000 to over one million tokens. These numbers sound enormous, and they have created a dangerous assumption: that you can simply “throw everything at the model” and let it figure out what matters.

This assumption fails in practice for several reasons. The first is economic. A context window of 200,000 tokens filled on every call at current API pricing can cost dollars per interaction, not cents. An enterprise customer support agent handling thousands of conversations per day will generate bills that make the business case collapse. The second reason is performance degradation. Research consistently demonstrates that model accuracy decreases as context length increases, particularly for information located in the middle of the input. This is sometimes called the “lost in the middle” effect: models attend most strongly to information at the beginning and end of the context, while information buried in the center receives less attention. An agent that packs a 100-page policy manual into its context window may actually perform worse on questions about specific clauses than one given only the relevant sections.

The third reason is architectural. The context window is not just a size limit; it is the model’s entire working memory for a given interaction. Everything the agent needs to know, reason about, and act on must fit within this window. The system prompt that defines the agent’s role and constraints consumes tokens. The conversation history consumes tokens. Any retrieved documents or data consume tokens. The tools available to the agent and their descriptions consume tokens. And the model’s response consumes tokens from the same budget. When you add it all up, the “enormous” context window starts to feel considerably smaller. A system prompt of 2,000 tokens, a tool registry of 3,000 tokens, a conversation history of 10,000 tokens, and retrieved context of 20,000 tokens adds up to 35,000 tokens of overhead before the model has read a single new user message, leaving roughly 93,000 tokens of a 128K window for the new input and the response combined.
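A minimal budget calculation using the same illustrative figures makes the overhead visible. The reservation for output is an assumption; set it to whatever response length your agent actually needs.

```python
# Token budget accounting for a single call, using the figures from the text.
CONTEXT_WINDOW = 128_000

overhead = {
    "system_prompt": 2_000,
    "tool_registry": 3_000,
    "conversation_history": 10_000,
    "retrieved_context": 20_000,
}
RESERVED_FOR_OUTPUT = 4_000  # assumed cap on the model's response

used = sum(overhead.values())  # 35,000 tokens before any new user message
available = CONTEXT_WINDOW - used - RESERVED_FOR_OUTPUT
print(f"Overhead: {used:,} tokens; left for new input: {available:,} tokens")
```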

This is why agent memory architecture matters. You cannot store everything in the context window, which means you need strategies for deciding what goes in and what stays out. Retrieval-augmented generation, conversation summarization, sliding window history, and hierarchical memory systems all exist because the context window is a finite, expensive, performance-sensitive resource. Every one of these strategies is a direct response to the operational constraints of the context window.
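As one example of such a strategy, the sketch below keeps the most recent turns verbatim and compresses everything that falls out of the window into a summary. The count_tokens and summarize functions are hypothetical helpers standing in for your tokenizer and a cheap summarization call.

```python
# Minimal sketch of a sliding-window history with summarization overflow.
# count_tokens() and summarize() are hypothetical helpers supplied by the caller.
def build_history(messages, max_tokens, count_tokens, summarize):
    recent, used = [], 0
    for msg in reversed(messages):          # keep the most recent turns verbatim
        cost = count_tokens(msg["content"])
        if used + cost > max_tokens:
            break
        recent.insert(0, msg)
        used += cost
    older = messages[: len(messages) - len(recent)]
    if older:                                # compress everything that fell out of the window
        recent.insert(0, {
            "role": "system",
            "content": "Summary of earlier conversation: " + summarize(older),
        })
    return recent
```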

[Diagram: Token Budget & Context Window. How tokens are allocated within the context window, and why 128K tokens is smaller than it sounds.]

In-Context Learning: Power and Limits

In-context learning is the mechanism by which a language model adapts its behavior based on the information and examples provided in the prompt, without any change to the model’s underlying weights. When you include three examples of correctly formatted JSON in your prompt and the model produces correctly formatted JSON in response, that is in-context learning. When you provide a system prompt that says “You are a claims adjuster specializing in property damage” and the model begins reasoning about deductibles and coverage limits, that is in-context learning. The model has not been retrained. It has not been fine-tuned. It is pattern-matching against the context you have provided.

In-context learning is extraordinarily powerful and is the primary mechanism through which agents acquire task-specific behavior. A well-constructed system prompt with a few carefully chosen examples can turn a general-purpose model into a remarkably capable specialist for a narrow domain. A customer support agent that includes examples of good and bad responses directly in its prompt will produce higher-quality interactions than one with only abstract instructions. A financial analysis agent that includes a sample analysis alongside the data it needs to process will produce output that matches the expected format, terminology, and level of detail.
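In practice, this usually means placing the examples directly in the message list sent to the model. The sketch below shows the pattern; the claim texts and JSON fields are invented for illustration.

```python
# Sketch of in-context learning via few-shot examples in the message list.
# The example contents and schema are illustrative, not a real claims format.
messages = [
    {"role": "system", "content": "You are a claims adjuster specializing in property damage. "
                                  "Respond with a JSON object: {\"coverage\": ..., \"deductible\": ...}."},
    # Few-shot pair: shows the model the expected format and level of detail.
    {"role": "user", "content": "Hail damage to roof, policy HO-3, $1,000 deductible."},
    {"role": "assistant", "content": "{\"coverage\": \"dwelling\", \"deductible\": 1000}"},
    # The actual request follows the examples and inherits their pattern.
    {"role": "user", "content": "Burst pipe flooded the basement, policy HO-5, $2,500 deductible."},
]
# `messages` would then be passed to a chat-completion call; no weights change.
```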

But in-context learning has sharp limits that directly constrain agent design. Every example, every instruction, every piece of context you provide to enable in-context learning consumes tokens from the same finite context window. There is a direct trade-off between the richness of the agent’s instructions and the amount of working data it can process in a single call. A claims processing agent with a 5,000-token system prompt detailing every edge case and exception has 5,000 fewer tokens available for the actual claim data. At some point, adding more instructional context produces diminishing returns or actively harms performance by crowding out the information the model needs to do its job.

Furthermore, in-context learning is volatile. It persists only for the duration of a single API call. When the next request arrives, the model has no memory of what it learned from the previous one. This is fundamentally different from how humans learn and is one of the most common sources of confusion when teams first build agents. The agent that correctly processed a complex edge case on one call will not “remember” that case on the next call unless you explicitly include the relevant information again. Every call starts from zero. This is why agents require persistent memory systems external to the model itself, and why conversation history management is not a nice-to-have feature but a core architectural requirement.
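This is why even the simplest agent loop carries an external store and replays it on every call. In the sketch below, chat_completion is a placeholder for your model client; the model only ever sees what the loop explicitly re-sends.

```python
# Because in-context learning does not persist, continuity must be replayed explicitly.
# `chat_completion` is a placeholder for whatever model client you use.
session_store: dict[str, list[dict]] = {}

def agent_turn(session_id: str, user_message: str, chat_completion) -> str:
    history = session_store.setdefault(session_id, [])
    history.append({"role": "user", "content": user_message})
    reply = chat_completion(history)   # the model "knows" only what is in `history`
    history.append({"role": "assistant", "content": reply})
    return reply
```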

What This Means for Agent Architecture

These three operational realities – token economics, context window constraints, and the nature of in-context learning – converge to define the fundamental design tensions in agent architecture. Every production agent must answer a set of questions that flow directly from these constraints.

How much context does the agent carry per call? More context means better-informed responses but higher cost, higher latency, and potential performance degradation. Less context means cheaper and faster calls but risks the agent lacking information it needs. This is not a one-time configuration decision; it is a continuous optimization problem that changes as the agent’s workload evolves. An agent processing simple order status queries needs minimal context. The same agent handling a complex dispute resolution needs substantially more. Effective agent architectures adapt their context loading strategy based on the nature of each individual request.
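A minimal version of this adaptation is a routing rule that sets a context budget per request. The keyword rule and budget values below are invented placeholders for whatever classification your system actually uses, which might itself be a lightweight model call.

```python
# Sketch of request-dependent context loading; rules and budgets are illustrative.
def context_budget(request: str) -> dict:
    text = request.lower()
    if any(k in text for k in ("order status", "tracking")):
        return {"history_tokens": 1_000, "retrieved_docs": 0}   # simple lookup
    if any(k in text for k in ("dispute", "refund", "complaint")):
        return {"history_tokens": 8_000, "retrieved_docs": 5}   # needs the full picture
    return {"history_tokens": 4_000, "retrieved_docs": 2}       # default middle ground
```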

How does the agent manage its memory across interactions? Since in-context learning resets with every call, any continuity must be engineered explicitly. This means choosing which parts of a conversation to preserve, which to summarize, and which to discard. It means building retrieval systems that can surface relevant prior interactions without flooding the context window. And it means accepting that the agent’s “memory” is an approximation, not a perfect record, and designing for the errors that approximation introduces.

How does the agent balance instruction richness against working capacity? A detailed system prompt with extensive examples produces more reliable behavior but leaves less room for the actual data the agent needs to process. A minimal system prompt preserves context capacity but may lead to inconsistent or incorrect behavior. The right balance depends on the complexity of the task, the variability of the inputs, and the consequences of errors. High-stakes tasks like financial compliance reviews warrant spending more tokens on instructions. Routine tasks like data formatting can operate with leaner prompts.

These are not abstract concerns. They are the engineering trade-offs that determine whether an agent costs five cents per interaction or fifty, whether it responds in two seconds or twenty, and whether it produces reliable results or intermittent failures that erode user trust.

The Compound Effect in Multi-Agent Systems

These constraints compound when multiple agents collaborate. In a multi-agent system where agents delegate tasks to one another, every handoff involves serializing context into tokens, transmitting it, and having the receiving agent parse it within its own context window. A three-agent pipeline where each agent passes its full context to the next does not just triple the token cost; each agent re-reads everything its predecessors read plus their outputs, so total tokens consumed grow far faster than the chain length as context accumulates at every hop. It also introduces information loss at every boundary, because each agent must decide what to include from the previous agent’s output and what to discard.
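A back-of-the-envelope example with illustrative token counts shows how quickly the accumulation adds up.

```python
# Worked example of context accumulation in a naive three-agent chain,
# where each agent forwards everything it received plus its own output.
base_context, out1, out2 = 10_000, 2_000, 2_000   # illustrative token counts

input_agent1 = base_context
input_agent2 = input_agent1 + out1
input_agent3 = input_agent2 + out2
total_input = input_agent1 + input_agent2 + input_agent3
print(total_input)   # 34,000 input tokens read in total, versus 10,000 for one agent
```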

This is why thoughtful multi-agent architectures invest heavily in context compression and interface design. The contracts between agents – what information is passed, in what format, at what level of detail – are as important as the prompts that drive each individual agent. An orchestration layer that passes a 50,000-token document from a retrieval agent to an analysis agent to a summarization agent is doing something fundamentally different from one that has each agent produce a structured, compressed output that the next agent can consume efficiently. The second approach costs less, runs faster, and often produces better results because each agent receives focused, relevant input rather than a sprawling context dump.
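One way to express that contract is a small structured handoff object rather than raw forwarded context. The field names below are illustrative; the point is that each agent receives a compressed, typed summary plus pointers back to the full sources.

```python
# Sketch of a compressed, structured handoff between agents, as opposed to
# forwarding the full upstream context. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class RetrievalHandoff:
    query: str               # what the retrieval agent was asked to find
    key_passages: list[str]  # only the passages judged relevant, not whole documents
    source_ids: list[str]    # pointers back to full sources if the next agent needs them
    token_estimate: int      # lets the orchestrator enforce a budget before the next call

def to_prompt(handoff: RetrievalHandoff) -> str:
    passages = "\n".join(f"- {p}" for p in handoff.key_passages)
    return f"Task context: {handoff.query}\nRelevant passages:\n{passages}"
```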

Key Takeaways

Tokens are the atomic unit of cost, latency, and attention in every LLM-powered system; they are not merely a billing abstraction but the fundamental constraint that shapes how agents consume information, how quickly they respond, and how accurately they reason.

The context window is a hard boundary on working memory that must accommodate system instructions, conversation history, retrieved data, tool definitions, and model output simultaneously. “Just throw everything at the model” is never a viable strategy, and agent memory architecture – retrieval, summarization, history management – is a core engineering discipline, not an optimization you add later.

In-context learning gives agents their task-specific behavior but resets completely with every API call, making external memory systems and deliberate context management essential rather than optional.

Every agent design decision, from prompt length to multi-agent handoff protocols, is ultimately a trade-off negotiation between these three forces. The teams that build reliable production agents are the ones that understand the trade-offs explicitly rather than discovering them through escalating costs and degrading performance.