Module 0 · The AI Substrate

Model Tiers & Cost-Capability Spectrum

The LLM market has stratified into distinct tiers, and they are not rungs on a single quality ladder: each tier is a fundamentally different operational profile. The real engineering happens in the architecture, in deciding which tier handles which step of a workflow.

Three Tiers

Frontier Models (Claude Opus, GPT-4o, Gemini Ultra)
Input cost: $10–30 / M tokens
Latency: 2–15 seconds
Best suited for: complex reasoning, legal review, financial analysis, multi-step synthesis

Mid-Tier Models (Claude Sonnet, GPT-4o mini, Gemini Pro)
Input cost: $1–5 / M tokens
Latency: 1–5 seconds
Best suited for: extraction, summarization, classification, data transformation

Lightweight Models (Claude Haiku, Gemini Flash)
Input cost: < $1 / M tokens
Latency: < 1 second
Best suited for: intent detection, routing, entity extraction, template generation
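
To make the spread concrete, here is a small cost estimator built from the table above. The per-million-token prices are illustrative midpoints of the quoted ranges, chosen only for this sketch; real prices vary by provider and change over time.

```python
# Illustrative cost estimator for the tier table above.
PRICE_PER_M_INPUT_TOKENS = {  # USD per million input tokens
    "frontier": 20.0,         # quoted range: $10-30 / M
    "mid": 3.0,               # quoted range: $1-5 / M
    "lightweight": 0.8,       # quoted range: < $1 / M
}

def estimate_input_cost(tier: str, input_tokens: int) -> float:
    """Dollar cost of the input side of one request."""
    return PRICE_PER_M_INPUT_TOKENS[tier] * input_tokens / 1_000_000

# A 2,000-token complaint: ~$0.04 of input on a frontier model,
# ~$0.0016 on a lightweight one, a 25x spread before output tokens.
print(estimate_input_cost("frontier", 2_000))     # 0.04
print(estimate_input_cost("lightweight", 2_000))  # 0.0016
```
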
Architecture Pattern
Multi-Model Routing

A single customer complaint workflow that routes each step to the minimum-capable model, reserving the frontier tier for the one step that needs it.

1. Classify complaint type (Haiku, $0.001): route to the billing, technical, or escalation queue.
2. Extract key details (Sonnet, $0.02): pull order numbers, dates, and amounts from the complaint.
3. Query knowledge base (Sonnet, $0.03): retrieve policies, prior cases, and resolutions.
4. Draft resolution (Opus, $0.15): synthesize policy, history, and customer context into a personalized response.
All Frontier: $0.60. Every step uses Opus, paying the premium for routing and extraction that don't need it.

Multi-Model Routing: $0.20 ($0.001 + $0.02 + $0.03 + $0.15). Each step uses the minimum-capable model: same quality where it matters, 3× cheaper overall. A minimal code sketch of this pipeline follows.
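
Here is that sketch. The call_model helper is a hypothetical wrapper around whatever provider SDK you use, and the model names and prompts are placeholders, not a fixed recipe.

```python
def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real provider call (swap in your actual SDK client).
    return f"[{model} response]"

def handle_complaint(complaint: str) -> str:
    # Step 1: lightweight model classifies and routes (~$0.001).
    queue = call_model("haiku", "Classify as billing, technical, or escalation:\n" + complaint)

    # Step 2: mid-tier model pulls structured details (~$0.02).
    details = call_model("sonnet", "Extract order numbers, dates, and amounts as JSON:\n" + complaint)

    # Step 3: mid-tier model assembles knowledge-base context (~$0.03).
    context = call_model("sonnet", f"Queue {queue}; details {details}. List relevant policies and prior cases.")

    # Step 4: only the customer-facing synthesis pays frontier prices (~$0.15).
    return call_model("opus", f"Draft a personalized resolution.\nQueue: {queue}\nDetails: {details}\nContext: {context}")
```
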
Caching Strategies
The most cost-effective AI request is the one you never make. Caching can reduce effective costs by 50–75%.

Exact-Match Cache

If you've seen this exact input before, return the previous output. Works well for classification and extraction with recurring inputs.

Impact: eliminates redundant calls.
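
A minimal sketch of an exact-match cache, assuming deterministic generation (temperature 0) so that replaying a stored output is safe; call_model is the same hypothetical provider stub as in the routing sketch above.

```python
import hashlib

def call_model(model: str, prompt: str) -> str:
    return f"[{model} response]"  # stand-in for a real provider call

_cache: dict[str, str] = {}

def cached_call(model: str, prompt: str) -> str:
    # Key on model + prompt so identical requests share one entry.
    key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(model, prompt)  # miss: one real (paid) call
    return _cache[key]

cached_call("haiku", "Classify: my bill is wrong")  # miss, hits the model
cached_call("haiku", "Classify: my bill is wrong")  # hit, costs nothing
```
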

Prompt Cache

The provider caches the processed prefix of your prompt. Subsequent requests that share the same system prompt and tool definitions get up to a 90% reduction in input cost for the cached tokens.

Impact: up to 90% input savings.
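
Anthropic's API, for example, exposes this through a cache_control marker on prompt blocks; other providers offer similar features under different names. A sketch, with SDK details that may differ across versions:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "You are a support agent. <policies, examples, tools...>"

def answer(question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        system=[{
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks the prefix up to this block for caching; later requests
            # that share it pay a fraction of the normal input price.
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```
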

Semantic Cache

Return cached responses for semantically similar inputs. "Reset my password" and "change my login credentials" hit the same cache entry.

Impact: 30–60% fewer model calls.
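
A minimal sketch of a semantic cache. It assumes an embed function (for example, a provider's embeddings endpoint) that maps text to a vector; the 0.92 similarity threshold is a placeholder to tune per workload, and call_model is again a hypothetical provider stub.

```python
import math

def call_model(model: str, prompt: str) -> str:
    return f"[{model} response]"  # stand-in for a real provider call

_store: list[tuple[list[float], str]] = []  # (query embedding, cached answer)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def semantic_cached_call(model: str, query: str, embed, threshold: float = 0.92) -> str:
    vec = embed(query)
    for stored_vec, answer in _store:
        if cosine(vec, stored_vec) >= threshold:
            return answer                    # near-duplicate query: reuse answer
    answer = call_model(model, query)        # miss: one real (paid) call
    _store.append((vec, answer))
    return answer

# "Reset my password" and "change my login credentials" embed close together,
# so the second call returns the first call's cached answer.
```

Set the threshold too low and unrelated queries collide; set it too high and nothing ever hits.
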
Combined impact on 100K daily queries: without caching, 100% of baseline cost; with all three strategies layered, roughly 25%.