Three Tiers
Frontier Models: Claude Opus, GPT-4o, Gemini Ultra
Input cost: $10–30 / M tokens
Best suited for: complex reasoning, legal review, financial analysis, multi-step synthesis

Mid-Tier Models: Claude Sonnet, GPT-4o mini, Gemini Pro
Input cost: $1–5 / M tokens
Best suited for: extraction, summarization, classification, data transformation

Lightweight Models: Claude Haiku, Gemini Flash
Input cost: < $1 / M tokens
Best suited for: intent detection, routing, entity extraction, template generation
Architecture Pattern: Multi-Model Routing

A single customer complaint workflow that routes each step to the cheapest capable model, using a frontier model only where it's needed.
1. Classify complaint type (Haiku, $0.001): route to the billing, technical, or escalation queue.
2. Extract key details (Sonnet, $0.02): pull order numbers, dates, and amounts from the complaint.
3. Query knowledge base (Sonnet, $0.03): retrieve policies, prior cases, and resolutions.
4. Draft resolution (Opus, $0.15): synthesize policy, history, and customer context into a personalized response.
All Frontier: $0.60 total. Every step uses Opus, paying the frontier premium even for routing and extraction steps that don't need it.

Multi-Model Routing: $0.20 total. Each step uses the cheapest capable model. Same quality where it matters, 3× cheaper overall.
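In code, the pattern is just a dispatcher that pins each step to a tier. The sketch below assumes a hypothetical `call_model` helper wrapping whatever provider SDK you use; the tier constants are placeholders, not real model identifiers.

```python
# Minimal sketch of the multi-model routing pattern.
# `call_model` is a hypothetical wrapper around your provider's SDK;
# the tier names below are placeholders, not exact model IDs.

LIGHTWEIGHT = "haiku"   # intent detection, routing
MID_TIER = "sonnet"     # extraction, retrieval-oriented steps
FRONTIER = "opus"       # final synthesis only

def call_model(model: str, prompt: str) -> str:
    """Stand-in for a real SDK call; wire this to your provider's client."""
    raise NotImplementedError

def handle_complaint(complaint: str) -> str:
    # Step 1: cheap classification decides the queue.
    queue = call_model(LIGHTWEIGHT,
        f"Classify this complaint as billing, technical, or escalation:\n{complaint}")

    # Step 2: mid-tier extraction of structured fields.
    details = call_model(MID_TIER,
        f"Extract order numbers, dates, and amounts as JSON:\n{complaint}")

    # Step 3: mid-tier summarization of relevant policy and case context
    # (the retrieval itself would be a vector or keyword search, omitted here).
    context = call_model(MID_TIER,
        f"Summarize policies relevant to a {queue} complaint with: {details}")

    # Step 4: the frontier model runs only on the synthesis step that needs it.
    return call_model(FRONTIER,
        f"Draft a personalized resolution.\nQueue: {queue}\n"
        f"Details: {details}\nContext: {context}")
```

The design point is that only step 4 pays frontier rates; everything upstream is cheap enough to run on every request.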
Caching Strategies
The most cost-effective AI request is the one you never make. Caching can reduce effective costs by 50–75%.
Exact-Match Cache
If you've seen this exact input before, return the previous output. Works well for classification and extraction workloads with recurring inputs.
Impact: eliminates redundant calls.
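A minimal exact-match cache can key on a hash of the model name plus the normalized prompt. The sketch below uses an in-process dict and the hypothetical `call_model` helper from the routing sketch; a production deployment would more likely back this with Redis or a similar shared store.

```python
import hashlib

_cache: dict[str, str] = {}  # in-process stand-in for a shared cache

def exact_match_call(model: str, prompt: str) -> str:
    # Key on model + trimmed prompt so trivial whitespace changes still hit.
    key = hashlib.sha256(f"{model}:{prompt.strip()}".encode()).hexdigest()
    if key in _cache:
        return _cache[key]               # hit: no API call, zero marginal cost
    result = call_model(model, prompt)   # hypothetical helper defined earlier
    _cache[key] = result
    return result
```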
Prompt Cache
The provider caches processed prompt prefixes, so subsequent requests sharing the same system prompt and tool definitions get up to a 90% reduction in input cost.
Impact: up to 90% input savings.
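Prompt caching is enabled per request on the provider side. As one concrete example, the Anthropic SDK marks a cacheable prefix with a `cache_control` block, as sketched below; the exact field shape, minimum cacheable length, and discount vary by provider, so treat the details as illustrative.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # the shared system prompt and tool definitions

def answer(user_input: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # placeholder: substitute a current model id
        max_tokens=1024,
        # Mark the prefix cacheable: later requests sharing this exact
        # system prompt reuse the processed prefix at a reduced input rate.
        system=[{
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": user_input}],
    )
    return response.content[0].text
```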
Semantic Cache
Return cached responses for semantically similar inputs: "Reset my password" and "change my login credentials" hit the same cache entry.
Impact: 30–60% fewer model calls.
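A semantic cache compares embeddings rather than raw strings. The sketch below assumes a hypothetical `embed` helper that returns unit-normalized vectors, plus a linear scan that a real deployment would replace with a vector index; the similarity threshold is an illustrative value to tune against real traffic.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.92  # illustrative; tune on real traffic

def embed(text: str) -> np.ndarray:
    """Hypothetical helper: call your embedding model, return a unit vector."""
    raise NotImplementedError

_entries: list[tuple[np.ndarray, str]] = []  # (embedding, cached response)

def semantic_cached_call(model: str, prompt: str) -> str:
    vec = embed(prompt)
    for cached_vec, cached_response in _entries:
        # Dot product equals cosine similarity for unit-normalized vectors.
        if float(vec @ cached_vec) >= SIMILARITY_THRESHOLD:
            return cached_response          # near-duplicate intent: reuse answer
    result = call_model(model, prompt)      # hypothetical helper from earlier
    _entries.append((vec, result))
    return result
```

The threshold is the key design choice: set it too low and distinct questions share an answer; too high and the cache rarely hits.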
Combined impact on 100K daily queries
Without caching: 100% of baseline cost
With all three strategies: ~25% of baseline cost
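The ~25% figure is plausible from the individual numbers above. A quick back-of-envelope check, with hit rates that are assumptions rather than measurements:

```python
# Illustrative arithmetic only; every rate below is an assumption.
calls_avoided = 0.45          # exact-match + semantic caches (within the 30-60% range)
input_share = 0.70            # assumed fraction of remaining spend that is input tokens
prompt_cache_discount = 0.90  # up to 90% off cached input tokens

remaining = 1.0 - calls_avoided
effective = remaining * (input_share * (1 - prompt_cache_discount)
                         + (1 - input_share))
print(f"effective cost: {effective:.0%} of baseline")  # ~20% with these assumptions
```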