The first time you open a Claude Code or API usage dashboard and see a column labeled cached input tokens, the instinct is to assume it is a billing artifact — not real optimization. How do you cache something that needs full context on every turn?
That reaction is common even among people who have spent years in NLP. Prompt caching is not magic. It is a direct consequence of how decoder-only transformers compute attention during inference — and for almost everyone building multi-turn agents, it is an optimization you should enable aggressively.
This guide is a decision framework: when to cache what in your LLM application, with the security tradeoffs at each step. The technical framing draws on Andre Kreidemann's excellent write-up at kreidemann.com/blog/prompt-caching, extended here with ExplainX context on agent economics, provider mechanics, and production patterns.
TL;DR: What to Cache and When
| Prompt section | Cache? | Why |
|---|---|---|
| Tool definitions | Yes, always | Stable across users; no sensitive data to leak via shared prefix |
| System prompt | Yes, always | Same for all users → shared cache is desired; user-specific → prefixes diverge anyway |
| RAG / document context | Yes, if stable per session | Large prefixes that repeat across turns in the same conversation |
| Messages array (no PII) | Yes | Internal tools, dev environments, team-only apps |
| Messages array (sensitive PII) | Yes, with exposure check | Safe unless all three attack preconditions are true (see below) |
| Messages array (high-risk multi-tenant) | Yes, with cache salting | Inject server-side tenant ID at first message token |
Bottom line: cache system prompts and tools without hesitation. Cache conversation history for agent loops — that is where the money is. Salt messages only in the minority of setups where timing side channels are practically exploitable.
How LLM Inference Creates a Caching Opportunity
Every API request passes through two inference stages:
- Prefill — the model processes all input tokens at once, building internal representations.
- Decode — the model generates output tokens one at a time, each attending to everything before it.
For long prompts, prefill dominates total compute. A 100k-token prompt means 100k tokens of matrix work before the model produces a single output token — even though each individual decode step is more expensive per token, the aggregate prefill cost wins on large contexts.
During prefill, the model computes key-value (KV) pairs for each token. Each KV pair encodes what that token "knows" about every earlier token — the mechanism behind transformer attention.
Here is the insight that makes caching possible: in decoder-only models (Claude, GPT, and most modern LLMs), attention only flows forward. Token 50 attends to tokens 1–49, but tokens 1–49 never need to know about token 50. If you already computed KV pairs for tokens 1–49, and a new request starts with those same 49 tokens, the cached pairs are still valid. Prompt caching skips prefill for the matching prefix and only computes the suffix.
Without caching: [████████████████████] prefill every request
With caching: [████████ cached ████][██ new suffix ██] prefill only suffix
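The forward-only property can be checked on a toy example. The sketch below is a single-head causal attention in plain Python, with the token vectors reused directly as queries, keys, and values. That is a simplification: real models apply learned projections across many layers. The function name `causal_attention` and the vectors are invented for illustration.

```python
import math


def causal_attention(queries, keys, values):
    """Toy single-head causal attention over lists of float vectors."""
    outputs = []
    for i, query in enumerate(queries):
        # Causal mask: position i only sees positions 0..i.
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(len(query))
            for j in range(i + 1)
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][d] for j, w in enumerate(weights)) / total
            for d in range(len(values[0]))
        ])
    return outputs


# Hypothetical 2-dimensional "token" vectors.
prefix = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
extended = prefix + [[0.9, -0.3]]

out_prefix = causal_attention(prefix, prefix, prefix)
out_extended = causal_attention(extended, extended, extended)

# Appending a token leaves every earlier position's output unchanged,
# which is why work already done for a prefix can be reused.
prefix_unchanged = out_extended[: len(prefix)] == out_prefix
```

Here `prefix_unchanged` is `True`: the first three positions perform exactly the same arithmetic in both calls, so only the fourth position is new work.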
Prefix-Based Matching
Caching is prefix-based. The provider checks your prompt from token one forward. As long as tokens match sequentially, you get cache hits. The moment tokens diverge, matching stops.
Order matters. Identical content rearranged is a complete cache miss. Structure prompts with stable blocks first and dynamic blocks last:
[tool definitions: stable] → [system prompt: stable] → [conversation history: growing prefix] → [new user message: changes each turn]
This structure is why multi-turn conversations and agentic workflows are the highest-ROI caching use case. On turn N, everything except the latest message is identical to turn N−1.
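A minimal sketch of the matching rule, assuming each string stands in for a run of tokens. Real providers match at token level and may only cache at certain block sizes, so treat `shared_prefix_length` as an illustration, not as a provider's algorithm.

```python
def shared_prefix_length(previous, current):
    """Count how many leading items two sequences have in common."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


# Each string stands in for a run of tokens (an illustrative simplification).
turn_1 = ["tools", "system", "user: hi"]
turn_2 = turn_1 + ["assistant: hello", "user: next question"]
reordered = ["system", "tools", "user: hi"]

hit_blocks = shared_prefix_length(turn_1, turn_2)      # 3: the whole previous turn is reused
miss_blocks = shared_prefix_length(turn_1, reordered)  # 0: same content, different order
```

The second turn reuses everything from the first, while the reordered prompt shares nothing even though it contains the same blocks.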
Why Agents Benefit Disproportionately
A typical agent API request stacks:
- Tool definitions (often 5k–20k tokens)
- System prompt (500–3k tokens)
- Full message history including prior tool calls
- The new user message or tool result at the end
In an agent loop with 10–15 internal tool calls before responding to the user, you send 10–15 requests where almost the entire prompt is a cache hit. Only the latest message suffix is new.
| Scenario | Stable prefix | Turns | Approx. prefill avoided |
|---|---|---|---|
| Simple chatbot | 8k tokens | 5 turns | ~40k tokens of redundant prefill |
| Coding agent (Claude Code class) | 40k tokens | 12 tool calls | ~480k tokens |
| RAG agent with cached docs | 60k tokens | 8 turns | ~480k tokens |
On Anthropic's April 2026 pricing, Claude Sonnet 4.6 charges $3 / 1M input but $0.30 / 1M cached reads — a 90% discount. The first request writes the cache; subsequent hits read at the lower rate. For a deeper dive on token economics and complementary optimizations, see our Caveman token compression guide.
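To put rough numbers on this, the sketch below prices the stable prefix of an agent loop with and without caching. It uses the $3 and $0.30 per-million rates quoted above. The 1.25x cache-write surcharge is an assumption, so check your provider's current pricing. It also ignores the growing suffix and any TTL expiry, and `agent_loop_prefix_cost` is a hypothetical helper name.

```python
def agent_loop_prefix_cost(prefix_tokens, requests, input_price, read_price,
                           write_multiplier=1.25):
    """Return (uncached, cached) dollar cost of the stable prefix over a loop.

    Prices are dollars per 1M tokens. write_multiplier is an assumed
    surcharge on the first request that writes the cache.
    """
    per_token = 1 / 1_000_000
    uncached = prefix_tokens * requests * input_price * per_token
    first_write = prefix_tokens * input_price * write_multiplier * per_token
    later_reads = prefix_tokens * (requests - 1) * read_price * per_token
    return uncached, first_write + later_reads


# Coding-agent row from the table: 40k-token prefix, 12 requests.
uncached_cost, cached_cost = agent_loop_prefix_cost(40_000, 12, 3.00, 0.30)
# uncached_cost is about 1.44 dollars; cached_cost is about 0.28 dollars
```

Under these assumptions the prefix costs roughly a fifth as much with caching, before counting the latency benefit.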
Provider Mechanics: Anthropic vs OpenAI
Both major providers implement prefix caching, but the control surface differs:
| Aspect | Anthropic (Claude) | OpenAI |
|---|---|---|
| Control | Explicit cache_control breakpoints on content blocks | Automatic on eligible requests |
| Minimum size | Varies by model; typically 1,024+ tokens per cache block | Automatic caching above 1,024 tokens |
| Pricing | Separate cache write vs cache read rates | Cached input at reduced rate on supported models |
| TTL | Configurable (e.g. 5-minute vs 1-hour cache) | Provider-managed |
From an application architecture perspective, the rule is identical: stable content at the front, changing content at the end.
Anthropic example with explicit cache breakpoints:
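import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder values so the snippet runs; supply your own in practice.
conversation_history = "...earlier turns of the conversation..."
new_user_message = "...the latest user message..."

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a coding assistant...",
            # Breakpoint 1: cache the system prompt.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": conversation_history,
                    # Breakpoint 2: cache everything up to the end of the history.
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    # Dynamic suffix: no cache marker.
                    "type": "text",
                    "text": new_user_message,
                },
            ],
        }
    ],
)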
Place cache_control on the last block of content you want cached — the breakpoint marks "cache everything up to and including this block." Dynamic content goes after the breakpoint without cache markers.
For Claude-specific prompt structure patterns that maximize cache hits, see the master prompt engineering guide.
The Decision Framework
Layer 1: System Prompt and Tool Definitions — Cache Without Thinking
This is the simplest case.
If system prompts and tools are identical for all users, shared caching is exactly what you want. Every user hitting the same cached system prompt saves everyone money and latency. There is nothing sensitive to leak — the content is public to your application anyway.
If they contain user-specific data, prefixes diverge across users automatically. Different tokens mean different prefixes mean no cross-user cache hits. The timing attack requires a matching prefix; user-specific system prompts will not match another user's probes.
Verdict: cache system prompts and tool definitions. Always.
Layer 2: The Messages Array — Where Savings and Risk Live
The messages array is where multi-turn and agentic cost savings concentrate. It is also where user data lives — so this is the only layer where caching creates a potential security tradeoff.
Step 1: Does your messages array contain sensitive user data?
- No (internal tools, dev environments, team-only apps) → cache normally. Done.
- Yes (most user-facing applications) → proceed to Step 2.
Step 2: Is the timing side channel actually exploitable in your setup?
For the attack to work, all three conditions must be true simultaneously:
| Precondition | Safe if... |
|---|---|
| User can influence message content | Your server constructs prompts entirely; users cannot send probe prefixes |
| Rate limits are weak | Tight rate limits make the many requests needed to average out network noise impractical |
| User can observe timing | Response times are buried in multi-step agent latency (30s+ workflows) rather than mapping 1:1 to a single observable API call |
If any precondition is false, cache the messages array normally. The attack surface does not exist in practice for most production applications.
If the messages contain sensitive data and all three preconditions hold (user-controlled prompts, loose rate limits, observable latency), use cache salting (below).
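The two steps above reduce to a small decision function. This is a sketch of the framework as described here, not a security control. The function name and return strings are made up for illustration.

```python
def messages_cache_policy(contains_sensitive_data, user_controls_prompt,
                          weak_rate_limits, timing_observable):
    """Return 'cache' or 'cache_with_salt' for the messages array."""
    # Step 1: nothing sensitive in the messages means nothing to leak.
    if not contains_sensitive_data:
        return "cache"
    # Step 2: the timing attack needs all three preconditions at once.
    if user_controls_prompt and weak_rate_limits and timing_observable:
        return "cache_with_salt"
    return "cache"


internal_tool = messages_cache_policy(False, True, True, True)
support_bot = messages_cache_policy(True, False, True, True)  # server builds prompts
exposed_multi_tenant = messages_cache_policy(True, True, True, True)
```

Only the last case returns `"cache_with_salt"`. Breaking any single precondition is enough to fall back to plain caching.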
The Security Model: Timing Side Channels
A cached response returns faster than an uncached one. That latency difference is measurable. An attacker who can send requests and observe time-to-first-token can probe whether a particular prefix exists in the cache.
Concrete example: In a multi-tenant app, User A submits a form with name, email, and phone. That conversation gets cached in the messages array. An attacker crafts prompts trying to reconstruct those values as prefixes:
- "name: John, email: john@..." → fast response → cache hit
- "name: Jane, email: jane@..." → slow response → cache miss
Repeat enough times and you can reconstruct what User A submitted.
Cross-Account vs Within-Account
Cross-account cache sharing is largely a solved problem. Research in 2025 demonstrated prompt reconstruction via KV-cache sharing (Wu et al., NDSS 2025), timing side channels on production services including Claude and Azure OpenAI (Song et al., IEEE 2025), and provider-wide auditing that led at least five providers to change implementations (Gu et al., ICML 2025). Major providers now scope caches to the account level.
Within your account — inside the application you are building — the theoretical risk remains if all three preconditions above are true.
For most applications, this sounds scarier than it is. Customer support bots with server-side prompt construction, agents with multi-second latency, and apps with standard API rate limits rarely satisfy all three conditions.
DIY Cache Salting for Multi-Tenant Apps
When you do need isolation, the fix is straightforward: inject a unique user identifier at the very beginning of the messages array, server-side, outside the user's control.
def build_messages(user_id: str, history: list, new_message: str) -> list:
    # Server-side salt: the user never controls this first message.
    tenant_salt = {
        "role": "user",
        "content": f"[session:{user_id}]",
    }
    return [tenant_salt, *history, {"role": "user", "content": new_message}]
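Because caching is prefix-based, User A's messages start with [session:abc-123] and User B's with [session:def-456]. The two prompts diverge at the tenant ID in the first message, so a probe from one tenant can never match another tenant's cached conversation. The timing signal on the sensitive portion disappears.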
Why salt messages, not the system prompt?
Salting at the start of messages preserves shared caching on everything before it — tool definitions and system prompts still cache across all users. You only isolate the portion that contains sensitive data.
You lose cross-user cache hits on the messages portion, but in most multi-turn conversations messages are user-specific anyway — you were not going to get those hits regardless.
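A small sketch of what the salt changes, treating each block or message as one unit of the prefix. `prompt_blocks`, `shared_block_count`, and the placeholder strings are illustrative assumptions. Only `build_messages` mirrors the function shown earlier.

```python
def build_messages(user_id, history, new_message):
    """Prepend a server-side salt so each tenant's messages start differently."""
    tenant_salt = {"role": "user", "content": f"[session:{user_id}]"}
    return [tenant_salt, *history, {"role": "user", "content": new_message}]


def prompt_blocks(messages):
    """Flatten a request into ordered blocks: shared prefix first, then messages."""
    shared = ["<tool definitions>", "<system prompt>"]  # placeholders
    return shared + [message["content"] for message in messages]


def shared_block_count(first, second):
    """Count leading blocks that two prompts have in common."""
    count = 0
    for a, b in zip(first, second):
        if a != b:
            break
        count += 1
    return count


history = [{"role": "user", "content": "name: John"}]
question = {"role": "user", "content": "What plan am I on?"}

# Without a salt, a probe that guesses the victim's content matches all 4 blocks.
victim_plain = prompt_blocks([*history, question])
probe_plain = prompt_blocks([*history, question])
unsalted_overlap = shared_block_count(victim_plain, probe_plain)  # 4

# With a salt, the match stops after the 2 shared blocks (tools and system prompt).
tenant_a = prompt_blocks(build_messages("abc-123", history, question["content"]))
tenant_b = prompt_blocks(build_messages("def-456", history, question["content"]))
salted_overlap = shared_block_count(tenant_a, tenant_b)  # 2
```

The tools and system prompt still match across tenants, so the shared cache on those blocks survives. Everything after the salt is isolated.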
Model visibility: The model sees the salt as the first message. Either instruct it to ignore metadata tokens in your system prompt, or make the salt useful — pass personalization context the model can actually use:
# session_id is supplied by your server, never by the user.
tenant_salt = {
    "role": "user",
    "content": f"[User context: plan=pro, locale=en-US, session={session_id}]",
}
Structuring Prompts for Maximum Cache Hits
Beyond security, prompt architecture determines how much you save:
Do
- Put static blocks first: tools → system → docs → history → new input
- Use consistent formatting across requests (whitespace changes break prefixes)
- Mark cache breakpoints on the last stable block before dynamic content
- Keep system prompts stable across sessions; inject user context in messages instead
Avoid
- Timestamps or random IDs in the system prompt (breaks cache for everyone)
- Reordering message history between turns
- Embedding dynamic content before static tool definitions
- Putting user-specific data in the system prompt when messages would work (loses cross-user system cache without gaining security)
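Two of these rules are easy to enforce in code. The sketch below assumes structured data is embedded in the prompt as JSON text and that history is stored exactly as it was sent. `stable_json` and `build_prompt_blocks` are hypothetical helpers, and the block list is a simplified stand-in for a real request.

```python
import json


def stable_json(data):
    """Serialize with sorted keys and fixed separators so equal data gives equal text."""
    return json.dumps(data, sort_keys=True, separators=(",", ":"))


def build_prompt_blocks(tool_specs, system_text, history, now_iso, new_message):
    """Order blocks static-first; the timestamp rides on the newest message only."""
    # history holds earlier messages exactly as they were sent, so it never changes.
    return [stable_json(tool_specs), system_text, *history,
            f"[now: {now_iso}] {new_message}"]


# Same data, different key order: the serialized text is identical.
spec_a = {"name": "search", "limit": 5}
spec_b = {"limit": 5, "name": "search"}
same_text = stable_json(spec_a) == stable_json(spec_b)

history = ["[now: 2026-06-01T10:00:00Z] hello"]
first = build_prompt_blocks([spec_a], "You are helpful.", history,
                            "2026-06-01T10:00:05Z", "next question")
second = build_prompt_blocks([spec_b], "You are helpful.", history,
                             "2026-06-01T10:00:09Z", "next question")
stable_prefix_matches = first[:-1] == second[:-1]
```

Both requests share every block except the last, so a changing timestamp costs only the suffix and leaves the cached prefix intact.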
For system prompt design patterns that pair well with caching, see What Is a System Prompt?.
Economics: When Caching Might Not Be Worth It
This framework focuses on multi-turn agents where caching the messages array is almost always net-positive. There are edge cases where cache writes cost more than skipping cache entirely — for example, a single-request workload with a large prefix that will never repeat within the cache TTL.
Evaluate your specific provider's write vs read pricing:
| Cost type | Typical pattern |
|---|---|
| Cache write | Full or elevated input price on first request |
| Cache read | ~90% discount on Anthropic; discounted on OpenAI, with the rate varying by model |
| Cache miss | Full input price, no write surcharge |
Rule of thumb: if a prefix repeats at least twice within the cache TTL, caching wins. Agent loops with 5+ turns on a stable prefix are clear wins. One-shot batch jobs with unique prefixes every call may not benefit.
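The rule of thumb can be checked against a given price structure. The sketch below expresses write and read prices as multiples of the normal input price and finds the smallest number of requests at which caching is cheaper. The multipliers used (1.25x and 2x for writes, 0.1x for reads) are assumptions for illustration. Substitute your provider's actual rates.

```python
def min_requests_to_win(write_multiplier, read_multiplier):
    """Smallest request count where caching beats paying full price every time.

    Multipliers are relative to the normal input price. With caching, the
    first request pays write_multiplier and each later one pays
    read_multiplier. Without caching, every request pays 1.
    """
    if not 0 <= read_multiplier < 1:
        raise ValueError("read_multiplier must be in [0, 1) for caching to pay off")
    requests = 1
    while write_multiplier + (requests - 1) * read_multiplier >= requests:
        requests += 1
    return requests


short_ttl = min_requests_to_win(1.25, 0.1)  # 2 requests: one repeat is enough
long_ttl = min_requests_to_win(2.0, 0.1)    # 3 requests: the prefix must repeat twice
```

Under these assumed rates, "repeats at least twice" is the exact threshold for the pricier write and a conservative one for the cheaper write.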
Combine caching with other cost levers from our token economics guide: batch APIs (50% off on both vendors), model routing, and output compression.
Putting It Together: Agent Architecture Checklist
- Cache tool definitions and system prompt with explicit breakpoints (Anthropic) or stable prefix structure (OpenAI).
- Grow conversation history at the end — never insert dynamic content mid-prefix.
- Run the three-factor exposure test on your messages array; salt if all preconditions are true.
- Monitor cache hit rates in your provider dashboard — cached input tokens should climb as sessions progress.
- Watch cache TTL — Anthropic's 5-minute vs 1-hour settings matter for long sessions; expired caches re-write at full cost.
- Pair with context engineering — dynamic RAG context belongs after cached blocks, not mixed into system prompts. See context engineering patterns.
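For the monitoring item in the checklist, a per-response hit ratio can be computed from the usage data the API returns. The field names below (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`) follow my understanding of Anthropic's Messages API, and treating `input_tokens` as the uncached remainder is an assumption. Verify both against current documentation. `cache_read_ratio` is a hypothetical helper that takes the usage as a plain dict.

```python
def cache_read_ratio(usage):
    """Fraction of a request's input tokens that were served from cache."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    fresh = usage.get("input_tokens") or 0  # assumed to exclude cached tokens
    total = read + written + fresh
    return read / total if total else 0.0


# Illustrative numbers: first request writes the cache, a later one reads it.
first_turn = {"input_tokens": 200, "cache_creation_input_tokens": 39_800,
              "cache_read_input_tokens": 0}
later_turn = {"input_tokens": 200, "cache_creation_input_tokens": 0,
              "cache_read_input_tokens": 39_800}

first_ratio = cache_read_ratio(first_turn)  # 0.0
later_ratio = cache_read_ratio(later_turn)  # 0.995
```

A ratio that stays near zero across a session usually means something early in the prompt is changing between requests.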
Related Reading
- Caveman skill: token economics and API pricing — complementary cost levers beyond caching
- What Is a System Prompt? Complete Guide — where caching meets system prompt design
- Master Prompt Engineering for Claude — cache-friendly XML structure patterns
- What Are LLM Tokens? — how tokenization affects prefix matching
- Claude Code Pricing Guide 2026 — real-world session costs with caching discounts
Sources
- Andre Kreidemann, Prompt Caching: Just do it (March 2026) — decision framework this guide extends
- Anthropic prompt caching documentation — cache_control breakpoints and pricing
- Anthropic Claude pricing — cache read/write rates (retrieved June 2026)
- OpenAI API pricing — cached input rates (retrieved June 2026)
- Gu, C. et al., "Auditing Prompt Caching in Language Model APIs," ICML 2025. arXiv:2502.07776
- Wu, G. et al., "I Know What You Asked: Prompt Leakage via KV-Cache Sharing in Multi-Tenant LLM Serving" (PROMPTPEEK), NDSS 2025
- Song, J. et al., "The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems," IEEE 2025. arXiv:2409.20002
- Luo, S. et al., "Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference," NDSS 2026. arXiv:2508.09442
Pricing, cache TTLs, and provider implementations are accurate as of June 2026 and may change. Verify current rates on official documentation before production deployment.