What is prompt caching in LLM APIs?

Prompt caching reuses key-value (KV) pairs computed during the prefill phase when a new request shares an identical token prefix with a prior request. Instead of re-processing every input token, the provider skips computation for the matching prefix and only runs prefill on the new suffix. Anthropic exposes this via cache_control breakpoints; OpenAI applies automatic caching on requests over 1,024 tokens. Cached reads typically cost 90% less than full-price input on Claude Sonnet-class models.

Why is prompt caching especially valuable for AI agents?

Agent loops send many sequential API calls where tool definitions, system prompts, and conversation history stay identical; only the latest message or tool result changes. Prefix caching means each turn hits cache on everything before the new suffix. In a 10-step agent loop with a 50k-token stable prefix, the first call writes the cache and the remaining nine read it, avoiding roughly 450k tokens of repeated prefill across the session and cutting both latency and billable input dramatically.

Is prompt caching safe for multi-tenant applications?

Cross-account cache sharing is largely mitigated — major providers scope caches to account level after 2025 security audits. Within a single account, a timing side channel can theoretically probe whether a prefix exists in cache. For most production apps, the attack requires user-controlled prompts, weak rate limits, and observable latency — all three at once. If you handle sensitive user data in a high-risk setup, salt the messages array with a server-side tenant identifier at the first token.

Should I cache the system prompt or the messages array?

Cache both when possible. System prompts and tool definitions are safe to cache aggressively — they are either identical across users or user-specific enough that cross-user cache hits would not occur anyway. The messages array is where multi-turn savings live but also where user data sits. Apply the three-factor exposure test (user prompt control, rate limits, observable timing) before deciding whether to salt messages for tenant isolation.

How do Anthropic and OpenAI prompt caching differ?

Anthropic gives explicit control through cache_control breakpoints on content blocks — you mark which sections to cache and pay separate write vs read rates. OpenAI caches automatically on eligible requests over 1,024 tokens with no manual configuration. Both are prefix-based: stable content at the front of the prompt, dynamic content at the end. The architectural principle is identical even when pricing and TTLs differ.

What is cache salting and when should I use it?

Cache salting injects a unique server-side identifier at the very start of the messages array — for example a tenant UUID the user cannot control. Because caching is prefix-based, this breaks shared cache entries between users at token one while preserving shared caching on system prompts and tool definitions that appear earlier in the request. Use it when your app handles sensitive data, users can craft probe prompts, and response timing is observable.
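
A minimal sketch of cache salting, assuming messages use Anthropic-style content blocks (a list of text-block dicts). The helper names `tenant_salt` and `salt_messages` and the `SERVER_SECRET` value are hypothetical. Deriving the salt with an HMAC instead of inserting a raw tenant UUID is a variation chosen here so users cannot guess the value; no provider prescribes it.

```python
import hashlib
import hmac

# Hypothetical server-side secret; in practice load it from a secret manager.
SERVER_SECRET = b"replace-with-a-real-secret"


def tenant_salt(tenant_id: str, secret: bytes = SERVER_SECRET) -> str:
    """Derive a stable per-tenant salt that the user cannot choose or guess."""
    digest = hmac.new(secret, tenant_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"[tenant-salt:{digest[:32]}]"


def salt_messages(messages: list[dict], tenant_id: str) -> list[dict]:
    """Return a copy of messages with the salt as the very first content block."""
    if not messages:
        return []
    first = messages[0]
    salt_block = {"type": "text", "text": tenant_salt(tenant_id)}
    salted_first = {**first, "content": [salt_block, *first["content"]]}
    return [salted_first, *messages[1:]]


history = [{"role": "user", "content": [{"type": "text", "text": "Summarize my invoice."}]}]
tenant_a = salt_messages(history, "tenant-a")
tenant_b = salt_messages(history, "tenant-b")
```

The salt is deterministic per tenant, so a tenant's own turns still share a prefix and keep hitting cache. Two tenants diverge at the first message token, while the system prompt and tools stay shared.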

Prompt Caching: LLM Cost & Security Decision Framework 2026 | explainx.ai Blog

The first time you open a Claude Code or API usage dashboard and see a column labeled cached input tokens, the instinct is to assume it is a billing artifact — not real optimization. How do you cache something that needs full context on every turn?

That reaction is common even among people who have spent years in NLP. Prompt caching is not magic. It is a direct consequence of how decoder-only transformers compute attention during inference — and for almost everyone building multi-turn agents, it is an optimization you should enable aggressively.

This guide is a decision framework: when to cache what in your LLM application, with the security tradeoffs at each step. The technical framing draws on Andre Kreidemann's excellent write-up at kreidemann.com/blog/prompt-caching, extended here with ExplainX context on agent economics, provider mechanics, and production patterns.

TL;DR: What to Cache and When

Prompt section	Cache?	Why
Tool definitions	Yes, always	Stable across users; no sensitive data to leak via shared prefix
System prompt	Yes, always	Same for all users → shared cache is desired; user-specific → prefixes diverge anyway
RAG / document context	Yes, if stable per session	Large prefixes that repeat across turns in the same conversation
Messages array (no PII)	Yes	Internal tools, dev environments, team-only apps
Messages array (sensitive PII)	Yes, with exposure check	Safe unless all three attack preconditions are true (see below)
Messages array (high-risk multi-tenant)	Yes, with cache salting	Inject server-side tenant ID at first message token

Bottom line: cache system prompts and tools without hesitation. Cache conversation history for agent loops — that is where the money is. Salt messages only in the minority of setups where timing side channels are practically exploitable.
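
The messages-array rows of the table can be sketched as a small decision function. This is only one way to encode the rule of thumb; the `Exposure` fields and the returned labels are hypothetical names, not part of any provider API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exposure:
    handles_sensitive_data: bool
    users_control_prompts: bool   # users can craft probe prompts
    weak_rate_limits: bool        # many probes per minute are possible
    timing_observable: bool       # response latency is visible to the user


def messages_policy(exposure: Exposure) -> str:
    """Return 'cache' or 'cache_with_salt' for the messages array."""
    # The timing attack needs all three preconditions at once.
    attack_feasible = (
        exposure.users_control_prompts
        and exposure.weak_rate_limits
        and exposure.timing_observable
    )
    if exposure.handles_sensitive_data and attack_feasible:
        return "cache_with_salt"
    return "cache"


internal_tool = Exposure(False, True, True, True)
rate_limited_app = Exposure(True, True, False, True)
high_risk_multi_tenant = Exposure(True, True, True, True)
```

Under these assumptions only `high_risk_multi_tenant` maps to salting. The other two profiles cache the messages array as-is.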

How LLM Inference Creates a Caching Opportunity

Every API request passes through two inference stages:

Prefill — the model processes all input tokens at once, building internal representations.
Decode — the model generates output tokens one at a time, each attending to everything before it.

For long prompts, prefill dominates total compute. A 100k-token prompt means 100k tokens of matrix work before the model produces a single output token. Each individual decode step costs more per token because decoding runs sequentially, but on large contexts the aggregate prefill cost still wins.

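During prefill, the model computes a key vector and a value vector (a KV pair) for each token at every attention layer. Because of causal masking, a token's KV pair depends only on that token and the tokens before it. Later tokens attend to these stored pairs, which is the mechanism behind transformer attention.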

Here is the insight that makes caching possible: in decoder-only models (Claude, GPT, and most modern LLMs), attention only flows forward. Token 50 attends to tokens 1–49, but tokens 1–49 never need to know about token 50. If you already computed KV pairs for tokens 1–49, and a new request starts with those same 49 tokens, the cached pairs are still valid. Prompt caching skips prefill for the matching prefix and only computes the suffix.

Without caching:  [████████████████████] prefill every request
With caching:     [████████ cached ████][██ new suffix ██] prefill only suffix
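
The forward-only property can be checked with a toy example. The sketch below is not a real model: it assumes a single attention head, identity projections, and a made-up `embed` function, and exists only to show that the states for a shared prefix come out identical whatever suffix follows.

```python
import math


def embed(token_id: int, dim: int = 4) -> list[float]:
    """Toy deterministic embedding; stands in for a learned lookup table."""
    return [math.sin(token_id * (k + 1)) for k in range(dim)]


def causal_attention(tokens: list[int]) -> list[list[float]]:
    """Single-head attention where position i only sees positions 0..i."""
    vecs = [embed(t) for t in tokens]
    outputs = []
    for i, query in enumerate(vecs):
        # Scores against the current token and everything before it, never after.
        scores = [sum(a * b for a, b in zip(query, vecs[j])) for j in range(i + 1)]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * vecs[j][k] for j, w in enumerate(weights)) / total
             for k in range(len(query))]
        )
    return outputs


prefix = [11, 42, 7, 99]
run_a = causal_attention(prefix + [5, 6])
run_b = causal_attention(prefix + [123, 456, 789])

# States for the shared prefix are identical across the two runs.
prefix_states_match = run_a[:len(prefix)] == run_b[:len(prefix)]
```

Only the positions after the shared prefix differ between the two runs, which is why a provider can reuse the stored prefix work and compute just the suffix.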

Prefix-Based Matching

Caching is prefix-based. The provider checks your prompt from token one forward. As long as tokens match sequentially, you get cache hits. The moment tokens diverge, matching stops.

Order matters. Identical content rearranged is a complete cache miss. Structure prompts with stable blocks first and dynamic blocks last:

[tool definitions] → [system prompt] → [conversation history] → [new user message]
     stable              stable              growing prefix           changes each turn

This structure is why multi-turn conversations and agentic workflows are the highest-ROI caching use case. On turn N, everything except the latest message is identical to turn N−1.
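
Prefix matching can be sketched as a walk from the first element forward. This is a simplification: `shared_prefix_len` is a hypothetical helper, and each string stands in for a run of tokens, whereas real providers match on tokens or hashed blocks.

```python
def shared_prefix_len(cached: list[str], incoming: list[str]) -> int:
    """Count leading elements that match; matching stops at the first difference."""
    matched = 0
    for old, new in zip(cached, incoming):
        if old != new:
            break
        matched += 1
    return matched


turn_1 = ["<tools>", "<system>", "user: hi"]
turn_2 = ["<tools>", "<system>", "user: hi", "assistant: hello", "user: next question"]
reordered = ["<system>", "<tools>", "user: hi"]

hits_next_turn = shared_prefix_len(turn_1, turn_2)     # 3: all of turn 1 is reused
hits_reordered = shared_prefix_len(turn_1, reordered)  # 0: same content, wrong order
```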

Why Agents Benefit Disproportionately

A typical agent API request stacks:

Tool definitions (often 5k–20k tokens)
System prompt (500–3k tokens)
Full message history including prior tool calls
The new user message or tool result at the end

In an agent loop with 10–15 internal tool calls before responding to the user, you send 10–15 requests where almost the entire prompt is a cache hit. Only the latest message suffix is new.

Scenario	Stable prefix	Turns	Approx. prefill avoided (upper bound: prefix × turns)
Simple chatbot	8k tokens	5 turns	~40k tokens of redundant prefill
Coding agent (Claude Code class)	40k tokens	12 tool calls	~480k tokens
RAG agent with cached docs	60k tokens	8 turns	~480k tokens

On Anthropic's April 2026 pricing, Claude Sonnet 4.6 charges $3 / 1M input but $0.30 / 1M cached reads — a 90% discount. The first request writes the cache; subsequent hits read at the lower rate. For a deeper dive on token economics and complementary optimizations, see our Caveman token compression guide.
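
A rough worked example of the arithmetic, using the two prices quoted above. The 1.25x cache-write multiplier is an assumption, not a figure from this article, so check the current pricing page before relying on it. The sketch also assumes the cache stays warm between calls and ignores suffix and output tokens. `prefix_cost` is a hypothetical helper.

```python
INPUT_PER_MTOK = 3.00          # USD per 1M input tokens (quoted in the text)
CACHE_READ_PER_MTOK = 0.30     # USD per 1M cached-read tokens (quoted in the text)
CACHE_WRITE_MULTIPLIER = 1.25  # ASSUMPTION: write premium over base input price


def prefix_cost(prefix_tokens: int, calls: int, cached: bool) -> float:
    """Cost in USD of sending the same stable prefix `calls` times."""
    if calls <= 0:
        return 0.0
    if not cached:
        return calls * prefix_tokens * INPUT_PER_MTOK / 1_000_000
    # The first call writes the cache; every later call reads it.
    write = prefix_tokens * INPUT_PER_MTOK * CACHE_WRITE_MULTIPLIER / 1_000_000
    reads = (calls - 1) * prefix_tokens * CACHE_READ_PER_MTOK / 1_000_000
    return write + reads


# Coding-agent row from the table: 40k-token prefix, 12 tool calls.
uncached = prefix_cost(40_000, 12, cached=False)
with_cache = prefix_cost(40_000, 12, cached=True)
```

Under these assumptions the prefix costs about $1.44 without caching and about $0.28 with it. Even with the write premium, caching pays for itself from the second call onward.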

Provider Mechanics: Anthropic vs OpenAI

Both major providers implement prefix caching, but the control surface differs:

Aspect	Anthropic (Claude)	OpenAI
Control	Explicit `cache_control` breakpoints on content blocks	Automatic on eligible requests
Minimum size	Varies by model; typically 1,024+ tokens per cache block	Automatic caching above 1,024 tokens
Pricing	Separate cache write vs cache read rates	Cached input at reduced rate on supported models
TTL	Configurable (e.g. 5-minute vs 1-hour cache)	Provider-managed

From an application architecture perspective, the rule is identical: stable content at the front, changing content at the end.

Anthropic example with explicit cache breakpoints:

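import anthropic

# Placeholder inputs so the snippet runs on its own; replace with real data.
# Blocks below the model's minimum cacheable size are processed without caching.
conversation_history = "Earlier turns of the conversation go here."
new_user_message = "The latest user message goes here."

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a coding assistant...",
            # Breakpoint 1: cache the prefix up to and including the system prompt.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": conversation_history,
                    # Breakpoint 2: cache everything up to and including the history.
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    # Dynamic suffix: no cache marker, so it is processed fresh each turn.
                    "type": "text",
                    "text": new_user_message,
                },
            ],
        }
    ],
)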

Place cache_control on the last block of content you want cached — the breakpoint marks "cache everything up to and including this block." Dynamic content goes after the breakpoint without cache markers.

For Claude-specific prompt structure patterns that maximize cache hits, see the master prompt engineering guide.

The Decision Framework

Layer 1: System Prompt and Tool Definitions — Cache Without Thinking

This is the simplest case.

If system prompts and tools are identical for all users, shared caching is exactly what you want. Every user hitting the same cached system prompt saves everyone money and latency. There is nothing sensitive to leak — the content is public to your application anyway.

If they contain user-specific data, prefixes diverge across users automatically. Different tokens mean different prefixes mean no cross-user cache hits. The timing attack requires a matching prefix; user-specific system prompts will not match another user's probes.

Verdict: cache system prompts and tool definitions. Always.

Layer 2: The Messages Array — Where Savings and Risk Live

The messages array is where multi-turn and agentic cost savings concentrate. It is also where user data lives — so this is the only layer where caching creates a potential security tradeoff.

Step 1: Does your messages array contain sensitive user data?

No (internal tools, dev environments, team-only apps) → cache normally. Done.
Yes (most user-facing applications) → proceed to Step 2.

Step 2: Is the timing side channel actually exploitable in your setup?

For the attack to work, all three conditions must be true simultaneously:

Precondition	Safe if...
User can influence message content	Your server constructs prompts entirely; users cannot send probe prefixes
Rate limits are weak	Tight rate limits make the many requests needed to average out network noise impractical
User can observe timing	Response times are buried in multi-step agent latency (30s+ workflows) rather than mapping 1:1 to a single observable API call

If any precondition is false, cache the messages array normally. The attack surface does not exist in practice for most production applications.

If the messages array holds sensitive data and all three preconditions are true (user-controlled prompts, loose rate limits, observable latency), use cache salting (below).
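The two-step check above can be written down as a small predicate. This is a sketch only: the `Exposure` fields and the function name `needs_cache_salting` are hypothetical names chosen for illustration, and the answers still have to come from your own review of the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exposure:
    """Answers to Step 1 and the three Step 2 preconditions (hypothetical names)."""
    sensitive_data: bool        # Step 1: messages contain sensitive user data
    user_controls_prompt: bool  # users can send probe prefixes
    weak_rate_limits: bool      # enough requests to average out network noise
    timing_observable: bool     # latency maps to a single observable API call


def needs_cache_salting(e: Exposure) -> bool:
    # Salt only when the data is sensitive AND every precondition holds;
    # if any one of them is false, cache the messages array normally.
    return (
        e.sensitive_data
        and e.user_controls_prompt
        and e.weak_rate_limits
        and e.timing_observable
    )


support_bot = Exposure(sensitive_data=True, user_controls_prompt=False,
                       weak_rate_limits=True, timing_observable=True)
print(needs_cache_salting(support_bot))
```

A support bot whose prompts are built entirely server-side fails the first precondition, so the check says to cache normally.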

The Security Model: Timing Side Channels

A cached response returns faster than an uncached one. That latency difference is measurable. An attacker who can send requests and observe time-to-first-token can probe whether a particular prefix exists in the cache.

Concrete example: In a multi-tenant app, User A submits a form with name, email, and phone. That conversation gets cached in the messages array. An attacker crafts prompts trying to reconstruct those values as prefixes:

"name: John, email: john@..." → fast response → cache hit
"name: Jane, email: jane@..." → slow response → cache miss

Repeat enough times and you can reconstruct what User A submitted.

Cross-Account vs Within-Account

Cross-account cache sharing is largely a solved problem. Research in 2025 demonstrated prompt reconstruction via KV-cache sharing (Wu et al., NDSS 2025), timing side channels on production services including Claude and Azure OpenAI (Song et al., IEEE 2025), and provider-wide auditing that led at least five providers to change implementations (Gu et al., ICML 2025). Major providers now scope caches to the account level.

Within your account — inside the application you are building — the theoretical risk remains if all three preconditions above are true.

For most applications, this sounds scarier than it is. Customer support bots with server-side prompt construction, agents with multi-second latency, and apps with standard API rate limits rarely satisfy all three conditions.

DIY Cache Salting for Multi-Tenant Apps

When you do need isolation, the fix is straightforward: inject a unique user identifier at the very beginning of the messages array, server-side, outside the user's control.

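def build_messages(user_id: str, history: list[dict], new_message: str) -> list[dict]:
    """Prepend a server-generated tenant salt so cached prefixes differ per user."""
    # Built server-side; the user never controls this message.
    tenant_salt = {
        "role": "user",
        "content": f"[session:{user_id}]",
    }
    # Salt first, then prior turns, then the new message as the uncached suffix.
    return [tenant_salt, *history, {"role": "user", "content": new_message}]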

Because caching is prefix-based, User A's messages start with [session:abc-123] and User B's with [session:def-456]. The prefixes diverge at the first token of the messages array, so there is no timing signal across tenants on the sensitive portion.

Why salt messages, not the system prompt?

Salting at the start of messages preserves shared caching on everything before it — tool definitions and system prompts still cache across all users. You only isolate the portion that contains sensitive data.

You lose cross-user cache hits on the messages portion, but in most multi-turn conversations messages are user-specific anyway — you were not going to get those hits regardless.

Model visibility: The model sees the salt as the first message. Either instruct it to ignore metadata tokens in your system prompt, or make the salt useful — pass personalization context the model can actually use:

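session_id = "abc-123"  # placeholder; generate this server-side per session
tenant_salt = {
    "role": "user",
    # The salt doubles as context the model can actually use.
    "content": f"[User context: plan=pro, locale=en-US, session={session_id}]",
}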

Structuring Prompts for Maximum Cache Hits

Beyond security, prompt architecture determines how much you save:

Do

Put static blocks first: tools → system → docs → history → new input
Use consistent formatting across requests (whitespace changes break prefixes)
Mark cache breakpoints on the last stable block before dynamic content
Keep system prompts stable across sessions; inject user context in messages instead

Avoid

Timestamps or random IDs in the system prompt (breaks cache for everyone)
Reordering message history between turns
Embedding dynamic content before static tool definitions
Putting user-specific data in the system prompt when messages would work (loses cross-user system cache without gaining security)
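One way to keep formatting consistent is to serialize any structured content you embed in a prompt through a single canonical function, so key order and whitespace never vary between requests. The sketch below uses only the standard library; `canonical_json` is a hypothetical helper name, and it only covers text you build yourself, not how a provider serializes tool definitions internally.

```python
import json


def canonical_json(obj) -> str:
    """Serialize to a byte-stable string: sorted keys, no optional whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


# The same data built in a different key order yields an identical string,
# so the prompt prefix that contains it stays identical too.
a = canonical_json({"name": "search", "schema": {"b": 1, "a": 2}})
b = canonical_json({"schema": {"a": 2, "b": 1}, "name": "search"})
print(a == b)
```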

For system prompt design patterns that pair well with caching, see What Is a System Prompt?.

Economics: When Caching Might Not Be Worth It

This framework focuses on multi-turn agents where caching the messages array is almost always net-positive. There are edge cases where cache writes cost more than skipping cache entirely — for example, a single-request workload with a large prefix that will never repeat within the cache TTL.

Evaluate your specific provider's write vs read pricing:

Cost type	Typical pattern
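Cache write	Elevated input price on Anthropic (1.25x base for the 5-minute cache, 2x for the 1-hour cache); no separate write charge on OpenAI
Cache read	~90% discount on Anthropic; discount on OpenAI varies by model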
Cache miss	Full input price, no write surcharge

Rule of thumb: if a prefix is reused at least twice after the initial write, within the cache TTL, caching wins. Agent loops with 5+ turns on a stable prefix are clear wins. One-shot batch jobs with unique prefixes every call may not benefit.
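The break-even point follows from simple arithmetic. If a write costs w times the base input price and a read costs r times, then n requests on the same prefix cost w + (n - 1) * r with caching and n without. The sketch below finds the smallest n where caching is strictly cheaper. The multipliers in the usage lines (1.25 and 2.0 for writes, 0.1 for reads) are assumed values for illustration, and `min_requests_for_caching_to_win` is a hypothetical name.

```python
def min_requests_for_caching_to_win(write_multiplier: float, read_multiplier: float) -> int:
    """Smallest number of requests on one prefix where caching is strictly cheaper.

    Costs are in units of "one full-price pass over the prefix":
      cached   = write_multiplier + (n - 1) * read_multiplier
      uncached = n
    """
    if read_multiplier >= 1:
        raise ValueError("cached reads must be cheaper than full-price input")
    n = 1
    while write_multiplier + (n - 1) * read_multiplier >= n:
        n += 1
    return n


print(min_requests_for_caching_to_win(1.25, 0.1))  # assumed short-TTL write rate
print(min_requests_for_caching_to_win(2.0, 0.1))   # assumed long-TTL write rate
```

With a modest write surcharge a single reuse already pays for the write; with a 2x write rate it takes a second reuse, which is where the conservative rule of thumb comes from.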

Combine caching with other cost levers from our token economics guide: batch APIs (50% off on both vendors), model routing, and output compression.

Putting It Together: Agent Architecture Checklist

Cache tool definitions and system prompt with explicit breakpoints (Anthropic) or stable prefix structure (OpenAI).
Grow conversation history at the end — never insert dynamic content mid-prefix.
Run the three-factor exposure test on your messages array; salt if all preconditions are true.
Monitor cache hit rates in your provider dashboard — cached input tokens should climb as sessions progress.
Watch cache TTL: Anthropic's 5-minute vs 1-hour settings matter for long sessions, because an expired cache has to be written again at the cache write rate.
Pair with context engineering — dynamic RAG context belongs after cached blocks, not mixed into system prompts. See context engineering patterns.
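For the monitoring item in the checklist, the hit rate can also be computed per request from the usage counters returned with each response. The sketch below assumes the Anthropic-style field names `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`, passed in as a plain dict; `cache_hit_rate` is a hypothetical helper, and other providers report cached tokens under different names.

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens served from cache for one request."""
    read = usage.get("cache_read_input_tokens") or 0         # served from cache
    written = usage.get("cache_creation_input_tokens") or 0  # newly written to cache
    uncached = usage.get("input_tokens") or 0                # processed at full price
    total = read + written + uncached
    return read / total if total else 0.0


# Example counters for a late turn in a session with a 39.5k-token cached prefix.
late_turn = {
    "input_tokens": 500,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 39_500,
}
print(f"{cache_hit_rate(late_turn):.1%}")
```

A rate that stays near zero after the first turn usually means something early in the prompt changes between requests.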

Sources

Andre Kreidemann, Prompt Caching: Just do it (March 2026) — decision framework this guide extends
Anthropic prompt caching documentation — cache_control breakpoints and pricing
Anthropic Claude pricing — cache read/write rates (retrieved June 2026)
OpenAI API pricing — cached input rates (retrieved June 2026)
Gu, C. et al., "Auditing Prompt Caching in Language Model APIs," ICML 2025. arXiv:2502.07776
Wu, G. et al., "I Know What You Asked: Prompt Leakage via KV-Cache Sharing in Multi-Tenant LLM Serving" (PROMPTPEEK), NDSS 2025
Song, J. et al., "The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems," IEEE 2025. arXiv:2409.20002
Luo, S. et al., "Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference," NDSS 2026. arXiv:2508.09442

Pricing, cache TTLs, and provider implementations are accurate as of June 2026 and may change. Verify current rates on official documentation before production deployment.

TL;DR: What to Cache and When

Prompt section	Cache?	Why
Tool definitions	Yes, always	Stable across users; no sensitive data to leak via shared prefix
System prompt	Yes, always	Same for all users → shared cache is desired; user-specific → prefixes diverge anyway
RAG / document context	Yes, if stable per session	Large prefixes that repeat across turns in the same conversation
Messages array (no PII)	Yes	Internal tools, dev environments, team-only apps
Messages array (sensitive PII)	Yes, with exposure check	Safe unless all three attack preconditions are true (see below)
Messages array (high-risk multi-tenant)	Yes, with cache salting	Inject server-side tenant ID at first message token

How LLM Inference Creates a Caching Opportunity

Every API request passes through two inference stages:

Prefill — the model processes all input tokens at once, building internal representations.
Decode — the model generates output tokens one at a time, each attending to everything before it.

During prefill, the model computes key-value (KV) pairs for each token. Because attention is causal, a token's KV pairs depend only on that token and the tokens before it, so the pairs computed for a prefix stay valid whenever a later request starts with the same tokens.

Without caching:  [████████████████████] prefill every request
With caching:     [████████ cached ████][██ new suffix ██] prefill only suffix

Prefix-Based Matching

Caching is prefix-based. The provider checks your prompt from token one forward. As long as tokens match sequentially, you get cache hits. The moment tokens diverge, matching stops.

Order matters. Identical content rearranged is a complete cache miss. Structure prompts with stable blocks first and dynamic blocks last:

[tool definitions] → [system prompt] → [conversation history] → [new user message]
     stable              stable              growing prefix           changes each turn
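The matching rule can be illustrated with a toy longest-common-prefix check. This is a sketch, not a provider's implementation: each string below stands in for a whole block of tokens, `shared_prefix_len` is a hypothetical name, and real services match on tokens at a granularity they define.

```python
def shared_prefix_len(a: list, b: list) -> int:
    """Count leading items that match in order; stop at the first difference."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


tools, system = "TOOLS", "SYSTEM"
turn1 = [tools, system, "user: hi"]
turn2 = [tools, system, "user: hi", "assistant: hello", "user: next"]
reordered = [system, tools, "user: hi"]

print(shared_prefix_len(turn1, turn2))      # the whole earlier turn is reusable
print(shared_prefix_len(turn1, reordered))  # same content, different order
```

The second turn reuses everything from the first, while the reordered prompt shares nothing even though its content is identical.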

