Prompt Caching: Decision Framework for LLM Cost, Latency, and Security (2026)

Prompt caching skips redundant prefill on unchanged prompt prefixes — cutting agent costs up to 90%. This guide explains how KV-cache reuse works, when to cache system prompts vs messages, and how to salt multi-tenant apps safely.

Jun 23, 2026·12 min read·Yash Thakker
Tags: Prompt Caching · LLM Optimization · AI Agents · Token Economics · AI Development

The first time you open a Claude Code or API usage dashboard and see a column labeled cached input tokens, the instinct is to assume it is a billing artifact — not real optimization. How do you cache something that needs full context on every turn?

That reaction is common even among people who have spent years in NLP. Prompt caching is not magic. It is a direct consequence of how decoder-only transformers compute attention during inference — and for almost everyone building multi-turn agents, it is an optimization you should enable aggressively.

This guide is a decision framework: when to cache what in your LLM application, with the security tradeoffs at each step. The technical framing draws on Andre Kreidemann's excellent write-up at kreidemann.com/blog/prompt-caching, extended here with ExplainX context on agent economics, provider mechanics, and production patterns.


TL;DR: What to Cache and When

| Prompt section | Cache? | Why |
|---|---|---|
| Tool definitions | Yes, always | Stable across users; no sensitive data to leak via shared prefix |
| System prompt | Yes, always | Same for all users → shared cache is desired; user-specific → prefixes diverge anyway |
| RAG / document context | Yes, if stable per session | Large prefixes that repeat across turns in the same conversation |
| Messages array (no PII) | Yes | Internal tools, dev environments, team-only apps |
| Messages array (sensitive PII) | Yes, with exposure check | Safe unless all three attack preconditions are true (see below) |
| Messages array (high-risk multi-tenant) | Yes, with cache salting | Inject server-side tenant ID at first message token |

Bottom line: cache system prompts and tools without hesitation. Cache conversation history for agent loops — that is where the money is. Salt messages only in the minority of setups where timing side channels are practically exploitable.

How LLM Inference Creates a Caching Opportunity

Every API request passes through two inference stages:

  1. Prefill — the model processes all input tokens at once, building internal representations.
  2. Decode — the model generates output tokens one at a time, each attending to everything before it.

For long prompts, prefill dominates total compute. A 100k-token prompt means 100k tokens of matrix work before the model produces a single output token — even though each individual decode step is more expensive per token, the aggregate prefill cost wins on large contexts.

During prefill, the model computes key-value (KV) pairs for each token. Each KV pair encodes what that token "knows" about every earlier token — the mechanism behind transformer attention.

Here is the insight that makes caching possible: in decoder-only models (Claude, GPT, and most modern LLMs), attention only flows forward. Token 50 attends to tokens 1–49, but tokens 1–49 never need to know about token 50. If you already computed KV pairs for tokens 1–49, and a new request starts with those same 49 tokens, the cached pairs are still valid. Prompt caching skips prefill for the matching prefix and only computes the suffix.

Without caching:  [████████████████████] prefill every request
With caching:     [████████ cached ████][██ new suffix ██] prefill only suffix
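
To make the forward-only property concrete, here is a toy sketch in pure Python. It is not how any provider implements attention: the function name `causal_attention` and the identity projections (queries, keys, and values are all just the input vectors) are assumptions made for illustration. It checks that appending a token leaves the outputs for every earlier position untouched, which is the reason their cached state can be reused.

```python
import math

def causal_attention(xs):
    """Toy single-head causal self-attention.

    Assumption for illustration: queries, keys, and values are the raw
    input vectors (no learned projections).
    """
    outputs = []
    for i, query in enumerate(xs):
        # Causal mask: position i only sees positions 0..i.
        visible = xs[: i + 1]
        scores = [sum(q * k for q, k in zip(query, key)) for key in visible]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, visible)) / total
            for d in range(len(query))
        ])
    return outputs

prefix = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
extended = prefix + [[2.0, -1.0]]  # one new token appended

prefix_out = causal_attention(prefix)
extended_out = causal_attention(extended)

# Earlier positions are unaffected by the appended token.
prefix_unchanged = extended_out[: len(prefix)] == prefix_out
print(prefix_unchanged)
```

Only the final position in `extended_out` is new work; everything before it could have been served from a cache.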

Prefix-Based Matching

Caching is prefix-based. The provider checks your prompt from token one forward. As long as tokens match sequentially, you get cache hits. The moment tokens diverge, matching stops.

Order matters. Identical content rearranged is a complete cache miss. Structure prompts with stable blocks first and dynamic blocks last:

[tool definitions] → [system prompt] → [conversation history] → [new user message]
     stable              stable              growing prefix           changes each turn

This structure is why multi-turn conversations and agentic workflows are the highest-ROI caching use case. On turn N, everything except the latest message is identical to turn N−1.
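
The matching rule can be sketched in a few lines. This is a simplified model, not a provider implementation: `shared_prefix_len` is a hypothetical helper, and whitespace-split words stand in for real tokenizer output.

```python
def shared_prefix_len(cached, incoming):
    """Count leading tokens that match position by position."""
    matched = 0
    for old, new in zip(cached, incoming):
        if old != new:
            break  # matching stops at the first divergence
        matched += 1
    return matched

# Assumption: whitespace-split words stand in for real tokens.
turn_1 = "TOOLS SYSTEM hello".split()
turn_2 = "TOOLS SYSTEM hello reply follow-up".split()
reordered = "SYSTEM TOOLS hello".split()

appended_hits = shared_prefix_len(turn_1, turn_2)      # history only grew at the end
reordered_hits = shared_prefix_len(turn_1, reordered)  # same content, different order

print(appended_hits, reordered_hits)
```

Appending to the end keeps the whole previous turn as a hit, while swapping two otherwise identical blocks drops the match to zero.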

[Figure: Why prefix caching changes agent economics, and why retrieval still matters for dynamic knowledge.]

Why Agents Benefit Disproportionately

A typical agent API request stacks:

  • Tool definitions (often 5k–20k tokens)
  • System prompt (500–3k tokens)
  • Full message history including prior tool calls
  • The new user message or tool result at the end

In an agent loop with 10–15 internal tool calls before responding to the user, you send 10–15 requests where almost the entire prompt is a cache hit. Only the latest message suffix is new.

| Scenario | Stable prefix | Turns | Approx. prefill avoided |
|---|---|---|---|
| Simple chatbot | 8k tokens | 5 turns | ~40k tokens of redundant prefill |
| Coding agent (Claude Code class) | 40k tokens | 12 tool calls | ~480k tokens |
| RAG agent with cached docs | 60k tokens | 8 turns | ~480k tokens |

On Anthropic's April 2026 pricing, Claude Sonnet 4.6 charges $3 / 1M input but $0.30 / 1M cached reads — a 90% discount. The first request writes the cache; subsequent hits read at the lower rate. For a deeper dive on token economics and complementary optimizations, see our Caveman token compression guide.
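
A rough cost comparison for the coding-agent row, as a sketch. The two per-million prices come from the paragraph above; the 1.25x surcharge on the cache-writing request is an assumption to verify against current pricing, and `prefix_cost` is a hypothetical helper. The model ignores the growing message suffix and output tokens, and assumes every request lands inside the cache TTL.

```python
INPUT_PER_MTOK = 3.00        # from the pricing quoted above
CACHED_READ_PER_MTOK = 0.30  # from the pricing quoted above
WRITE_MULTIPLIER = 1.25      # assumption: surcharge on the cache-writing request

def prefix_cost(prefix_tokens, requests, cached):
    """Dollar cost of sending the same stable prefix `requests` times."""
    if not cached:
        return prefix_tokens * requests * INPUT_PER_MTOK / 1_000_000
    # First request writes the cache; the remaining ones read it.
    write = prefix_tokens * INPUT_PER_MTOK * WRITE_MULTIPLIER / 1_000_000
    reads = prefix_tokens * (requests - 1) * CACHED_READ_PER_MTOK / 1_000_000
    return write + reads

uncached = prefix_cost(40_000, 12, cached=False)
cached = prefix_cost(40_000, 12, cached=True)
print(f"uncached ${uncached:.3f} vs cached ${cached:.3f}")
```

Under these assumptions the stable prefix costs about $1.44 uncached versus about $0.28 cached across the 12 calls.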


Provider Mechanics: Anthropic vs OpenAI

Both major providers implement prefix caching, but the control surface differs:

| Aspect | Anthropic (Claude) | OpenAI |
|---|---|---|
| Control | Explicit cache_control breakpoints on content blocks | Automatic on eligible requests |
| Minimum size | Varies by model; typically 1,024+ tokens per cache block | Automatic caching above 1,024 tokens |
| Pricing | Separate cache write vs cache read rates | Cached input at reduced rate on supported models |
| TTL | Configurable (e.g. 5-minute vs 1-hour cache) | Provider-managed |

From an application architecture perspective, the rule is identical: stable content at the front, changing content at the end.

Anthropic example with explicit cache breakpoints:

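import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder inputs for illustration; in a real app these come from session state.
# Blocks below the model's minimum cacheable size are processed without caching.
conversation_history = "User: ...\nAssistant: ..."
new_user_message = "Now add unit tests for that function."

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a coding assistant...",
            # Breakpoint 1: cache everything up to the end of the system prompt.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": conversation_history,
                    # Breakpoint 2: cache the stable conversation history.
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    # Dynamic suffix: no cache marker.
                    "type": "text",
                    "text": new_user_message,
                },
            ],
        }
    ],
)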

Place cache_control on the last block of content you want cached — the breakpoint marks "cache everything up to and including this block." Dynamic content goes after the breakpoint without cache markers.
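
The same "mark the last stable block" rule can be applied with a small helper. This is a sketch: `with_cache_breakpoint` and the two tool definitions are hypothetical, and it assumes your provider accepts a `cache_control` field on the final entry of a tool list in the same way as on content blocks, which you should confirm in the provider documentation.

```python
import copy

def with_cache_breakpoint(blocks):
    """Return a copy of `blocks` with a cache breakpoint on the final block."""
    if not blocks:
        return []
    marked = copy.deepcopy(blocks)  # leave the caller's list untouched
    marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked

# Hypothetical tool definitions, shaped like entries in a `tools` list.
tools = [
    {"name": "read_file", "description": "Read a file",
     "input_schema": {"type": "object"}},
    {"name": "run_tests", "description": "Run the test suite",
     "input_schema": {"type": "object"}},
]

cached_tools = with_cache_breakpoint(tools)
```

Only the last tool carries the marker, so the whole tool list sits in front of a single breakpoint.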

For Claude-specific prompt structure patterns that maximize cache hits, see the master prompt engineering guide.

The Decision Framework

Layer 1: System Prompt and Tool Definitions — Cache Without Thinking

This is the simplest case.

If system prompts and tools are identical for all users, shared caching is exactly what you want. Every user hitting the same cached system prompt saves everyone money and latency. There is nothing sensitive to leak — the content is public to your application anyway.

If they contain user-specific data, prefixes diverge across users automatically. Different tokens mean different prefixes mean no cross-user cache hits. The timing attack requires a matching prefix; user-specific system prompts will not match another user's probes.

Verdict: cache system prompts and tool definitions. Always.

Layer 2: The Messages Array — Where Savings and Risk Live

The messages array is where multi-turn and agentic cost savings concentrate. It is also where user data lives — so this is the only layer where caching creates a potential security tradeoff.

Step 1: Does your messages array contain sensitive user data?

  • No (internal tools, dev environments, team-only apps) → cache normally. Done.
  • Yes (most user-facing applications) → proceed to Step 2.

Step 2: Is the timing side channel actually exploitable in your setup?

For the attack to work, all three conditions must be true simultaneously:

| Precondition | Safe if... |
|---|---|
| User can influence message content | Your server constructs prompts entirely; users cannot send probe prefixes |
| Rate limits are weak | Tight rate limits make the many requests needed to average out network noise impractical |
| User can observe timing | Response times are buried in multi-step agent latency (30s+ workflows) rather than mapping 1:1 to a single observable API call |

If any precondition is false, cache the messages array normally. The attack surface does not exist in practice for most production applications.

If the messages contain sensitive data and all three preconditions hold (user-controlled prompts, loose rate limits, observable latency), use cache salting (below).
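
Steps 1 and 2 reduce to a short decision function. The sketch below encodes only the logic described in this section; the function name, parameter names, and return strings are illustrative choices, not part of any provider API.

```python
def messages_cache_policy(sensitive, user_controls_content,
                          weak_rate_limits, timing_observable):
    """Decide how to cache the messages array.

    Returns "cache" or "cache_with_salt".
    """
    # Step 1: no sensitive data means nothing to leak.
    if not sensitive:
        return "cache"
    # Step 2: the side channel needs all three preconditions at once.
    exploitable = (user_controls_content
                   and weak_rate_limits
                   and timing_observable)
    return "cache_with_salt" if exploitable else "cache"

# Support bot: sensitive data, but the server builds every prompt.
support_bot = messages_cache_policy(True, False, True, True)

# Raw prompt passthrough with loose limits and visible latency.
passthrough = messages_cache_policy(True, True, True, True)
```

Note that neither branch returns "do not cache": in this framework the choice is between caching as-is and caching behind a salt.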

The Security Model: Timing Side Channels

A cached response returns faster than an uncached one. That latency difference is measurable. An attacker who can send requests and observe time-to-first-token can probe whether a particular prefix exists in the cache.

Concrete example: In a multi-tenant app, User A submits a form with name, email, and phone. That conversation gets cached in the messages array. An attacker crafts prompts trying to reconstruct those values as prefixes:

  • "name: John, email: john@..." → fast response → cache hit
  • "name: Jane, email: jane@..." → slow response → cache miss

Repeat enough times and you can reconstruct what User A submitted.
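
The signal the attacker relies on can be shown with a toy simulation. Everything here is an assumption made for illustration: `ToyPrefixCache` is not a real service, latency is modeled as the count of uncached tokens, and whole words stand in for tokens. Real systems add network noise and minimum cache block sizes, which is why the preconditions above matter.

```python
def matched_prefix(a, b):
    """Number of leading tokens two sequences share."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count

class ToyPrefixCache:
    """Simulated server whose latency equals the number of uncached tokens."""

    def __init__(self):
        self.seen = []  # every token sequence processed so far

    def request(self, tokens):
        best = max((matched_prefix(s, tokens) for s in self.seen), default=0)
        self.seen.append(list(tokens))
        return len(tokens) - best  # simulated latency, arbitrary units

server = ToyPrefixCache()
server.request(["name:", "John,", "email:", "john@example.com"])  # User A's turn

hit_latency = server.request(["name:", "John,"])   # correct guess
miss_latency = server.request(["name:", "Jane,"])  # wrong guess

print(hit_latency, miss_latency)
```

The correct guess comes back faster than the wrong one, and that gap is the entire side channel.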

Cross-Account vs Within-Account

Cross-account cache sharing is largely a solved problem. Research in 2025 demonstrated prompt reconstruction via KV-cache sharing (Wu et al., NDSS 2025), timing side channels on production services including Claude and Azure OpenAI (Song et al., IEEE 2025), and provider-wide auditing that led at least five providers to change implementations (Gu et al., ICML 2025). Major providers now scope caches to the account level.

Within your account — inside the application you are building — the theoretical risk remains if all three preconditions above are true.

For most applications, this sounds scarier than it is. Customer support bots with server-side prompt construction, agents with multi-second latency, and apps with standard API rate limits rarely satisfy all three conditions.

DIY Cache Salting for Multi-Tenant Apps

When you do need isolation, the fix is straightforward: inject a unique user identifier at the very beginning of the messages array, server-side, outside the user's control.

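def build_messages(user_id: str, history: list, new_message: str) -> list:
    # Server-side salt: the client never sets this value.
    tenant_salt = {
        "role": "user",
        "content": f"[session:{user_id}]",
    }
    # Salt goes first, so every later token sits behind a tenant-specific prefix.
    return [tenant_salt, *history, {"role": "user", "content": new_message}]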

Because caching is prefix-based, User A's messages start with [session:abc-123] and User B's with [session:def-456]. The two prefixes diverge inside the salt, before any user data, so there is no timing signal across tenants on the sensitive portion.

Why salt messages, not the system prompt?

Salting at the start of messages preserves shared caching on everything before it — tool definitions and system prompts still cache across all users. You only isolate the portion that contains sensitive data.

You lose cross-user cache hits on the messages portion, but in most multi-turn conversations messages are user-specific anyway — you were not going to get those hits regardless.

Model visibility: The model sees the salt as the first message. Either instruct it to ignore metadata tokens in your system prompt, or make the salt useful — pass personalization context the model can actually use:

tenant_salt = {
    "role": "user",
    "content": f"[User context: plan=pro, locale=en-US, session={session_id}]",
}

Structuring Prompts for Maximum Cache Hits

Beyond security, prompt architecture determines how much you save:

Do

  • Put static blocks first: tools → system → docs → history → new input
  • Use consistent formatting across requests (whitespace changes break prefixes)
  • Mark cache breakpoints on the last stable block before dynamic content
  • Keep system prompts stable across sessions; inject user context in messages instead

Avoid

  • Timestamps or random IDs in the system prompt (breaks cache for everyone)
  • Reordering message history between turns
  • Embedding dynamic content before static tool definitions
  • Putting user-specific data in the system prompt when messages would work (loses cross-user system cache without gaining security)
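
As a sketch of the first "Avoid" item and its fix, the helper below keeps the system prompt byte-identical and moves the volatile timestamp into the user message. `build_request`, the prompt text, and the `[context: ...]` format are hypothetical.

```python
from datetime import datetime, timezone

SYSTEM_PROMPT = "You are a support assistant. Follow the policies below."

def build_request(user_text, now):
    """Keep the system prompt stable; put volatile data in the new message."""
    context = f"[context: time={now.isoformat()}]"
    return {
        "system": SYSTEM_PROMPT,
        "messages": [{"role": "user", "content": f"{context}\n{user_text}"}],
    }

first = build_request("Where is my order?",
                      datetime(2026, 6, 23, 9, 0, tzinfo=timezone.utc))
later = build_request("Where is my order?",
                      datetime(2026, 6, 23, 9, 5, tzinfo=timezone.utc))

system_is_stable = first["system"] == later["system"]
```

The timestamp should be attached once, when a message is created, and never regenerated for earlier turns; otherwise the history prefix changes on every request.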

For system prompt design patterns that pair well with caching, see What Is a System Prompt?.

Economics: When Caching Might Not Be Worth It

This framework focuses on multi-turn agents where caching the messages array is almost always net-positive. There are edge cases where cache writes cost more than skipping cache entirely — for example, a single-request workload with a large prefix that will never repeat within the cache TTL.

Evaluate your specific provider's write vs read pricing:

| Cost type | Typical pattern |
|---|---|
| Cache write | Full or elevated input price on first request |
| Cache read | ~90% discount on Anthropic; reduced rate on OpenAI that varies by model |
| Cache miss | Full input price, no write surcharge |

Rule of thumb: if a prefix repeats at least twice within the cache TTL, caching wins. Agent loops with 5+ turns on a stable prefix are clear wins. One-shot batch jobs with unique prefixes every call may not benefit.
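
The break-even point behind that rule of thumb follows from a short inequality. Measured in units of one full-price pass over the prefix, caching costs `write + reads * read` and not caching costs `1 + reads`, so caching wins once `reads > (write - 1) / (1 - read)`. The sketch below evaluates this; the example multipliers (1.25x and 2x for writes, 0.1x for reads) are assumptions to check against your provider's pricing page.

```python
import math

def min_reads_to_break_even(write_multiplier, read_multiplier):
    """Smallest number of cache reads for which caching is strictly cheaper.

    Costs are in units of one full-price pass over the prefix:
      cached   = write_multiplier + reads * read_multiplier
      uncached = 1 + reads
    """
    if read_multiplier >= 1:
        raise ValueError("cache reads must be cheaper than full-price input")
    threshold = (write_multiplier - 1) / (1 - read_multiplier)
    return max(1, math.floor(threshold) + 1)

short_ttl = min_reads_to_break_even(1.25, 0.1)  # assumed short-TTL write surcharge
long_ttl = min_reads_to_break_even(2.0, 0.1)    # assumed long-TTL write surcharge

print(short_ttl, long_ttl)
```

Under these assumed multipliers, a modest write surcharge pays off on the first reuse, while a 2x surcharge needs two reuses, which matches the conservative "at least twice" guidance.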

Combine caching with other cost levers from our token economics guide: batch APIs (50% off on both vendors), model routing, and output compression.

Putting It Together: Agent Architecture Checklist

  1. Cache tool definitions and system prompt with explicit breakpoints (Anthropic) or stable prefix structure (OpenAI).
  2. Grow conversation history at the end — never insert dynamic content mid-prefix.
  3. Run the three-factor exposure test on your messages array; salt if all preconditions are true.
  4. Monitor cache hit rates in your provider dashboard — cached input tokens should climb as sessions progress.
  5. Watch cache TTL — Anthropic's 5-minute vs 1-hour settings matter for long sessions; expired caches re-write at full cost.
  6. Pair with context engineering — dynamic RAG context belongs after cached blocks, not mixed into system prompts. See context engineering patterns.
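
For item 4, a per-request hit rate can be computed from the usage counters returned with each response. This sketch assumes Anthropic-style field names (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`) passed in as a plain dict; other providers report cached tokens under different names, and `cache_hit_rate` is a hypothetical helper.

```python
def cache_hit_rate(usage):
    """Share of input tokens that were served from the cache."""
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    total = read + written + fresh
    return read / total if total else 0.0

# Illustrative numbers, not real measurements.
first_turn = {"input_tokens": 200, "cache_creation_input_tokens": 4800,
              "cache_read_input_tokens": 0}
later_turn = {"input_tokens": 500, "cache_creation_input_tokens": 0,
              "cache_read_input_tokens": 4500}

first_rate = cache_hit_rate(first_turn)
later_rate = cache_hit_rate(later_turn)
```

A rate that stays near zero after the first turn usually means something volatile has crept into the prefix.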

Related Reading

  • Caveman skill: token economics and API pricing — complementary cost levers beyond caching
  • What Is a System Prompt? Complete Guide — where caching meets system prompt design
  • Master Prompt Engineering for Claude — cache-friendly XML structure patterns
  • What Are LLM Tokens? — how tokenization affects prefix matching
  • Claude Code Pricing Guide 2026 — real-world session costs with caching discounts

Sources

  • Andre Kreidemann, Prompt Caching: Just do it (March 2026) — decision framework this guide extends
  • Anthropic prompt caching documentation — cache_control breakpoints and pricing
  • Anthropic Claude pricing — cache read/write rates (retrieved June 2026)
  • OpenAI API pricing — cached input rates (retrieved June 2026)
  • Gu, C. et al., "Auditing Prompt Caching in Language Model APIs," ICML 2025. arXiv:2502.07776
  • Wu, G. et al., "I Know What You Asked: Prompt Leakage via KV-Cache Sharing in Multi-Tenant LLM Serving" (PROMPTPEEK), NDSS 2025
  • Song, J. et al., "The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems," IEEE 2025. arXiv:2409.20002
  • Luo, S. et al., "Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference," NDSS 2026. arXiv:2508.09442

Pricing, cache TTLs, and provider implementations are accurate as of June 2026 and may change. Verify current rates on official documentation before production deployment.
