LLM Prompt Caching: How to Cut AI Costs by ~90%

Written By  Crosscheck Team

Content Team

May 22, 2026 11 minutes

LLM Prompt Caching, Explained — and What It Actually Saves in 2026

Prompt caching is the runtime feature that lets an LLM reuse the pre-computed attention over the static parts of your prompt — system instructions, tool definitions, long documents, few-shot examples — so you don't pay full input price for them on every call. On most providers in 2026, cached input tokens cost about 10% of the standard input rate, which is where the "90% cheaper" headline comes from. The catch is that caches are short-lived (typically 5 minutes by default, up to an hour), require an identical prefix, and only pay off when the same static block hits the model many times in a short window.

Key takeaways

  • Cached reads cost roughly 10% of input on Anthropic, AWS Bedrock, and Gemini Vertex AI; OpenAI's automatic cache lands at a 50–75% discount depending on model.
  • Anthropic uses explicit cache_control markers, OpenAI caches automatically above a 1,024-token prefix, and Gemini supports both implicit and explicit context caching on Gemini 2.5+ models.
  • Default TTL is short. Anthropic silently reverted Claude's default cache TTL from 1 hour to 5 minutes in March 2026 — production teams should set ttl explicitly.
  • Cache hits require byte-identical prefixes. A trailing whitespace change, a re-ordered tool definition, or a timestamp in the system prompt will reset the cache to zero.
  • Stacking matters. Batch API plus prompt caching can compound — OpenAI's combined discount lands around 75% on cached portions, and Anthropic's stack lets cached input drop to roughly 5% of base.

What prompt caching is, mechanically

Every transformer call has two costs: processing the input into key/value tensors (the "prefill"), and generating the output one token at a time. Prefill cost grows with input length and dominates on long-context calls. Prompt caching stores those pre-computed key/value tensors, keyed against the exact prefix. When the next request arrives with the same prefix, the model skips the prefill for that block and resumes from the cached state.

The practical effect is that a 20K-token system prompt, which would otherwise be billed at the full input rate on every call, is billed at or slightly above the full rate once (the cache write, 1.25x input on Anthropic's 5-minute cache) and at ~10% of the input price on every subsequent hit. For a chatbot that calls the model 100 times an hour with the same system prompt, that's a step-function cost change.

A few characteristics fall out of the mechanics:

  • Prefixes only. You cache from the start of the prompt up to a designated cut point. Anything after the cut runs fresh every call. That's why static-first ordering matters so much.
  • Bytes, not semantics. The cache key is the raw token sequence. Changing "you are a helpful assistant" to "You are a helpful assistant." (capital Y, trailing period) is a fresh prefix.
  • Short TTL. Cached state lives in expensive GPU memory. Providers evict it aggressively — typically 5 minutes default, with paid extensions to an hour.
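
To make the "bytes, not semantics" point concrete, here is a minimal sketch of prefix keying. It is not any provider's real implementation: real caches key on token sequences inside the serving stack, and `prefix_cache_key` is a hypothetical helper that stands in for that with a SHA-256 hash over the UTF-8 bytes of the prefix.

```python
import hashlib


def prefix_cache_key(static_prefix: str) -> str:
    """Hypothetical stand-in for a provider-side cache key: a hash of the exact prefix bytes."""
    return hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()


a = prefix_cache_key("you are a helpful assistant")
b = prefix_cache_key("You are a helpful assistant.")
c = prefix_cache_key("you are a helpful assistant")

print(a == c)  # identical bytes map to the same key
print(a == b)  # a capital letter and a trailing period produce a different key
```

Any lookup built on a key like this either matches exactly or misses entirely, which is why small formatting drift behaves like a brand-new prompt.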

How each major provider implements it

The headline pricing is similar across vendors but the developer experience differs sharply. Here is a side-by-side for May 2026, verified against each provider's current docs.

| Provider | Activation | Min prefix | Default TTL | Extended TTL | Cache write cost | Cache read cost | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Anthropic API (Claude Opus 4.7, Sonnet 4.6, Haiku 4.5) | Explicit cache_control: { type: "ephemeral" } markers | None (per block) | 5 min (changed from 1h in March 2026) | 1 hour via ttl: "1h" | 1.25x input (5m) / 2.0x input (1h) | 0.1x input | Tool defs, system, and messages are all cacheable. Access resets TTL. |
| OpenAI API (GPT-5, GPT-5.4, GPT-5.5 family) | Automatic | 1,024 tokens (then 128-token increments) | 5–10 min idle, up to ~1h off-peak | Not user-configurable | Free (no write surcharge) | 0.5x input standard, down to 0.25x on newer models | Pass prompt_cache_key to keep traffic on a shared cache shard. |
| Google Gemini API (Gemini 2.5 Flash, 2.5 Pro, 3.x) | Implicit (default on 2.5+) or explicit CachedContent | 1,024 tok (Flash), 4,096 tok (Pro) for implicit | Implicit ~3–5 min; explicit default 1h | Custom TTL on explicit caches | Free for implicit; explicit charges per-token-per-hour storage | ~0.25x input (75% off) on Gemini API; 0.1x input on Vertex AI | Explicit caches have storage fees (~$1/M tok/hr Flash, $4.50/M tok/hr Pro). |
| AWS Bedrock (Claude 4.5 series, Nova) | Explicit cachePoint / cache_control | None | 5 min | 1 hour on Claude 4.5 family | 1.25x input (5m) / 2.0x input (1h) | 0.1x input | Not supported with batch inference API. Per-account isolation. |

A few things worth pulling out of the table.

Anthropic's cache_control is the model to learn first. It's explicit, predictable, and the read multiplier is the most aggressive on the market. The 5-minute default TTL reversion is real — independent analysis of Claude Code session logs by community engineers in April 2026 confirmed a 20–32% increase in cache creation costs from the change. Set ttl explicitly:

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "<long system prompt + retrieved docs>",
      "cache_control": { "type": "ephemeral", "ttl": "1h" }
    }
  ],
  "messages": [{ "role": "user", "content": "User question goes here" }]
}
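
The same request body can be assembled in Python. This is a sketch that only builds and serialises the payload; it does not call any API. `build_cached_request` is a hypothetical helper name, and the `max_tokens` value is an arbitrary placeholder rather than a recommendation.

```python
import json


def build_cached_request(static_prefix: str, user_message: str, ttl: str = "1h") -> dict:
    """Build a Messages-style request body with an explicit cache TTL on the static block."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,  # placeholder output cap, chosen for illustration
        "system": [
            {
                "type": "text",
                "text": static_prefix,
                # Marks the end of the cacheable prefix and pins the TTL explicitly.
                "cache_control": {"type": "ephemeral", "ttl": ttl},
            }
        ],
        # Variable content goes after the cached block.
        "messages": [{"role": "user", "content": user_message}],
    }


body = build_cached_request("<long system prompt + retrieved docs>", "User question goes here")
print(json.dumps(body, indent=2))
```

Keeping the TTL as a function argument with an explicit default means a provider-side change to the default cannot silently alter what the application requests.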

OpenAI's automatic cache is invisible until you measure it. Caching kicks in for prefixes ≥1,024 tokens, increasing in 128-token increments. You verify hits via usage.prompt_tokens_details.cached_tokens on the response. If that value is zero on a long prompt, your prefix isn't stable — most often because a timestamp, request ID, or A/B variant got injected near the top.

OpenAI also lets you pass a prompt_cache_key parameter to keep related traffic on the same cache shard, which materially raises hit rates on multi-instance deployments.

Gemini's two modes need a deliberate choice. Implicit caching on Gemini 2.5+ is free, but the discount is a soft "we'll apply it if you hit a cache" — no guarantee. Explicit CachedContent gives a guaranteed discount but charges hourly storage. For a 50K-token reference document hit ten times a day, implicit is fine. For a 500K-token codebase fed into Gemini 2.5 Pro hundreds of times an hour, explicit caching with a 24-hour TTL is the move — even with storage costs, the read savings dominate.
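
A rough way to sanity-check the explicit-caching case is to put the storage fee next to the read savings. The sketch below uses the Pro figures from the table (0.25x cached reads, about $4.50 per million tokens per hour of storage). The $1.25/M base input price and the 200 calls per hour are illustrative assumptions, not quoted rates, and `explicit_cache_hourly_cost` is a hypothetical helper.

```python
def explicit_cache_hourly_cost(
    cached_tokens: int,
    calls_per_hour: int,
    input_price_per_m: float,
    read_multiplier: float = 0.25,         # cached-read rate from the table (Gemini API)
    storage_per_m_per_hour: float = 4.50,  # approximate Pro storage fee from the table
) -> dict:
    """Compare one hour of uncached reads with cached reads plus storage, in dollars."""
    millions = cached_tokens / 1_000_000
    uncached = millions * input_price_per_m * calls_per_hour
    cached_reads = uncached * read_multiplier
    storage = millions * storage_per_m_per_hour
    return {
        "uncached": uncached,
        "cached_total": cached_reads + storage,
        "storage": storage,
    }


# Assumed: 500K-token codebase, 200 calls/hour, $1.25/M base input price (illustrative only).
costs = explicit_cache_hourly_cost(500_000, 200, input_price_per_m=1.25)
print(costs)
```

Under these assumptions the storage fee is a small fraction of the read savings. The sketch ignores the one-off cost of creating the cache and the uncached tokens in each request, so treat it as an order-of-magnitude check only.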


A worked cost example: 8K system prompt, 100 conversations/day

This is the math that decides whether you bother. Take a customer-support chatbot built on Claude Sonnet 4.6 ($3/M input, $15/M output):

  • Static prefix: 8,000 tokens of system instructions, brand voice, tool definitions, and 6 few-shot examples.
  • Per-turn user message: ~200 tokens.
  • Per-turn output: ~400 tokens.
  • Volume: 100 conversations per day, ~5 turns each = 500 calls/day.

Without prompt caching (full input pricing on every call):

  • Input per call: 8,000 + 200 = 8,200 tokens at $3/M = $0.0246
  • Output per call: 400 tokens at $15/M = $0.006
  • Per-call total: ~$0.0306
  • Daily: 500 × $0.0306 = $15.30
  • Monthly: ~$459

With prompt caching (5-minute TTL, assuming each conversation's 5 turns fall within the cache window):

  • Each conversation does 1 cache write (8,000 tokens × $3/M × 1.25 = $0.030) + 4 cache reads (8,000 tokens × $3/M × 0.1 = $0.0024 each).
  • Variable input per call: 200 tokens × $3/M = $0.0006
  • Output per call: 400 tokens × $15/M = $0.006
  • Per-conversation cost: $0.030 (write) + 4 × $0.0024 (reads) + 5 × $0.0066 (variable+output) = $0.0726
  • Daily: 100 × $0.0726 = $7.26
  • Monthly: ~$218

That's a ~53% reduction on this profile. The savings get more dramatic as conversations grow longer or the static prefix grows — and they grow steeper still with the 1-hour TTL if traffic is bursty.
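
The arithmetic above can be reproduced with a short script, which also makes it easy to plug in a different profile. This is a sketch under the same assumptions as the example (one cache write per conversation, every later turn a cache hit); the function and constant names are illustrative.

```python
INPUT_PER_M = 3.00    # input price used in the example above ($/M tokens)
OUTPUT_PER_M = 15.00  # output price used in the example above ($/M tokens)
WRITE_MULT = 1.25     # 5-minute cache write multiplier
READ_MULT = 0.10      # cached read multiplier


def cost(tokens: int, price_per_m: float, multiplier: float = 1.0) -> float:
    """Dollar cost of a token count at a per-million price and optional multiplier."""
    return tokens * price_per_m * multiplier / 1_000_000


def conversation_cost(prefix: int, user: int, output: int, turns: int, cached: bool) -> float:
    """Cost of one conversation, with or without caching the static prefix."""
    per_turn_variable = cost(user, INPUT_PER_M) + cost(output, OUTPUT_PER_M)
    if not cached:
        return turns * (cost(prefix, INPUT_PER_M) + per_turn_variable)
    # One cache write on the first turn, cache reads on the remaining turns.
    prefix_cost = cost(prefix, INPUT_PER_M, WRITE_MULT)
    prefix_cost += (turns - 1) * cost(prefix, INPUT_PER_M, READ_MULT)
    return prefix_cost + turns * per_turn_variable


uncached_daily = 100 * conversation_cost(8_000, 200, 400, turns=5, cached=False)
cached_daily = 100 * conversation_cost(8_000, 200, 400, turns=5, cached=True)
reduction = 1 - cached_daily / uncached_daily

print(f"uncached: ${uncached_daily:.2f}/day, cached: ${cached_daily:.2f}/day, saving: {reduction:.0%}")
```

Raising `turns` or `prefix` in the calls shows how the saving moves toward the headline figure as the static block dominates the bill.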

Run the same math against Claude Opus 4.7 ($5/M input) with a 20K-token RAG context and 200 calls/hour and the monthly delta moves into thousands of dollars. The general rule: the bigger your static block relative to per-call variable content, and the higher your call frequency within the TTL, the closer to 90% your savings get.


When prompt caching helps — and when it doesn't

Caching pays off in proportion to (a) the static fraction of each prompt and (b) the request rate within the TTL. A quick map:

Strong fit:

  • Customer support chatbots with a long system prompt, brand-voice examples, and tool definitions.
  • Code assistants carrying a repo context, style guide, and toolchain definitions across many calls.
  • RAG pipelines where the same retrieved chunks get re-fed to the model on follow-up questions.
  • Agents that pass the same tool schemas on every step of a loop.
  • Document Q&A where one long document is queried many times.

Weak or no fit:

  • Highly variable prompts where each call has a different system message — there's nothing stable to cache.
  • Long-tail, low-frequency traffic. If a given prefix is hit once every 30 minutes, the cache evicts between hits and you pay write costs repeatedly.
  • Prompts under the minimum prefix. OpenAI ignores anything below 1,024 tokens; Gemini Pro implicit ignores anything below 4,096.
  • One-shot batch jobs with unique prompts per item.

It's also worth noting that caching only reduces input cost. Output tokens are billed at full rate regardless, so caching has less impact on output-heavy workloads (long-form generation, creative writing) than on input-heavy ones (analysis over long context).


Practical patterns that actually work

After two years of production usage across the major providers, a handful of patterns reliably move the needle.

Pin the static block to the top. This is non-negotiable. Tool definitions first, then system instructions, then long documents or examples, then, finally, the user's variable input. Anything dynamic that creeps into the prefix invalidates everything that comes after it.

Treat the static block like a deployment artifact. If three engineers each tweak the system prompt for an experiment, you'll fragment your cache across three prefixes and get hit rates near zero. Version it, lint it, ship it like code.

Cache tool definitions even if you don't think they're "static." Tool/function schemas are often the largest stable block in an agent system. Anthropic explicitly lists tool definitions as cacheable, and on a 30-tool agent the schemas alone can run 4–8K tokens.

Use prompt_cache_key on OpenAI deployments. OpenAI's docs are explicit that without a stable cache key, requests can land on different shards and miss the cache. A per-tenant or per-feature key keeps similar traffic together.

Combine with the batch API where latency allows. OpenAI's batch API (50% off) and Anthropic's message batches stack with prompt caching. On Anthropic, the combined discount pushes cached batch input to roughly 5% of base — useful for nightly classification or extraction jobs over a large corpus. (Note: AWS Bedrock's batch inference does not currently support prompt caching, per their docs — that combination only works on the direct provider APIs.)
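
The stacked figures are just two multipliers applied in sequence. The sketch below assumes the batch discount and the cached-read discount multiply, which is what the 5% and 75% figures in this article imply; `stacked_multiplier` is a hypothetical helper.

```python
def stacked_multiplier(batch_multiplier: float, cached_read_multiplier: float) -> float:
    """Effective price of cached input in a batch job, as a fraction of base input price."""
    return batch_multiplier * cached_read_multiplier


anthropic = stacked_multiplier(0.5, 0.10)  # 50% batch discount x 0.1x cached reads
openai = stacked_multiplier(0.5, 0.50)     # 50% batch discount x 0.5x cached reads

print(f"Anthropic cached batch input: {anthropic:.0%} of base")
print(f"OpenAI cached batch input: {openai:.0%} of base")
```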

Measure cache hit rate as a first-class metric. Every provider exposes cache stats in the response object. Anthropic returns cache_creation_input_tokens and cache_read_input_tokens; OpenAI returns cached_tokens in usage.prompt_tokens_details; Gemini returns cached_content_token_count. Track the ratio. A drop in hit rate is almost always a regression — a recent prompt change, a new injected variable, or a TTL window getting missed.
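
A hit-rate metric can be derived from the usage fields named above. The sketch below works on plain dictionaries shaped like the usage objects; it assumes Anthropic's `input_tokens` excludes cached tokens while OpenAI's `prompt_tokens` includes them, and the helper names and sample numbers are illustrative.

```python
def anthropic_hit_rate(usage: dict) -> float:
    """Share of input tokens served from cache, from an Anthropic-style usage object."""
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)  # assumed to exclude cached and newly written tokens
    total = read + written + fresh
    return read / total if total else 0.0


def openai_hit_rate(usage: dict) -> float:
    """Share of prompt tokens served from cache, from an OpenAI-style usage object."""
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    total = usage.get("prompt_tokens", 0)  # assumed to include the cached tokens
    return cached / total if total else 0.0


# Made-up sample usage objects for an 8K-token prefix plus a 200-token user message.
anthropic_usage = {"input_tokens": 200, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 8000}
openai_usage = {"prompt_tokens": 8200, "prompt_tokens_details": {"cached_tokens": 8064}}

print(f"Anthropic hit rate: {anthropic_hit_rate(anthropic_usage):.1%}")
print(f"OpenAI hit rate: {openai_hit_rate(openai_usage):.1%}")
```

Emitting this ratio per request to whatever metrics system is already in place is usually enough to catch a regression within minutes of a bad prompt change.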

Pre-warm bursty caches. If you know a workload will spike at the top of every hour, fire a single "warm-up" call ~30 seconds before traffic arrives. Anthropic's TTL resets on access, so a well-timed warm-up keeps the cache alive for the entire spike window.


Pitfalls that quietly cost money

The failure modes are subtle. Most teams don't realise their cache is dead until they look at the bill.

Whitespace and ordering drift. Caches key on raw bytes. If your prompt template renders with a trailing newline on weekdays and without one on weekends (yes, this happens), you have two separate caches. Lint your templates for stable serialisation.

Timestamps in system prompts. A surprisingly common pattern is "Current date: 2026-05-22" injected at the top of the system message for date-awareness. That single line invalidates the cache every time its value changes: once a day for a date, on every request for a full timestamp. Move it to the end of the prompt, or quantise a full timestamp to the hour if you can tolerate the staleness.
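
Both mitigations can be combined in the prompt builder. The sketch below keeps the static block first, rounds the time down to the hour, and places it after the prefix; `build_prompt` and the sample prefix are hypothetical.

```python
from datetime import datetime, timezone


def build_prompt(static_prefix: str, user_message: str, now: datetime) -> str:
    """Keep the static block first and append the hour-quantised time after it."""
    quantised = now.replace(minute=0, second=0, microsecond=0)
    dynamic_tail = f"Current time (UTC, rounded down to the hour): {quantised.isoformat()}"
    return f"{static_prefix}\n\n{dynamic_tail}\n\n{user_message}"


prefix = "You are a support assistant for ExampleCo."  # hypothetical static block
p1 = build_prompt(prefix, "Where is my order?", datetime(2026, 5, 22, 9, 5, tzinfo=timezone.utc))
p2 = build_prompt(prefix, "Where is my order?", datetime(2026, 5, 22, 9, 55, tzinfo=timezone.utc))

print(p1 == p2)              # both requests fall in the same hour, so the text is identical
print(p1.startswith(prefix)) # the static block is untouched by the time value
```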

Tool re-ordering. Some SDKs serialise tools in object-iteration order, which isn't guaranteed to be stable. Sort tool definitions alphabetically or by ID before sending.
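
One way to get stable ordering is to canonicalise tool definitions before they reach the request. The sketch below sorts by tool name and serialises with sorted keys and fixed separators; the tool definitions are made up, and `canonical_tools` is a hypothetical helper.

```python
import json


def canonical_tools(tools: list[dict]) -> str:
    """Serialise tool definitions deterministically: sorted by name, sorted keys, fixed separators."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


tools_a = [
    {"name": "search_orders", "description": "Look up an order", "input_schema": {"type": "object"}},
    {"name": "cancel_order", "input_schema": {"type": "object"}, "description": "Cancel an order"},
]
tools_b = list(reversed(tools_a))  # same tools, different order

print(canonical_tools(tools_a) == canonical_tools(tools_b))  # order no longer matters
```

When an SDK expects a list rather than a string, passing the sorted list (and letting a single code path build it) gives the same stability.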

TTL gaps in human-paced workflows. A 5-minute TTL doesn't survive a coffee break. Support agents leaving a chat idle for 8 minutes, code assistants sitting idle while a developer reads a diff or runs tests, customer-facing forms with multi-minute fields: all of these will pay full write cost on the next interaction. Either move to the 1-hour TTL where supported, or accept the math.

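Forgetting the write cost. A common spreadsheet error treats cached reads as the only cost. Anthropic's 5-minute write is 1.25x input and the 1-hour write is 2.0x input, which is meaningful when the cache is hit only once or twice before eviction. Measured against paying 1.0x on every call, a 5-minute cache is already ahead after one read (1.25 + 0.1 = 1.35 versus 2.0 uncached). A 1-hour cache needs two reads (2.0 + 0.2 = 2.2 versus 3.0), because a single read leaves it slightly behind (2.1 versus 2.0).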

Workspace and account isolation. Since February 2026, Anthropic prompt caches are isolated per workspace (not per organisation) on the Claude API. AWS Bedrock caches are per AWS account. If you run multi-tenant traffic across workspaces or accounts, you don't get cross-tenant cache benefits.

Batch API confusion on Bedrock. Worth restating: AWS Bedrock's batch inference does not currently support prompt caching, even though Anthropic's direct message batches do. Don't assume parity across hosting platforms.


Where this is heading in 2026

Three shifts are worth tracking through the rest of the year.

The first is cache-aware routing in agent frameworks. LangChain, LlamaIndex, and AI SDK have all shipped first-class prompt-cache support, with framework-level helpers for setting cache_control markers and inspecting hit rates. Expect prompt caching to be on by default in most agent scaffolds by Q3 2026.

The second is provider competition on read discounts. Anthropic and Bedrock's 90% cached-read discount has set the bar; OpenAI's automatic caching gives a less aggressive 50–75% off but with zero developer effort. Watch for OpenAI to close the gap with newer models: the GPT-5.5 cached rate ($1.25/M against $5/M) already hints at where this is going.

The third is longer TTLs and persistent caches. The 5-minute default is a GPU-memory-cost compromise, not a hard limit. Anthropic, Gemini, and Bedrock all now offer 1-hour options; expect "session-length" caches measured in days to appear on enterprise tiers as memory tiering improves. Anthropic has hinted at this on the public roadmap; Gemini's CachedContent API already supports arbitrary TTL with hourly storage billing.

For most teams in 2026, the question isn't whether to use prompt caching — it's how aggressively to instrument and lint the prefix so the cache actually hits. The default behaviour buys you something. The deliberate practice buys you the headline 90%.

For related reading on adjacent topics, see the best AI testing tools 2026 breakdown and the deeper look at AI test generation with LLMs, both of which touch on production LLM cost patterns. The official docs are the source of truth on syntax: Anthropic prompt caching, OpenAI prompt caching, and Gemini context caching.


FAQ

What is the difference between Anthropic's 5-minute and 1-hour cache?

Both are "ephemeral" caches that store the same key/value tensors, just with different eviction policies. The 5-minute TTL costs 1.25x input on write and is the default if you don't specify a TTL. The 1-hour TTL costs 2.0x input on write and is opted into with "ttl": "1h" in the cache_control object. Cached reads are identical (0.1x input) for both. Use the 1-hour cache when traffic is bursty enough that you'd otherwise pay write costs more than once an hour.
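
The choice between the two TTLs comes down to how often the 5-minute cache would have to be rewritten. The sketch below compares hourly write cost in units of base input price for the cached block, assuming a single 1-hour write keeps the block warm for the whole hour and that read costs are the same either way; `cheaper_ttl` is a hypothetical helper.

```python
def cheaper_ttl(writes_per_hour_with_5m: int) -> str:
    """Pick the TTL with the lower hourly write cost, in units of base input price."""
    cost_5m = writes_per_hour_with_5m * 1.25  # each re-write after expiry costs 1.25x
    cost_1h = 2.0                             # one write assumed to cover the hour
    return "1h" if cost_1h < cost_5m else "5m"


for writes in (1, 2, 4):
    print(writes, "write(s)/hour with the 5-minute TTL ->", cheaper_ttl(writes))
```

With one write per hour the 5-minute cache is cheaper (1.25 versus 2.0); from two writes per hour onward the 1-hour cache wins (2.0 versus 2.5 or more).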

Does prompt caching change the model's output?

No. Prompt caching is purely a runtime optimisation — it reuses pre-computed attention state for the static prefix but the model still generates from the same distribution. Outputs are deterministic to the same degree they would be without caching (i.e., still affected by temperature and sampling settings).

How do I know if my cache is actually hitting?

Every provider returns cache statistics in the response. On Anthropic, check usage.cache_creation_input_tokens (new write) and usage.cache_read_input_tokens (hit). On OpenAI, check usage.prompt_tokens_details.cached_tokens. On Gemini, check usage_metadata.cached_content_token_count. If those values are zero on a long prompt, your prefix isn't stable.

Can I cache the user's message too?

Generally no — user messages are the variable part of the prompt and would invalidate the cache on every call. Anthropic does let you place cache_control markers on any block, including assistant turns in a long conversation, which lets multi-turn chats progressively extend the cache. But the first cacheable block has to be a stable prefix.

What happens to the cache when I deploy a new system prompt?

You invalidate it. Every active cache that was keyed against the old prompt becomes a miss on the next call, and you'll pay write cost to re-establish the new prefix. For high-traffic deployments, this is a real cost spike — schedule prompt changes carefully, and consider rolling them out behind a per-tenant flag to avoid invalidating every cache at once.

Does prompt caching help with output cost?

No. Caching only reduces the input (prefill) cost. Output tokens are generated fresh on every call at full output price. Workloads with short prompts and long outputs (creative writing, long-form generation) see modest savings from caching; workloads with long prompts and short outputs (extraction, classification, RAG Q&A) see the biggest savings.

Is prompt caching the same as a semantic cache?

No. Prompt caching reuses pre-computed attention for a byte-identical prefix on the same provider. A semantic cache sits in front of the model and returns a stored response when an incoming query is semantically similar to a previous one — skipping the model call entirely. They solve different problems and stack well: semantic cache for full deflection, prompt cache for the calls that get through.


Where Crosscheck fits

Cost-optimised LLM systems are still LLM systems — they fail in weird ways, the prompts change under you, and the bug reports your team files about them are only as useful as the context they capture. Crosscheck is a free Chrome extension that grabs the screenshot, console errors, network calls, and replay automatically when a tester hits an issue, so the engineer fixing the LLM pipeline doesn't have to reconstruct what happened from a Slack message.
