Prompt Caching Calculator

Estimate cache hit rates, compare caching strategies, and project monthly cost savings for Claude prompt caching. Input your traffic patterns and system prompt size to see how much you can save.

By Michael Lip · May 25, 2026

Traffic Configuration

Min: 1,024 (Haiku) or 2,048 (Sonnet/Opus) for caching
120
Average requests per hour across a typical day
16
5
Anthropic default: 5 min. Resets on every cache hit.
Daily Requests
1,920
Monthly Requests
57,600
Cache Hit Rate
99.2%

Cost Comparison

Full Cache

-85%
System prompt cached

Partial Cache

-60%
50% of prompt cached

No Cache

$0
Baseline cost

Monthly Cost Projection

$0.00
Estimated monthly savings with full caching
No Cache / mo
$0.00
Full Cache / mo
$0.00
Partial Cache / mo
$0.00

What This Calculator Does

The Prompt Caching Calculator helps you estimate the financial impact of enabling prompt caching on the Claude API. Prompt caching is a feature that stores the processed representation of static content (system prompts, few-shot examples, long context documents) so that subsequent requests do not need to reprocess those tokens. This reduces both latency and cost. The calculator takes your traffic patterns, system prompt size, and model selection as inputs, then projects monthly costs under three scenarios: no caching, full caching, and partial caching.

The core calculation models cache hit rates based on your request frequency relative to the cache TTL. If you send at least one request within every TTL window (5 minutes by default), the cache stays warm and every request after the initial write benefits from the 90% discount on cached input tokens. The calculator factors in cache write costs (25% premium on the first request), cache read savings (90% discount), and the probability of cache misses based on your traffic distribution across active hours.

How Prompt Caching Works Internally

When you send a request with cache_control blocks, Anthropic's infrastructure checks whether the exact sequence of tokens up to that cache breakpoint already exists in the cache. If it does (a cache hit), the server skips the computation-heavy prefill phase for those tokens and loads the precomputed key-value attention pairs directly. This is why cached requests are up to 85% faster: the model does not need to process the cached tokens through every transformer layer again.

The cache is keyed on the exact token sequence, meaning any change to the cached content (even a single character) invalidates the cache. This is why prompt caching works best with truly static content like system prompts, reference documents, and fixed few-shot examples. Dynamic content like user messages and conversation history should come after the cache breakpoint so they do not invalidate the cached prefix.

You can place up to 4 cache breakpoints in a single request, creating a multi-layer caching strategy. The first breakpoint typically covers the system prompt, the second can cover a fixed conversation prefix or few-shot examples, and additional breakpoints can cover semi-static context that changes less frequently than the user message. Each layer is cached independently, so a change in the second layer does not invalidate the first. For monitoring your actual cache hit rates and token usage in production, KickLLM provides LLM cost tracking dashboards.

Understanding the Three Caching Strategies

The Full Cache strategy caches the entire system prompt and any static context. This maximizes savings because the largest token payload (system prompts are typically 1,000 to 10,000+ tokens) gets the 90% read discount on every cache hit. The trade-off is that any modification to the system prompt invalidates the entire cache and triggers a new cache write at the 25% premium. This strategy works best for applications with a stable system prompt that changes infrequently, such as customer support bots, code assistants, and content generation tools.

The Partial Cache strategy caches only the most stable portion of your prompt prefix. For example, if your system prompt has a fixed instructions section (2,000 tokens) and a dynamic context section (2,000 tokens) that changes per user, you cache only the fixed section. This gives you a smaller per-request savings but higher cache hit rates because the cached portion never changes. This strategy is ideal when your prompts have both stable and volatile sections, such as RAG applications where the system instructions are fixed but the retrieved documents change per query.

The No Cache baseline shows what you would pay without any caching. Every request processes the full system prompt as standard input tokens at full price. Comparing this against the cached strategies shows the exact dollar value of implementing caching. For most applications with system prompts over 2,000 tokens and more than 10 requests per hour, caching pays for itself within the first few minutes of operation.

Cache Hit Rate Estimation Model

The calculator estimates cache hit rates using a simple model based on request frequency and TTL. If your average inter-request gap is shorter than the TTL, the cache stays warm and the hit rate approaches 100% minus the fraction of requests that are initial cache writes (one per TTL window). For example, with 120 requests per hour and a 5-minute TTL, you get approximately 10 requests per TTL window, meaning 1 out of every 10 is a cache write and 9 are cache hits, yielding a 90% hit rate within each window. However, because the TTL resets on every hit, the actual hit rate is higher: closer to 99% once the cache is established.

The model also accounts for traffic gaps between active hours. If your application has 16 active hours per day, there are 8 hours where no requests arrive, causing the cache to expire. Each new active period starts with a cache write. For applications that run 24/7 with consistent traffic, the cache effectively never expires and the hit rate approaches the theoretical maximum. For applications with bursty traffic, the calculator adjusts the hit rate based on the expected number of cold starts per day. For side-by-side model comparison to choose the right model for your caching strategy, try LockML.

Pricing Reference for Caching

Prompt caching uses three price tiers for input tokens. Standard input is the base rate: $0.80/M for Haiku 3.5, $3.00/M for Sonnet 4, $15.00/M for Opus 4. Cache write is 25% more than standard: $1.00/M for Haiku, $3.75/M for Sonnet, $18.75/M for Opus. Cache read is 90% less than standard: $0.08/M for Haiku, $0.30/M for Sonnet, $1.50/M for Opus. Output tokens are always charged at the standard rate regardless of caching.

The break-even analysis is straightforward. A cache write costs 1.25x standard. A cache read costs 0.10x standard. After the initial write, every subsequent cache hit saves 0.90x standard. So you break even after just 1.4 cache reads per write (1.25 / 0.90). In practice, this means caching is profitable whenever you send more than 2 requests with the same cached prefix within a 5-minute window. For high-traffic applications, the savings compound dramatically.

When Prompt Caching Does Not Help

Caching provides minimal benefit in several scenarios. If your system prompt is below the minimum token threshold (1,024 for Haiku, 2,048 for Sonnet/Opus), caching is not available. If every request has a unique system prompt that changes per user or per session, you never get cache hits and you pay the 25% write premium on every request, actually increasing costs. If your traffic is very low (fewer than 1 request per 5 minutes), the cache expires between requests and you pay the write premium repeatedly without sufficient reads to recoup the cost.

Caching also does not reduce output token costs, which are often the dominant cost component for applications that generate long responses. If your application generates 2,000 output tokens per request with a 500-token system prompt, caching the system prompt saves a fraction of the total cost. The calculator shows both input and output cost breakdowns so you can see the relative impact of caching on your total spend. For webhook-driven architectures that benefit from caching, InvokeBot provides webhook management tools.

Implementation Tips

To implement prompt caching, add a cache_control block with {"type": "ephemeral"} to the last content block you want cached. Place this on the system prompt for simple cases, or on the last element of a multi-turn conversation prefix for conversation caching. The API response includes cache_creation_input_tokens and cache_read_input_tokens fields in the usage object so you can monitor actual hit rates.

For production monitoring, track three metrics: cache hit rate (cache_read_input_tokens divided by total cached tokens), cost per request with and without caching, and cache-related latency improvements. Most applications see P50 latency drop by 60-80% for cached requests due to skipping the prefill computation. Use the ClaudKit API Request Builder to generate properly formatted requests with cache_control blocks, and review the API Error Guide for troubleshooting cache-related issues.

Frequently Asked Questions

How does Claude prompt caching work?

Claude prompt caching stores the processed representation of your system prompt and static message prefixes on Anthropic's servers. When a subsequent request uses the same cached content, the API skips reprocessing those tokens, reducing latency by up to 85% and cost by 90% on cached input tokens. You enable caching by adding a cache_control block with type "ephemeral" to the content you want cached. The cache has a 5-minute TTL that resets with each cache hit, so active applications maintain persistent caches.

How much does prompt caching save on Claude API costs?

Cached input tokens cost 90% less than standard input tokens. For example, on Claude Sonnet 4, standard input costs $3.00 per million tokens while cached input costs $0.30 per million tokens. However, cache writes cost 25% more than standard input at $3.75 per million tokens. The break-even point is typically 2 requests with the same cached content. For high-volume applications with stable system prompts, monthly savings can exceed 80% of input token costs.

What is the minimum token count for prompt caching?

Prompt caching requires a minimum of 1,024 tokens for Claude Haiku 3.5 and 2,048 tokens for Claude Sonnet 4 and Opus 4. Content below these thresholds cannot be cached. If your system prompt is below the minimum, consider adding few-shot examples, detailed instructions, or reference documentation to the cached content to reach the threshold while improving response quality.

What is the cache TTL and how does it refresh?

The cache time-to-live (TTL) is 5 minutes. Every cache hit resets the TTL, so as long as at least one request uses the cached content within every 5-minute window, the cache stays active indefinitely. If no requests arrive within 5 minutes, the cache expires and the next request triggers a cache write at the 25% premium. For applications with bursty traffic, this means you may pay cache write costs during low-traffic periods and benefit from cache read savings during high-traffic periods.

Can I cache different parts of the prompt separately?

Yes, you can use up to 4 cache breakpoints in a single request. This enables multi-layer caching strategies. For example, cache a long system prompt as the first layer and a conversation history prefix as the second layer. Each breakpoint creates a separate cached segment. The API processes cache breakpoints from the beginning of the prompt forward, so place the most stable content first. Partial caching works well when you have both stable and dynamic content in your prompts.

Developer and creator of the Zovo Tools network. Building free, privacy-first developer tools that run entirely in the browser. No tracking, no sign-ups, no server-side processing. Open source on GitHub.