
In March 2026, Anthropic quietly cut the prompt cache TTL from 60 minutes to 5 minutes. No changelog. No announcement. Just a spike in token bills and a GitHub issue with 200+ thumbs-up from developers who saw their API costs jump 30–60% overnight. The only signal the API gave: cache_read_input_tokens dropping to zero. No error. No reason. Nothing to go on.
Cache diagnostics — now in public beta — fixes that blind spot. Add one header and one parameter, and the API tells you exactly why your cache missed: the model changed, your system prompt has a timestamp buried in it, your tool definitions serialized in the wrong order. Stop guessing. Start fixing.
What Cache Diagnostics Actually Does
Prompt caching cuts input token costs by 90% on cache hits. But the cache only works when the beginning of your prompt is byte-for-byte identical to a recent request. One dynamic value, one reordered tool definition, one model switch — and the entire cached prefix is invalidated. Until now, developers had no way to know which of those was the culprit.
Cache diagnostics closes the gap by comparing consecutive request fingerprints. It does not store your prompt text — only cryptographic hashes and token-count estimates scoped to your workspace. It is Zero Data Retention eligible and adds no cost.
Enabling It Takes Three Steps
Add the beta header cache-diagnosis-2026-04-07 to every request. On the first turn, pass previous_message_id: null to opt in. On every subsequent turn, pass the id from the previous response. The API does the comparison automatically.
import anthropic
client = anthropic.Anthropic()
SYSTEM = "You are a technical assistant. Here is the full API reference: <docs>...</docs>"
# Turn 1: opt in with previous_message_id=None
r1 = client.beta.messages.create(
model="claude-opus-4-7",
max_tokens=1024,
cache_control={"type": "ephemeral"},
system=SYSTEM,
messages=[{"role": "user", "content": "Explain authentication."}],
diagnostics={"previous_message_id": None},
betas=["cache-diagnosis-2026-04-07"],
)
# Turn 2: compare against the previous request
r2 = client.beta.messages.create(
model="claude-opus-4-7",
max_tokens=1024,
cache_control={"type": "ephemeral"},
system=SYSTEM,
messages=[
{"role": "user", "content": "Explain authentication."},
{"role": "assistant", "content": r1.content},
{"role": "user", "content": "Now explain rate limiting."},
],
diagnostics={"previous_message_id": r1.id},
betas=["cache-diagnosis-2026-04-07"],
)
if r2.diagnostics is None:
print("No divergence — cache working correctly.")
elif r2.diagnostics.cache_miss_reason:
reason = r2.diagnostics.cache_miss_reason
print(f"Cache miss: {reason.type}")
print(f"Tokens lost: ~{reason.cache_missed_input_tokens:,}")
For multi-turn conversations, carry prev_id = r.id forward on each iteration and pass it as previous_message_id. The pattern is the same — you thread it through the loop and log any miss reasons as they appear.
Six Reasons Your Cache Is Missing
The cache_miss_reason.type field is a discriminated union. The API reports the earliest divergence only — fix it first, because later ones may be hidden behind it.
- model_changed: A router, A/B test, or fallback selected a different model. The cache is per-model. Hold the model constant within a cached conversation.
- system_changed: The most common cause. A timestamp, request UUID, or user ID is being interpolated into the system prompt. Every request gets a unique system prompt, every request misses the cache. Move dynamic data into the first user message after your cache breakpoint.
- tools_changed: Tools were added, removed, reordered, or their JSON schemas were serialized non-deterministically. Send the same tool list in a fixed order with sorted, deterministic serialization on every turn.
- messages_changed: Earlier conversation history was truncated or edited rather than appended. Treat history as append-only — echo assistant content and tool results back verbatim.
- previous_message_not_found: No fingerprint exists for the supplied ID. The beta header was missing on the prior request, requests came from different workspaces, or too much time passed between turns.
- unavailable: Another prompt-affecting parameter changed (
thinking,tool_choice,context_management, active beta headers) or the conversation is too long for the comparison horizon. Keep all prompt-affecting parameters constant across a cached session.
Each *_changed type also carries a cache_missed_input_tokens estimate — the number of tokens that fell after the divergence point. This is a magnitude indicator, not a billing number, but it tells you how much cacheable prefix you are burning on every request.
Read Diagnostics and Usage Together
The diagnostic result and usage.cache_read_input_tokens answer different questions. Combining them tells you what to do:
- Diagnostics null + high cache reads: Working correctly. No action needed.
- Diagnostics null + near-zero cache reads: Requests match, but the cache entry expired before the next request arrived. Shorten gaps between turns, or switch to the 1-hour cache TTL.
- *_changed reason + near-zero cache reads: Your requests are changing. Fix the cause indicated by the type field.
- *_changed reason + high cache reads: Rare. A late change occurred but an earlier cache breakpoint still hit. Low impact, still worth fixing.
The Cost Math Is Not Abstract
On Claude Sonnet 4.6, cache reads cost $0.30 per million tokens — 90% cheaper than the $3.00 standard input rate. For a workload with 50,000-token system prompts running 100 requests per day, the difference between a functioning cache (90% hit rate) and a broken one (30% hit rate) is roughly $10.50 per day, or about $3,800 per year. That is a lot of money to lose to a stray timestamp.
The diagnostics feature costs nothing extra. One header. One parameter. Add it to your next deploy, check your first missed turn in the console, and find out what has been draining your token budget. The March TTL change already happened — the only variable left is whether you can see the misses coming.
One Hard Limit to Know
Cache diagnostics is available on the Claude API only. It does not work on Amazon Bedrock or Vertex AI. If you route Claude requests through either platform, you need to go direct to the Anthropic API to use this feature. Given what a functioning prompt cache is worth at scale, that routing change pays for itself quickly.













