I've been mostly using Kimi as a hacker of sorts, putting it in places where I want to attach AI directly via their API, since their plans are not completely user-hostile. Need to do OCR for scanning Magic: The Gathering cards? Sure! Attach it to X4: Foundations as an AI manager for some stuff? Sounds fun. Can't really do that with Claude.
The cache is stored on Anthropic's servers, since it's a save state of the LLM's inner workings at the time of processing, and it's several gigs in size. Every SINGLE TIME you send a message and it's a cache miss, the entire conversation has to be reprocessed, eating up tons of tokens in the process.
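For what it's worth, you can watch the hit/miss accounting yourself if you call the API directly. A minimal sketch with the official TypeScript SDK (the model name is a placeholder and the long prompt constant is a stand-in):

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const hugeSystemPrompt = "...many thousands of tokens of instructions and code..."; // stand-in

const response = await client.messages.create({
  model: "claude-sonnet-4-5", // placeholder model name
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: hugeSystemPrompt,               // the part worth caching
      cache_control: { type: "ephemeral" }, // mark a cache breakpoint here
    },
  ],
  messages: [{ role: "user", content: "What does the build script do?" }],
});

// On a miss you pay full price to build the cache; on a hit you pay a fraction to read it.
console.log(response.usage.cache_creation_input_tokens); // large on the first call
console.log(response.usage.cache_read_input_tokens);     // large on warm follow-ups
```

Let the cache TTL lapse and resend the same prefix, and the creation counter spikes again; that's exactly the reprocessing being described here.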
Clarification though: the cache that matters to the GPU/NPU is loaded directly into the memory of the cards; it's not saved anywhere else. They could technically create cold storage of the tokens (vectors) and load that back in, but given how ephemeral all these vibe coders are, it's unlikely there's any value in saving those vectors.
So then it comes down to what you're talking about, which is processing the entire text chain (a different kind of cache), and generating the equivalent tokens is what's being costed.
But once you realize that the product's efficiency in extended sessions comes from a cache sitting in the GPU hardware itself, it's obvious that an oversold product can't just let the GPU sit there when sessions idle.
So to defend them a little: it's a cache, it has to live somewhere, and it's a save state of the model's inner workings at the time of the last message. So if it expires, the whole thing has to be processed again. Most people don't understand that without that cache, the ENTIRE history of the conversation is processed again and again on every message. That conversation might have hit several gigs' worth of intermediate state; are you expecting them to keep that around for /all/ of the conversations you have had with it in separate sessions?
No? It's not because it's a cache; it's because they're scared of letting you see the thinking trace. If you had the trace, you could just send it back in full when it got evicted from the cache. That's how open-weight models work.
The issue is that if they send the full trace back, it has to be processed from the start once the cache expires, and doing that causes a huge one-time hit against your token limit if the session has grown large.
So what Boris talked about is stripping things out of the trace that gets sent back to regenerate the session if the cache expires. Doing this would help avoid burning up the token limit, but it's then technically a different conversation, so if CC chooses poorly about which parts of the context to strip, Claude ends up all scatter-brained.
They literally can. They could make the API free to use if they wanted. There is no law that says the price has to equal the cost of processing the request.
I'm not familiar with the Claude API, but OpenAI has an encrypted thinking messages option. You get something that you can send back, but it's encrypted. Not available on Anthropic?
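For the curious, the OpenAI version is on the Responses API; a rough sketch of the shape (the model name is a placeholder, and this is from my reading of the docs, so verify the flags):

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Opt out of server-side storage and ask for the reasoning back, encrypted.
const first = await client.responses.create({
  model: "o4-mini", // placeholder reasoning model
  input: "Plan the refactor.",
  store: false,
  include: ["reasoning.encrypted_content"],
});

// first.output contains opaque reasoning items; persist them yourself and
// feed them back verbatim on the next turn to keep the chain of thought intact.
const second = await client.responses.create({
  model: "o4-mini",
  input: [
    ...first.output, // includes the encrypted reasoning blobs
    { role: "user", content: "Now write the first patch." },
  ],
  store: false,
  include: ["reasoning.encrypted_content"],
});
```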
No, of course it's unrealistic for them to hold the cache indefinitely, and that's not the point. You are keeping the session data yourself so you can continue even after cache expiry. The point I'm making is that it made me very angry that, without any announcement, they changed the behavior to strip the old thinking even when you have it in your session file. There is absolutely no reason not to ask users whether they want this.
And it's part of a larger problem of unannounced changes. It's just like when they introduced adaptive thinking in 4.6 a few weeks ago without notice.
Also, they seem to be completely unaware that some users might only use Claude Code because they are used to it not stripping thinking, in contrast to Codex.
Anyway, I'm happy that they saw it as a valid refund reason.
It seems like an opportunity for a hierarchical cache. Instead of just nuking all context on eviction, couldn’t there be an L2 cache with a longer eviction time so task switching for an hour doesn’t require a full session replay?
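As a toy illustration of the idea, here's a two-tier TTL cache in TypeScript. The names and TTLs are made up, and the real thing would be KV pages migrating between HBM and host RAM or disk rather than values in Maps:

```typescript
type Entry<V> = { value: V; expiresAt: number };

class TieredCache<V> {
  private l1 = new Map<string, Entry<V>>(); // "GPU memory": small, short TTL
  private l2 = new Map<string, Entry<V>>(); // "host RAM / disk": bigger, longer TTL

  constructor(
    private l1TtlMs = 5 * 60_000,  // minutes, like today's prompt caches
    private l2TtlMs = 60 * 60_000, // an hour, covering a task-switch break
  ) {}

  set(key: string, value: V): void {
    const now = Date.now();
    this.l1.set(key, { value, expiresAt: now + this.l1TtlMs });
    this.l2.set(key, { value, expiresAt: now + this.l2TtlMs });
  }

  get(key: string): V | undefined {
    const now = Date.now();
    const hot = this.l1.get(key);
    if (hot && hot.expiresAt > now) return hot.value; // L1 hit: cheapest path
    this.l1.delete(key);

    const warm = this.l2.get(key);
    if (warm && warm.expiresAt > now) {
      // L2 hit: promote back to L1; slower than an L1 hit, far cheaper than replay.
      this.l1.set(key, { value: warm.value, expiresAt: now + this.l1TtlMs });
      return warm.value;
    }
    this.l2.delete(key);
    return undefined; // full miss: caller must replay the whole session
  }
}
```

The catch, as others note below, is that the L2 copy is huge and reloading it to the GPU isn't free; it only wins if the transfer is meaningfully cheaper than a full prefill.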
Living where? If it's in the GPU, then it's still taking up precious space that could be used for serving other sessions. If it's not in the GPU, then it doesn't help.
What matters isn't that it's a cache; what matters is that it's cached _in GPU/NPU memory_, taking up space from another user's active session. Keeping that cache in the GPU is a nonstarter for an oversold product. Even putting it into cold storage means they still have to load it back at the cost of compute, because, again, it takes up space in an oversold product.
The cache is on Anthropic's servers; it's like a freeze-frame of the LLM's inner workings at the time. The LLM can pick up directly from this save state. As you can guess, this save state has bits of the underlying model, their secret sauce, so it cannot be saved locally...
Maybe they could let users store an encrypted copy of the cache? Since the users wouldn't have Anthropic's keys, it wouldn't leak any information about the model (beyond perhaps its number of parameters judging by the size).
I'm unsure of the sizes needed for a prompt cache, but I suspect it's several gigs (some percentage of the model weight size). How would the user upload this every time they resumed an old idle session? And are they going to save /every/ session you do this with?
A few gigs of disk is not that expensive. Imo they should allocate every paying user (at least) one disk cache slot that never expires on a timer. Use it for their most recent long chat (a very short question-and-answer that could easily be replayed shouldn't evict a long convo).
I don't know how large the cache is, but Gemini guessed that the quantized cache size for Gemini 2.5 Pro / Claude 4 with a 1M context could be 78 gigabytes. ChatGPT guessed even bigger numbers. If someone is able to deliver a more precise estimate, you're welcome to :-).
So it would probably be quite a long transfer to perform in these cases; probably not very feasible to implement at scale.
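For anyone who wants to sanity-check those guesses, the back-of-envelope formula is simple. Every architecture number below is invented, since the real configs aren't public:

```typescript
// Per-token KV cache = 2 (K and V) * layers * kvHeads * headDim * bytesPerElem
const layers = 60;       // guess
const kvHeads = 8;       // guess, assuming grouped-query attention
const headDim = 128;     // guess
const bytesPerElem = 1;  // assuming an fp8-quantized cache
const tokens = 1_000_000;

const perTokenBytes = 2 * layers * kvHeads * headDim * bytesPerElem; // 122,880 B/token
const totalGiB = (perTokenBytes * tokens) / 1024 ** 3;
console.log(`${totalGiB.toFixed(0)} GiB for a full 1M-token context`); // ~114 GiB
```

So tens to low hundreds of gigabytes per maxed-out session; the 78 GB guess is at least the right order of magnitude, and so is the conclusion about the transfer.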
What's lost in this thread is that these caches are in very tight supply - they live literally on the GPUs running inference. The GPUs must load all the tokens in the conversation (expensive), and continuing the conversation can then leverage the GPU cache to avoid re-loading the full context up to that point. But obviously GPUs are in super tight supply, so if a thread has been dead for a while, they need to reuse the GPU for other customers.
They could let you nominate an S3 bucket (or Azure/GCP/etc equivalent). Instead of dropping data from the cache, they encrypt it and save it to the bucket; on a cache miss they check the bucket and try to reload from it. You pay for the bucket; you control the expiry time for it; if it costs too much you just turn it off.
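A sketch of what the provider-side half might look like, assuming AES-256-GCM with a provider-held key the user never sees. Everything here is hypothetical; no provider exposes anything like this today:

```typescript
import { S3Client, PutObjectCommand, GetObjectCommand } from "@aws-sdk/client-s3";
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

const s3 = new S3Client({});
const MASTER_KEY = randomBytes(32); // provider-held; in reality from a KMS, never shared

// On eviction: encrypt the KV blob and park it in the user's nominated bucket.
async function offloadCache(bucket: string, sessionId: string, kvBlob: Buffer): Promise<void> {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", MASTER_KEY, iv);
  const body = Buffer.concat([iv, cipher.update(kvBlob), cipher.final(), cipher.getAuthTag()]);
  await s3.send(new PutObjectCommand({ Bucket: bucket, Key: sessionId, Body: body }));
}

// On a cache miss: try the bucket before falling back to a full replay.
async function reloadCache(bucket: string, sessionId: string): Promise<Buffer | null> {
  try {
    const res = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: sessionId }));
    const body = Buffer.from(await res.Body!.transformToByteArray());
    const iv = body.subarray(0, 12);
    const tag = body.subarray(body.length - 16);
    const data = body.subarray(12, body.length - 16);
    const decipher = createDecipheriv("aes-256-gcm", MASTER_KEY, iv);
    decipher.setAuthTag(tag);
    return Buffer.concat([decipher.update(data), decipher.final()]);
  } catch {
    return null; // bucket miss or expired object: full session replay it is
  }
}
```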
Encryption can only ensure the confidentiality of a message from a non-trusted third party, but when that non-trusted third party happens to be your own machine hosting Claude Code, it's pointless. You can always dump the keys that were used to encrypt/decrypt the message from your machine's memory and use them to reconstruct the model weights from the same dump.
jetbalsa said that the cache is on Anthropic's server, so the encryption and decryption would be server-side. You'd never see the encryption key, Anthropic would just give you an encrypted dump of the cache that would otherwise live on its server, and then decrypt with their own key when you replay the copy.
Oh god yes, I've been trying to make an LLM-assisted Magic: The Gathering card scanner... it's been a hell of a time trying to get it to just OCR card names well...
Yep, it's pretty damn good compared to classic OCR, and even compared to the more lightweight models I can run locally. The cards just vary too much over time.
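For reference, the whole "pipeline" is basically one vision call against an OpenAI-compatible endpoint. Rough sketch; the base URL and model name are assumptions, so check Moonshot's docs:

```typescript
import OpenAI from "openai";
import { readFileSync } from "node:fs";

// Kimi's API is OpenAI-compatible; baseURL and model below are assumptions.
const kimi = new OpenAI({
  baseURL: "https://api.moonshot.ai/v1",
  apiKey: process.env.MOONSHOT_API_KEY,
});

const b64 = readFileSync("card.jpg").toString("base64");

const res = await kimi.chat.completions.create({
  model: "kimi-latest", // placeholder
  messages: [
    {
      role: "user",
      content: [
        { type: "image_url", image_url: { url: `data:image/jpeg;base64,${b64}` } },
        { type: "text", text: "Return only the card name printed in the title bar, nothing else." },
      ],
    },
  ],
});

console.log(res.choices[0].message.content?.trim()); // e.g. "Lightning Bolt"
```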
I will note that they really should have used something like ncurses and kept the animations down. TTYs are NOT meant to do the level of crazy modern TUIs are trying to pull off; there are just too many terminal emulators out there that don't like the weird control codes being sent around.
That is why TypeScript is the main one used by most people vibe coding. The LLMs do like to work around the type engine sometimes, but strong typing and linting can help a ton.
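A concrete example of both the escape hatch and the payoff (hypothetical snippet; the rule named in the comment is the standard typescript-eslint one):

```typescript
const raw = '{"name": "Lightning Bolt"}';

// Typical LLM escape hatch: silence the type checker instead of modeling the data.
const loose = JSON.parse(raw) as any; // flagged by @typescript-eslint/no-explicit-any
console.log(loose.nmae);              // typo compiles fine, prints undefined at runtime

// With a declared shape, the same typo dies at compile time.
interface Card { name: string }
const typed = JSON.parse(raw) as Card;
// console.log(typed.nmae); // error TS2339: Property 'nmae' does not exist on type 'Card'
console.log(typed.name);
```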
I am kind of doing that now. I put Kimi K2.5 into a Ralph loop to make a Screeps.com AI. So far it's been awful at it. If you want to track its progress, I have its dashboard at https://balsa.info
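(For anyone unfamiliar, a Ralph loop is nothing fancy: the same prompt fired at the agent forever, with the repo as the only memory between runs. A sketch, with the CLI name and flags as made-up stand-ins:)

```typescript
import { execSync } from "node:child_process";

// Ralph loop: invoke the agent with an identical prompt until you kill the process.
// The repo, its tests, and PROMPT.md carry all state from one iteration to the next.
while (true) {
  try {
    execSync("kimi-agent --prompt-file PROMPT.md --auto-approve", { stdio: "inherit" }); // hypothetical CLI
  } catch {
    // The agent crashed or the run failed; just go again.
  }
}
```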
What were you using it for? Claude is really good at agentic stuff. For pure coding I can see Codex being better, but for the entire workflow, I'm not sure.
I use Codex purely for coding, and that's 90% of my use case for AI in general (10% using ChatGPT web for misc stuff). I pop out to Opus in Claude Code regularly to try to stay up on their relative performance, but so far the primary value I've been able to derive from CC is as a second set of eyes for code review / poking holes in plans. For primary planning / debugging / implementation Codex outclasses it atm sadly.