I've been mostly using Kimi as a hacker of sorts, putting it in places where I want to attach AI directly via their API, since their plans are not completely user-hostile. Need to do OCR for scanning Magic: The Gathering cards? Sure! Attach it to X4: Foundations as an AI manager for some stuff? Sounds fun. Can't really do that with Claude.
The cache is stored on Anthropic's servers, since it's a save state of the LLM's inner workings at the time of processing, and it's several gigs in size. Every SINGLE TIME you send a message and it's a cache miss, the entire conversation has to be reprocessed, eating up tons of tokens in the process.
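For what it's worth, you can watch the hit/miss accounting yourself if you call the API directly. A minimal sketch with the official TypeScript SDK (the model name is a placeholder and the long prompt constant is a stand-in):

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const hugeSystemPrompt = "...many thousands of tokens of instructions and code..."; // stand-in

const response = await client.messages.create({
  model: "claude-sonnet-4-5", // placeholder model name
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: hugeSystemPrompt,               // the part worth caching
      cache_control: { type: "ephemeral" }, // mark a cache breakpoint here
    },
  ],
  messages: [{ role: "user", content: "What does the build script do?" }],
});

// On a miss you pay full price to build the cache; on a hit you pay a fraction to read it.
console.log(response.usage.cache_creation_input_tokens); // large on the first call
console.log(response.usage.cache_read_input_tokens);     // large on warm follow-ups
```

Let the cache TTL lapse and resend the same prefix, and the creation counter spikes again; that's exactly the reprocessing being described here.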
Clarification though: the cache that matters to the GPU/NPU is loaded directly into the memory of the cards; it's not saved anywhere else. They could technically create cold storage of the tokens (vectors) and load that back in, but given how ephemeral all these vibe coders are, it's unlikely there's any value in saving those vectors.
So then it comes down to what you're talking about, which is processing the entire text chain (a different kind of cache), and generating the equivalent tokens is what's being costed.
But once you realize that the product's efficiency in extended sessions comes from a cache sitting in the GPU hardware itself, it's obvious that an oversold product can't just let the GPU sit there when sessions idle.
So to defend them a little: it's a cache, it has to live somewhere, and it's a save state of the model's inner workings at the time of the last message. So if it expires, the whole thing has to be processed again. Most people don't understand that without that cache, the ENTIRE history of the conversation is processed again and again on every message. That conversation might have hit several gigs' worth of intermediate state; are you expecting them to keep that around for /all/ of the conversations you have had with it in separate sessions?
No? It's not because it's a cache; it's because they're scared of letting you see the thinking trace. If you had the trace, you could just send it back in full when it got evicted from the cache. That's how open-weight models work.
The issue is that if they send the full trace back, it has to be processed from the start once the cache expires, and doing that causes a huge one-time hit against your token limit if the session has grown large.
So what Boris talked about is stripping things out of the trace that gets sent back to regenerate the session if the cache expires. Doing this would help avoid burning up the token limit, but it's then technically a different conversation, so if CC chooses poorly about which parts of the context to strip, Claude ends up all scatter-brained.
They literally can. They could make the API free to use if they wanted. There is no law that says the price has to equal the cost of processing the request.
I'm not familiar with the Claude API, but OpenAI has an encrypted thinking messages option. You get something that you can send back, but it's encrypted. Not available on Anthropic?
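For the curious, the OpenAI version is on the Responses API; a rough sketch of the shape (the model name is a placeholder, and this is from my reading of the docs, so verify the flags):

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Opt out of server-side storage and ask for the reasoning back, encrypted.
const first = await client.responses.create({
  model: "o4-mini", // placeholder reasoning model
  input: "Plan the refactor.",
  store: false,
  include: ["reasoning.encrypted_content"],
});

// first.output contains opaque reasoning items; persist them yourself and
// feed them back verbatim on the next turn to keep the chain of thought intact.
const second = await client.responses.create({
  model: "o4-mini",
  input: [
    ...first.output, // includes the encrypted reasoning blobs
    { role: "user", content: "Now write the first patch." },
  ],
  store: false,
  include: ["reasoning.encrypted_content"],
});
```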
No, of course it's unrealistic for them to hold the cache indefinitely, and that's not the point. You are keeping the session data yourself so you can continue even after cache expiry. The point I'm making is that it made me very angry that, without any announcement, they changed the behavior to strip the old thinking even when you have it in your session file. There is absolutely no reason not to ask users whether they want this.
And it's part of a larger problem of unannounced changes. It's just like when they introduced adaptive thinking in 4.6 a few weeks ago without notice.
Also, they seem to be completely unaware that some users might only use Claude Code because they are used to it not stripping thinking, in contrast to Codex.
Anyway, I'm happy that they saw it as a valid refund reason.
It seems like an opportunity for a hierarchical cache. Instead of just nuking all context on eviction, couldn’t there be an L2 cache with a longer eviction time so task switching for an hour doesn’t require a full session replay?
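As a toy illustration of the idea, here's a two-tier TTL cache in TypeScript. The names and TTLs are made up, and the real thing would be KV pages migrating between HBM and host RAM or disk rather than values in Maps:

```typescript
type Entry<V> = { value: V; expiresAt: number };

class TieredCache<V> {
  private l1 = new Map<string, Entry<V>>(); // "GPU memory": small, short TTL
  private l2 = new Map<string, Entry<V>>(); // "host RAM / disk": bigger, longer TTL

  constructor(
    private l1TtlMs = 5 * 60_000,  // minutes, like today's prompt caches
    private l2TtlMs = 60 * 60_000, // an hour, covering a task-switch break
  ) {}

  set(key: string, value: V): void {
    const now = Date.now();
    this.l1.set(key, { value, expiresAt: now + this.l1TtlMs });
    this.l2.set(key, { value, expiresAt: now + this.l2TtlMs });
  }

  get(key: string): V | undefined {
    const now = Date.now();
    const hot = this.l1.get(key);
    if (hot && hot.expiresAt > now) return hot.value; // L1 hit: cheapest path
    this.l1.delete(key);

    const warm = this.l2.get(key);
    if (warm && warm.expiresAt > now) {
      // L2 hit: promote back to L1; slower than an L1 hit, far cheaper than replay.
      this.l1.set(key, { value: warm.value, expiresAt: now + this.l1TtlMs });
      return warm.value;
    }
    this.l2.delete(key);
    return undefined; // full miss: caller must replay the whole session
  }
}
```

The catch, as others note below, is that the L2 copy is huge and reloading it to the GPU isn't free; it only wins if the transfer is meaningfully cheaper than a full prefill.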
Living where? If it's in the GPU, then it's still taking up precious space that could be used for serving other sessions. If it's not in the GPU, then it doesn't help.
What matters isn't that it's a cache; what matters is that it's cached _in GPU/NPU memory_, taking up space from another user's active session. Keeping that cache in the GPU is a nonstarter for an oversold product. Even putting it into cold storage means they still have to load it back at the cost of compute, because, again, it takes up space in an oversold product.
The cache is on Anthropic's servers; it's like a freeze-frame of the LLM's inner workings at the time. The LLM can pick up directly from this save state. As you can guess, this save state has bits of the underlying model, their secret sauce, so it cannot be saved locally...
Maybe they could let users store an encrypted copy of the cache? Since the users wouldn't have Anthropic's keys, it wouldn't leak any information about the model (beyond perhaps its number of parameters judging by the size).
I'm unsure of the sizes needed for a prompt cache, but I suspect it's several gigs (some percentage of the model weight size). How would the user upload this every time they resumed an old idle session? And are they going to save /every/ session you do this with?
A few gigs of disk is not that expensive. Imo they should allocate every paying user (at least) one disk cache slot that never expires on a timer. Use it for their most recent long chat (a very short question-and-answer that could easily be replayed shouldn't evict a long convo).
I don't know how large the cache is, but Gemini guessed that the quantized cache size for Gemini 2.5 Pro / Claude 4 with a 1M context could be 78 gigabytes. ChatGPT guessed even bigger numbers. If someone is able to deliver a more precise estimate, you're welcome to :-).
So it would probably be quite a long transfer to perform in these cases; probably not very feasible to implement at scale.
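For anyone who wants to sanity-check those guesses, the back-of-envelope formula is simple. Every architecture number below is invented, since the real configs aren't public:

```typescript
// Per-token KV cache = 2 (K and V) * layers * kvHeads * headDim * bytesPerElem
const layers = 60;       // guess
const kvHeads = 8;       // guess, assuming grouped-query attention
const headDim = 128;     // guess
const bytesPerElem = 1;  // assuming an fp8-quantized cache
const tokens = 1_000_000;

const perTokenBytes = 2 * layers * kvHeads * headDim * bytesPerElem; // 122,880 B/token
const totalGiB = (perTokenBytes * tokens) / 1024 ** 3;
console.log(`${totalGiB.toFixed(0)} GiB for a full 1M-token context`); // ~114 GiB
```

So tens to low hundreds of gigabytes per maxed-out session; the 78 GB guess is at least the right order of magnitude, and so is the conclusion about the transfer.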
What's lost in this thread is that these caches are in very tight supply - they live literally on the GPUs running inference. The GPUs must load all the tokens in the conversation (expensive), and continuing the conversation can then leverage the GPU cache to avoid re-loading the full context up to that point. But obviously GPUs are in super tight supply, so if a thread has been dead for a while, they need to reuse the GPU for other customers.
They could let you nominate an S3 bucket (or Azure/GCP/etc equivalent). Instead of dropping data from the cache, they encrypt it and save it to the bucket; on a cache miss they check the bucket and try to reload from it. You pay for the bucket; you control the expiry time for it; if it costs too much you just turn it off.
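A sketch of what the provider-side half might look like, assuming AES-256-GCM with a provider-held key the user never sees. Everything here is hypothetical; no provider exposes anything like this today:

```typescript
import { S3Client, PutObjectCommand, GetObjectCommand } from "@aws-sdk/client-s3";
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

const s3 = new S3Client({});
const MASTER_KEY = randomBytes(32); // provider-held; in reality from a KMS, never shared

// On eviction: encrypt the KV blob and park it in the user's nominated bucket.
async function offloadCache(bucket: string, sessionId: string, kvBlob: Buffer): Promise<void> {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", MASTER_KEY, iv);
  const body = Buffer.concat([iv, cipher.update(kvBlob), cipher.final(), cipher.getAuthTag()]);
  await s3.send(new PutObjectCommand({ Bucket: bucket, Key: sessionId, Body: body }));
}

// On a cache miss: try the bucket before falling back to a full replay.
async function reloadCache(bucket: string, sessionId: string): Promise<Buffer | null> {
  try {
    const res = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: sessionId }));
    const body = Buffer.from(await res.Body!.transformToByteArray());
    const iv = body.subarray(0, 12);
    const tag = body.subarray(body.length - 16);
    const data = body.subarray(12, body.length - 16);
    const decipher = createDecipheriv("aes-256-gcm", MASTER_KEY, iv);
    decipher.setAuthTag(tag);
    return Buffer.concat([decipher.update(data), decipher.final()]);
  } catch {
    return null; // bucket miss or expired object: full session replay it is
  }
}
```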
Encryption can only ensure the confidentiality of a message from a non-trusted third party, but when that non-trusted third party happens to be your own machine hosting Claude Code, it's pointless. You can always dump the keys that were used to encrypt/decrypt the message from your machine's memory and use them to reconstruct the model weights from the same dump.
jetbalsa said that the cache is on Anthropic's server, so the encryption and decryption would be server-side. You'd never see the encryption key, Anthropic would just give you an encrypted dump of the cache that would otherwise live on its server, and then decrypt with their own key when you replay the copy.
Oh god yes, I've been trying to make an LLM-assisted Magic: The Gathering card scanner... it's been a hell of a time trying to get it to just OCR card names well...
Yep, it's pretty damn good compared to classic OCR, and even compared to the more lightweight models I can run locally. The cards just vary too much over time.
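For reference, the whole "pipeline" is basically one vision call against an OpenAI-compatible endpoint. Rough sketch; the base URL and model name are assumptions, so check Moonshot's docs:

```typescript
import OpenAI from "openai";
import { readFileSync } from "node:fs";

// Kimi's API is OpenAI-compatible; baseURL and model below are assumptions.
const kimi = new OpenAI({
  baseURL: "https://api.moonshot.ai/v1",
  apiKey: process.env.MOONSHOT_API_KEY,
});

const b64 = readFileSync("card.jpg").toString("base64");

const res = await kimi.chat.completions.create({
  model: "kimi-latest", // placeholder
  messages: [
    {
      role: "user",
      content: [
        { type: "image_url", image_url: { url: `data:image/jpeg;base64,${b64}` } },
        { type: "text", text: "Return only the card name printed in the title bar, nothing else." },
      ],
    },
  ],
});

console.log(res.choices[0].message.content?.trim()); // e.g. "Lightning Bolt"
```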
I will note that they really should have used something like ncurses and kept the animations down. TTYs are NOT meant to do the level of crazy modern TUIs are trying to pull off; there are just too many terminal emulators out there that don't like the weird control codes being sent around.
That is why TypeScript is the main one used by most people vibe coding. The LLMs do like to work around the type engine sometimes, but strong typing and linting can help a ton.
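A concrete example of both the escape hatch and the payoff (hypothetical snippet; the rule named in the comment is the standard typescript-eslint one):

```typescript
const raw = '{"name": "Lightning Bolt"}';

// Typical LLM escape hatch: silence the type checker instead of modeling the data.
const loose = JSON.parse(raw) as any; // flagged by @typescript-eslint/no-explicit-any
console.log(loose.nmae);              // typo compiles fine, prints undefined at runtime

// With a declared shape, the same typo dies at compile time.
interface Card { name: string }
const typed = JSON.parse(raw) as Card;
// console.log(typed.nmae); // error TS2339: Property 'nmae' does not exist on type 'Card'
console.log(typed.name);
```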
I am kind of doing that now. I put Kimi K2.5 into a Ralph loop to make a Screeps.com AI. So far it's been awful at it. If you want to track its progress, I have its dashboard at https://balsa.info
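(For anyone unfamiliar, a Ralph loop is nothing fancy: the same prompt fired at the agent forever, with the repo as the only memory between runs. A sketch, with the CLI name and flags as made-up stand-ins:)

```typescript
import { execSync } from "node:child_process";

// Ralph loop: invoke the agent with an identical prompt until you kill the process.
// The repo, its tests, and PROMPT.md carry all state from one iteration to the next.
while (true) {
  try {
    execSync("kimi-agent --prompt-file PROMPT.md --auto-approve", { stdio: "inherit" }); // hypothetical CLI
  } catch {
    // The agent crashed or the run failed; just go again.
  }
}
```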
What were you using it for? Claude is really good at agentic stuff. For pure coding I can see Codex being better, but for the entire workflow, I'm not sure.
I use Codex purely for coding, and that's 90% of my use case for AI in general (10% using ChatGPT web for misc stuff). I pop out to Opus in Claude Code regularly to try to stay up on their relative performance, but so far the primary value I've been able to derive from CC is as a second set of eyes for code review / poking holes in plans. For primary planning / debugging / implementation Codex outclasses it atm sadly.