I'm running some experiments on this, but based on what I've seen in my own data, I don't think this is true:
"given that Opus 4.7 on Low thinking is strictly better than Opus 4.6 on Medium, etc., etc."
Opus 4.7 in general is more expensive for similar usage. Now, we can argue that it provides better performance, all else being equal, but I haven't been able to see that.
Yeah, that's my biggest issue. I'm okay with paying 20-30% more, but what is the ROI? I don't see an equivalent improvement in performance. Anthropic hasn't published any data on what these improvements are, just some vague "better instruction following".
It's enshittifying real fast. They'll just keep releasing model after model, each more expensive than the last, with marginal gains, but touted as "the next thing". Evangelists will say that they're afraid, it's the future, in 6 months it's all over. Anthropic will keep astroturfing on Reddit. CEOs will make even more outlandish claims.
You raised a good point: what's a good metric for LLM performance? Sure, there are all the benchmarks out there, but aren't they one and done, usually at release? What keeps checking the performance of those models afterwards? At this point it's just by feel. People say models have been dumbed down, and that's it.
I think the actual future is open source models. Problem is, they don't have the huge marketing budget Anthropic or OpenAI does.
This is the most likely trajectory, I fear. It reminds me a lot of Oracle, where they rebrand and reskin products just to change pricing/marketing without adding anything.
The other thing is that most people don't really care about price per token or whatever, but about how much it will cost to successfully execute a task they want done.
It doesn't matter if a model is, e.g., 30% cheaper to use than another (token-wise) if I need to burn 2x as many tokens to get the same acceptable result.
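To put rough numbers on that, here's a quick Python sketch of "cost per finished task" instead of "price per token". Everything in it is made up for illustration: the `ModelProfile` fields, the prices, and the token counts are hypothetical, not anyone's real pricing. It also assumes you retry until you get an acceptable result and that attempts are independent, so expected attempts are 1 / success rate.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical numbers for one model on one kind of task."""
    name: str
    price_per_mtok: float      # blended price per million tokens (made up)
    tokens_per_attempt: float  # average tokens burned per attempt (made up)
    success_rate: float        # fraction of attempts giving an acceptable result


def cost_per_successful_task(m: ModelProfile) -> float:
    """Expected spend to get one acceptable result.

    Assumes attempts are independent and you retry until one succeeds,
    so the expected number of attempts is 1 / success_rate.
    """
    if not 0 < m.success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = m.price_per_mtok * m.tokens_per_attempt / 1_000_000
    return cost_per_attempt / m.success_rate


# The scenario from the comment above: 30% cheaper per token, 2x the tokens.
baseline = ModelProfile("baseline", price_per_mtok=10.0,
                        tokens_per_attempt=100_000, success_rate=1.0)
cheaper = ModelProfile("cheaper-per-token", price_per_mtok=7.0,
                       tokens_per_attempt=200_000, success_rate=1.0)

ratio = cost_per_successful_task(cheaper) / cost_per_successful_task(baseline)
print(f"{cheaper.name} costs {ratio:.2f}x the baseline per finished task")
```

With these made-up inputs the "cheaper" model comes out at 1.40x the baseline per finished task, since 0.7 x 2 = 1.4. Lowering `success_rate` for either model shifts the comparison further, which is why per-token price alone doesn't tell you much.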
It was on the higher end of Anthropic's range: closer to 30-40% more tokens.
https://www.claudecodecamp.com/p/i-measured-claude-4-7-s-new...