That's a lame attitude. There are local models that match last year's SOTA, but apparently that's not good enough because this year's SOTA is better still...
I've said it before and I'll say it again, local models are "there" in terms of true productive usage for complex coding tasks. Like, for real, there.
The issue right now is that buying the compute to run the top-end local models is absurdly unaffordable, both in general and because you're outbidding LLM companies for limited hardware resources.
If you have a $10K budget, you can legit run last year's SOTA agentic models locally and do hard things well. But most people don't or won't spend that, nor does it make cost-effective sense vs. currently subsidized API costs.
I completely see your point, but when my developer time is worth what it is compared to the cost of a frontier model subscription, I'm wary of choosing anything but the best model I can. I would love to be able to say I have X technique for compensating for the model shortfall, but my experience so far has been that bigger, later models outperform older, smaller ones. I genuinely hope this changes though. I understand the investment it has taken to get us to this point, but intelligence doesn't seem like something that should be gated.
Right; but every major generation has had diminishing returns over the last. Two years ago the difference between major releases was HUGE, and now we're discussing Opus 4.6 vs. 4.7 and people cannot seem to agree whether it is an improvement or a regression (and even the data in their model card shows regressions).
So my point is: if you have the attitude that unless it is the bleeding edge, it may as well not exist, then local models are never going to be good enough. But the truth is they're now well beyond what they need to be to be huge productivity tools, and they would have been bleeding edge fairly recently.
I feel like I'm going to have to keep trying the next model, for a few cycles yet. My opinion is that Opus 4.7 is performing worse for my current workflow, but 4.6 was a significant step up, and I'd be getting worse results and shipping slower if I'd stuck with 4.5. The providers are always going to swear that the latest is the greatest. Demis Hassabis recently said in an interview that he thinks the better-funded projects will continue to find significant gains through advanced techniques, but that open source models figure out what was changed after about 6 months or so. We'll see, I guess. Don't get me wrong, I'd love to settle down with one model, and I'd love for it to be something I could self-host for free.
> I completely see your point, but when my developer time is worth what it is compared to the cost of a frontier model subscription, I'm wary of choosing anything but the best model I can.
Don't you understand that by choosing the best model we can, we are, collectively, step by step devaluing what our time is worth? Do you really think we can all keep our fancy paychecks while we keep using AI?
Do you think if you or I stopped using AI that everyone else would too? We're still what we always were - problem solvers who have gained the ability to learn and understand systems better than the general population, and to communicate clearly (to humans and now AIs). Unfortunately our knowledge of language APIs and syntax has diminished in value, but we have so many more skills that will be just as valuable as ever. As the amount of software grows, so will the need for people who know how to manage the complexity that comes with it.
> Unfortunately our knowledge of language APIs and syntax has diminished in value, but we have so many more skills that will be just as valuable as ever.
There were always jobs that required those "many more skills" but didn't require any programming skills.
We call those people Business Analysts and you could have been doing it for decades now. You didn't, because those jobs paid half what a decent/average programmer made.
Now you are willingly jumping into that position without realising that the gap between your current pay and that role's value (i.e. half your salary, or less) will eventually close.
First, making sure to offer an upvote here. I happen to be VERY enthusiastic about local models, but I've found them to be incredibly hard to host, incredibly hard to harness, and, despite everything, remarkably powerful if you are willing to suffer really poor token/second performance...
As an American, I always hear all these weird stories about New York and its subway system. All the random busker type nonsense, the petty crime and the “mugger wallet” type jokes. Not to mention the major crimes that make the news.
I’d rather not deal with it? Yes I know roads are dangerous. I’d still rather not deal with the expected culturally imposed insanity that the Japanese curiously seem to lack.
> All the random busker type nonsense, the petty crime and the “mugger wallet” type jokes.
Most of this is stories. Yeah there are buskers but tbh I like buskers. Music in the public square is a plus not a minus even if it's not my personal preference of music.
Subway crime rates are around 2-4 incidents per million rides. There was a spike during covid and it started to rapidly trend down afterwards. That corresponds with economic desperation during that period pretty cleanly.
But that 2-4 incidents per million rides is roughly comparable to the crime rate at gas stations, etc. The difference is density: at a gas station you're less likely to be present when it happens to somebody else, so you witness it less often. It happens just about as frequently; the subway just makes it more visible.
> I’d still rather not deal with the expected culturally imposed insanity that the Japanese curiously seem to lack.
Trust me, Japan has just as much of an issue with crime on rail. Arguably they have higher rates, but the Japanese police often just don't consider sexual harassment or sexual assault a serious crime and would rather brush it under the rug or otherwise deal with it outside the criminal system to avoid harming the abuser. (Ex: an incident I'm familiar with: "oh, we gave the guy who assaulted you on the train your address so he could mail you a hand-written apology note" instead of charging him with assault.)
And the "wacky in your face" crime (intoxicated, mental illness, etc) is still very much an issue in Japan but it's cracked down on by police in places that tourists frequently visit during the day and otherwise everyone just expects it so people who live there don't really mention it to tourists.
I mean hell look at Shibuya Meltdown for some of the more mild "funny" examples.
The only real difference between the NYC metro and the Japanese metro is that NYC is louder, because there's no social norm against talking on the train (until people are drunk, ofc). Otherwise it's all the same shit, and you see it all once you start commuting.
The weird stories, about anything, are nonsense: sensationalized to be emotionally compelling, or outright disinformation serving some political end (especially about American cities, and especially about NYC).
It's just induced fear. Just go to NY and ride the subway. Millions do all the time without any problems, without a second thought. It's really no problem and amazingly convenient. (Busking is people playing music.)
Of course some crime occurs among millions of people, but so do lottery grand prizes and heart attacks. I've been on many subway rides (and much other public transit) without experiencing a single crime or even witnessing one.
And when you do, you'll know what to think of the stories and people who tell them.
> It's extremely common for there to be human shit in the train cars, and lunatics going nuts
Where does that come from? Not from your experience. You've never been on NY subways, clearly.
I've never seen feces - and anyway, how could you tell if it's from a dog? Did you examine it? Take it home and test it? It's one of the stories that maybe is slightly plausible, and which yields such strong disgust that rationality is overwhelmed and it makes a sensation - perfectly constructed misinformation or urban myth. Like waking up in a bathtub with a kidney missing.
'Lunatics' is such a loaded (and hateful) word you'll have to specify what you mean, but the occasional person talking to themself is harmless and completely uninterested in you (thus the conversation with themself) - I have never had any problem with such people on public transit or elsewhere. They are the most vulnerable people and compassion is the appropriate response.
As I wrote above, the stories are nonsense and it's induced fear.
I actually am speaking from experience; I saw both of those things my first week in New York. It's really not uncommon. I find it hard to believe that you've never run into shit/barf: usually when a car pulls up with nobody in it, that's what's in there.
And this is all to say nothing about the decrepit state of the stations and cars themselves.
I've also been to Japan and experienced their trains. It's in such a different league that it's almost comedy.
> It's really not uncommon. I find it hard to believe that you've never run into shit/barf: usually when a car pulls up with nobody in it, that's what's in there.
NGL, this isn't surprising on Japanese trains either, especially around the last train. It's not super common, but you see it from time to time; you just use a different car and report it the next time you see a staff member.
Yep, and I just made a recommendation that was essentially "never enable Opus 4.7" to my org as a direct result. We have Opus 4.6 (3x) and Opus 4.5 (3x) enabled currently. They are worth it for planning.
At 7.5x for 4.7, heck no. It isn't even clear it is an upgrade over Opus 4.6.
> Over the coming weeks, Opus 4.7 will replace Opus 4.5 and Opus 4.6 in the model picker for Copilot Pro+.
> This model is launching with a 7.5× premium request multiplier as part of promotional pricing until April 30th
TBF, it's a rumour that they are switching to per-token pricing in May, but it's from an insider (apparently), and seeing how good a deal the current per-request pricing is, everyone expects them to bump prices or switch to per-token pricing sometime soon.
The per-request pricing is ridiculous (in a good way, for the user). You can get so much done on a single prompt if you build the right workflow. I'm sure they'll change it soon
Yeah, it seems insane that it's priced this way to me too. Using Sonnet/Opus through a ~$40 a month Copilot plan gives me at least an order of magnitude more usage than a ~$40 a month Claude Code plan (the usage limits on the latter are so low that it's effectively not a viable choice, at least for my use cases).
The models are limited to a 160k-token context length, but in practice that's not a big deal.
Unless MS has a very favourable contract with Anthropic, or they're running the models on their own hardware, there's no way they're making money on this.
Yeah, you can even write your own harness that spawns subagents for free, and get essentially free Opus calls too. Insane value; I'm not at all surprised they're making changes. Oh well. It was a pain in the ass to use Copilot since it had a slightly different protocol and OAuth, so it wasn't supported in a lot of tools. Now I'll probably go with Ollama cloud, which is supported by pretty much everything.
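For the curious, the subagent trick is roughly this shape (a sketch against a generic OpenAI-compatible endpoint; the base URL and model name are placeholders, not Copilot's actual values):

```python
# Toy harness that fans one task out to parallel subagents.
# Endpoint and model name are placeholders for whatever backend you point at.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")  # placeholder

def subagent(subtask: str) -> str:
    resp = client.chat.completions.create(
        model="some-local-model",  # placeholder
        messages=[{"role": "user", "content": subtask}],
    )
    return resp.choices[0].message.content

subtasks = ["Review module A for race conditions", "Review module B for race conditions"]
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(subagent, subtasks):
        print(result)
```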
Manage the budget, not the implementation. Top-down decisions like "use a cheap model" risk optimizing for the wrong things. If we lose a 90% cache-hit rate on the expensive model by context-switching to a cheap one, there are no savings. Set the budget, let the devs optimize.
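To make the cache point concrete (toy numbers only, not any vendor's real rates):

```python
# Why dropping a warm cache can erase the "cheap model" savings.
# All prices below are hypothetical, not real vendor rates.
EXPENSIVE = 3.00       # $/M input tokens, big model (hypothetical)
CHEAP = 0.80           # $/M input tokens, small model (hypothetical)
CACHED_FRAC = 0.90     # fraction of input served from prompt cache
CACHE_PRICE = 0.10     # cached reads billed at 10% of the input rate

tokens_m = 10  # 10M input tokens over a long agentic task

big_warm = tokens_m * EXPENSIVE * (CACHED_FRAC * CACHE_PRICE + (1 - CACHED_FRAC))
small_cold = tokens_m * CHEAP  # switching models means no cache hits

print(f"big model, warm cache: ${big_warm:.2f}")    # $5.70
print(f"small model, cold:     ${small_cold:.2f}")  # $8.00
```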
I don't know how you're all not seeing 4.7 as an upgrade; it just does so much more, so much better. I guess lower-complexity tasks are saturated, though.
Thanks, interesting. Does this make it more surprising that the other benchmarks have improved? I'm not sure I understand the benchmarks well enough, but I'm wondering whether, with agentic workflows, it's possible to get away with a smaller, more focussed context (and hence lower cost) whilst achieving the same or better performance, because of agentic models' ability to decide what to put in context as they work.
People have tried to run Qwen3-235B-A22B-Thinking-2507 on 4x $600 used Nvidia 3090s with 24 GB of VRAM each (96 GB total), and while it runs, it is too slow for production-grade use (<8 tokens/second). So we're already at $2,400 before you've purchased system memory and CPU, and it is still too slow for a "Sonnet equivalent" setup...
You can quantize it, of course, but if the idea is "as close to Sonnet as possible," then while quantized models are objectively more efficient, they are sacrificing precision for it.
So the next step is to up that speed: 4x $1,300 Nvidia 5090s with 32 GB of VRAM each (128 GB total), or $5,200 before RAM/CPU/etc. All of this additional cost just to increase your tokens/second without lobotomizing the model, and it still may not be enough.
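To put rough numbers on why the hardware keeps escalating (a minimal sketch; the bits-per-weight values are approximations for common GGUF quant levels, and KV cache/activation overhead is ignored):

```python
# Back-of-envelope weight-memory math for a 235B-parameter model.
# Bits-per-weight are rough approximations; KV cache and overhead are ignored.

TOTAL_PARAMS_B = 235  # Qwen3-235B-A22B total parameter count, in billions

def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GB."""
    return params_b * bits_per_weight / 8  # billions of params * bytes each

for name, bits in [("FP16", 16.0), ("Q8-ish", 8.5), ("Q4-ish", 4.8)]:
    print(f"{name:>7}: ~{weights_gb(TOTAL_PARAMS_B, bits):.0f} GB")

# FP16  : ~470 GB -> nowhere near 4x 3090 (96 GB) or 4x 5090 (128 GB)
# Q8-ish: ~250 GB -> still spills heavily into (slow) system RAM
# Q4-ish: ~141 GB -> just over 128 GB, hence offloading and <8 tok/s
```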
I guess my point is: You see this conversation a LOT online. "Qwen3 can be near Sonnet!" but then when asked how, instead of giving you an answer for the true "near Sonnet" model per benchmarks, they suddenly start talking about a substantially inferior Qwen3 model that is cheap to run at home (e.g. 27B/30B quantized down to Q4/Q5).
The local models absolutely DO exist that are "near Sonnet." The hardware to actually run them is the bottleneck, and it is a HUGE financial/practical bottleneck. If you had a $10K all-in budget, it isn't actually insane for this class of model, and the sky really is the limit (again to reduce quantization and or increase tokens/second).
PS - And electricity costs are non-trivial for 4x 3090s or 4x 5090s.
Qwen3.5-35B-A3B is reported to perform slightly better than the model you mentioned.
It runs fine, though non-optimally, on a single 3090 with even 131,072 tokens of context, and due to the hybrid attention architecture, memory usage and compute scale rather less drastically than ctx^2. I've had friends with smaller cards still getting work out of it. Generation is around 20 tokens/sec on that 3090 (without doing anything special yet). You'll need enough DRAM to hold the bits of the model that don't fit. Nothing to write home about, but genuinely usable in a pinch or for tasks that don't need immediate interactivity.
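For anyone wanting to try this, a minimal sketch using llama-cpp-python; the GGUF filename is a placeholder for whatever quant you actually download, and n_gpu_layers needs tuning to your card:

```python
# Minimal local-inference sketch via llama-cpp-python.
# Filename is a placeholder; n_gpu_layers should be tuned to your VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.5-35B-A3B-Q4_K_M.gguf",  # placeholder filename
    n_ctx=131072,     # full context; lower this if memory gets tight
    n_gpu_layers=40,  # layers offloaded to GPU; the rest live in DRAM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize this repo's build steps."}],
)
print(out["choices"][0]["message"]["content"])
```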
It's the first local model that passes my personal kimbench usability benchmark, at least. Just be aware that it is extremely verbose in thinking mode. Seems to be a Qwen thing.
(edit: On rechecking my numbers; I now realize I can possibly optimize this a lot better)
With respect, this isn't "new data"; it's an anecdote. And it kind of represents exactly the problem I was talking about above:
- Qwen is near Sonnet 4.5!
- How do I run that?
- [Starts talking about something inferior that isn't near Sonnet 4.5].
It is this strange bait-and-switch discussion that happens over and over. Not least because Sonnet has a 200K context window, and most of these anecdotes aren't for anywhere near that context size.
You're not wrong; but... imho it's closer to Sonnet 4.0 [1] on my personal benchmark [2]. And I HAVE run it at just over 200K-token context: it works, it's just a bit slow at that size. It's not great, but... usable to me? I used Sonnet 4.0 over API for half a year or so before, after all.
Only way to know if your own criteria are now matched -or not yet- is to test it for yourself with your own benchmark or what have you.
And it does show a promising direction going forward: usable (to some) local models becoming efficient enough to run on consumer hardware.
[1] released mid-2025
[2] take with salt - only tests personal usability
+ Note that some benchmarks do show Qwen3.5-35B-A3B matching Sonnet 4.5 (released later last year); but I treat those with the same skepticism you do, clearly ;)
> The hardware to actually run them is the bottleneck, and it is a HUGE financial/practical bottleneck.
That's unsurprising, seeing as inference for agentic coding is extremely context- and token-intensive compared to general chat. Especially if you want it to be fast enough for a real-time response, as opposed to just running coding tasks overnight in a batch and checking the results as they arrive. Maybe we should go back to viewing "coding" as a batch task, where you submit a "job" to be queued for the big iron and wait for the results.
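A minimal sketch of what that batch workflow could look like (run_coding_task here is a placeholder for whatever agent or API call you'd actually use):

```python
# Queue coding tasks in the evening, drain them overnight, review at breakfast.
# run_coding_task is a placeholder for your actual agent or API call.
import json
import queue

jobs = queue.Queue()
for task in ["Fix the flaky test in ci.py", "Add pagination to /users"]:
    jobs.put(task)

def run_coding_task(task: str) -> str:
    return f"[placeholder diff for: {task}]"  # stand-in for the real call

results = []
while not jobs.empty():
    task = jobs.get()
    results.append({"task": task, "output": run_coding_task(task)})

with open("overnight_results.json", "w") as f:
    json.dump(results, f, indent=2)  # inspect these in the morning
```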
You can still use OpenClaw on their API pricing tier as much as you want. What they did is not allow subscriptions to be used to power automated third-party workloads, including OpenClaw.
Now, is their messaging around this confusing? Absolutely. The whole thing has been handled shambolically. Everyone knows that they lack the compute to keep up, and likely have lower margins on subscriptions than API; but they cannot just say that because investors may be skittish.
Because you'll slowly start building the individual pieces of a database on top of the file system, until you've just recreated a database. Databases didn't spawn out of nothing: people were writing to raw files on disk and kept solving the same issues over and over (data definitions, indexes, relations, cache/memory management, locks, et al.).
So your question is really: why does the industry focus on reusable solutions to hard problems, rather than piecemeal recreating them in every project? And when phrased that way, the answer is self-evident: productivity/cost/ease.
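To illustrate the trajectory (purely illustrative, names made up): day one you write files; within months, every function below has quietly grown a database feature:

```python
# Day-one "just use the filesystem" storage; every TODO below is a
# database feature you'll end up rebuilding by hand. Illustrative only.
import json
import os

DATA_DIR = "records"

def put(record_id: str, record: dict) -> None:
    os.makedirs(DATA_DIR, exist_ok=True)
    path = os.path.join(DATA_DIR, f"{record_id}.json")
    with open(path, "w") as f:
        json.dump(record, f)  # TODO: atomic writes, locks for concurrent writers

def find_by_email(email: str) -> list:
    # Full scan of every file. TODO: an index file... then index maintenance,
    # then relations between record types, then a cache to avoid re-reading.
    matches = []
    for name in os.listdir(DATA_DIR):
        with open(os.path.join(DATA_DIR, name)) as f:
            rec = json.load(f)
        if rec.get("email") == email:
            matches.append(rec)
    return matches
```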
Could you go into more detail about why their "harness sucks"? This feels like a shared conclusion, but I've used several, and theirs is better than many.
I generally agree that the harness isn't good, but it works and gets the job done and that seems to be the singular goal of the top 4 or 5 companies building them.
We saw what Claude Code looks like inside, and it's objectively bad-to-mediocre work, but the takeaway seemed to be 'yeah but it works and they've got crazy revenue'.
That's where we're at. The harness is kind of buggy. The LLM still wanders and cycles in it sometimes. It's a monolithic LLM herding machine. The underlying model is awesome and the harness works well enough to make it super effective.
We can do so much better but we could also do worse. It's a turbulent time. I'm not super pleased with it all the time, but it's hard to criticize in many ways. They're doing a good job under the circumstances.
I see it kind of like they're at war. If they slow down to perfect anything, they will begin to lose battles, and they will lose ground. It's a highly contentious space. The harness isn't as good as it could be under better circumstances, but it's arguably a trade-off Anthropic needs to make.
I'd been using OpenCode until yesterday (with some plugin to let me use their model, until they implemented what seems to be very sophisticated detection to reject you).
It just has a sane workflow: it's easy to use and doesn't bother you with 1000 questions about whether to allow this or that to run. Since yesterday, now that I have to use Claude Code, it generally feels like the model is dumber and makes more mistakes.
Then you can publish the public Code Signing certificate for download/import or publish it through WinGet.
Using Azure Trusted Signing or any other certificate vendor does not guarantee that a binary is 100% trustworthy; it just means someone put their name on it.