That's like saying 'cars were better made in the 1950s because they used tons of steel'. Like they were 'heavier and more robust' - but that doesn't mean better.
Foundations are way better, more robust, especially weatherized. Windows today are like magic compared to windows 100 years ago.
What we do more poorly now is that we don't use solid wood everywhere (e.g. doors), and certain kinds of workmanship are rarer - winding staircases, mouldings - but you can easily have that if you want to pay for it. That's a choice.
AI is power and leverage, it will make better things as long as it's directed by skilled operators.
I read that as "it's not worth the negative PR of being associated with AI firing minimum wage employees" compared to just paying them for a year or two.
Each European country has various orgs and types of license. It is easier to start with the SEC to integrate that data into my site; also, you have to test a bit what users care about.
Wondering aloud -- this is clearly PII, but it's public information. The site would be subject to GDPR, and other rules from the EU, and folks may want to have their data hidden or removed. What would be the exposure for sourcing EU data?
People have tried to run Qwen3-235B-A22B-Thinking-2507 on 4x $600 used Nvidia 3090s with 24 GB of VRAM each (96 GB total), and while it runs, it is too slow for production-grade use (<8 tokens/second). So we're already at $2,400 before you've purchased system memory and a CPU, and it is still too slow for a "Sonnet equivalent" setup...
You can quantize it, of course, but if the idea is "as close to Sonnet as possible," then while quantized models are objectively more efficient, they sacrifice precision for it.
So the next step is to up that speed: 4x $1,300 Nvidia 5090s with 32 GB of VRAM each (128 GB total), or $5,200 before RAM/CPU/etc. All of this additional cost just to increase your tokens/second without lobotomizing the model. And it still may not be enough.
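To make the cost/VRAM arithmetic above explicit, here's a trivial sketch (card prices are the used-market/street figures quoted above, not guarantees):

```python
# GPU-only cost and pooled VRAM for the two 4-card rigs discussed above.
def rig(cards, price_usd, vram_gb):
    """Return (total price, total VRAM) for a multi-GPU setup."""
    return cards * price_usd, cards * vram_gb

rig_3090 = rig(4, 600, 24)   # used 3090s
rig_5090 = rig(4, 1300, 32)  # 5090s

print(rig_3090)  # (2400, 96)  -> $2,400 for 96 GB of VRAM
print(rig_5090)  # (5200, 128) -> $5,200 for 128 GB of VRAM
```

And that's before system RAM, CPU, PSU, and the rest of the build.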
I guess my point is: You see this conversation a LOT online. "Qwen3 can be near Sonnet!" but then when asked how, instead of giving you an answer for the true "near Sonnet" model per benchmarks, they suddenly start talking about a substantially inferior Qwen3 model that is cheap to run at home (e.g. 27B/30B quantized down to Q4/Q5).
The local models absolutely DO exist that are "near Sonnet." The hardware to actually run them is the bottleneck, and it is a HUGE financial/practical bottleneck. A $10K all-in budget isn't actually insane for this class of model, and the sky really is the limit (again, to reduce quantization and/or increase tokens/second).
PS - And electricity costs are non-trivial for 4x 3090s or 4x 5090s.
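For a rough sense of what "non-trivial" means, a back-of-the-envelope sketch (the wattage, duty cycle, and electricity price here are all assumptions, not measurements):

```python
# Hypothetical figures: ~350 W per 3090 under load, 8 h/day, $0.15/kWh.
cards, watts_per_card = 4, 350
hours_per_day, usd_per_kwh = 8, 0.15

kwh_per_day = cards * watts_per_card * hours_per_day / 1000  # 11.2 kWh/day
usd_per_month = kwh_per_day * usd_per_kwh * 30               # ~$50/month

print(f"{kwh_per_day:.1f} kWh/day, ~${usd_per_month:.0f}/month")
```

5090s draw more per card, and a rig serving requests around the clock would multiply this further.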
Qwen3.5-35B-A3B is reported to perform slightly better than the model you mentioned.
It runs fine but non-optimally on a single 3090, even with 131,072 tokens of context, and due to the hybrid attention architecture, the memory usage and compute scale much less drastically than ctx^2. I've had friends with smaller cards still getting work out of it. Generation is around 20 tokens/sec on that 3090 (without doing anything special yet). You'll need enough DRAM to hold the parts of the model that don't fit in VRAM. Nothing to write home about, but genuinely usable in a pinch or for tasks that don't need immediate interactivity.
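The "less drastically than ctx^2" point is mostly about the KV cache, which for plain attention grows linearly with context length (the quadratic part is compute, not memory for weights). A minimal sketch with made-up architecture numbers (the layer/head counts below are placeholders, not Qwen3.5's actual config):

```python
def kv_cache_bytes(ctx_len, n_layers=48, n_kv_heads=4, head_dim=128, bytes_per=2):
    """Bytes for an fp16 K+V cache; all architecture numbers are hypothetical."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per

# Linear in ctx_len: doubling the context doubles the cache size.
assert kv_cache_bytes(131072) == 2 * kv_cache_bytes(65536)
print(kv_cache_bytes(131072) / 2**30)  # 12.0 GiB with these placeholder numbers
```

Hybrid attention layers (sliding-window/linear) shrink this further, which is part of why long contexts fit on a 24 GB card at all.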
It's the first local model that passes my personal kimbench usability benchmark at least. Just be aware that it is extremely verbose in thinking mode. Seems to be a qwen thing.
(edit: On rechecking my numbers; I now realize I can possibly optimize this a lot better)
With respect, this isn't "new data"; it is an anecdote. And it kind of represents exactly the problem I was talking about above:
- Qwen is near Sonnet 4.5!
- How do I run that?
- [Starts talking about something inferior that isn't near Sonnet 4.5].
It is this strange bait/switch discussion that happens over and over. Not least because Sonnet has a 200K context window, and most of these anecdotes aren't for anywhere near that context size.
You're not wrong; but... imho it's closer to Sonnet 4.0 [1] on my personal benchmark [2]. And I HAVE run it at just over 200K tokens of context: it works, it's just a bit slow at that size. It's not great, but... usable to me? I used Sonnet 4.0 over the API for half a year or so before, after all.
The only way to know whether your own criteria are matched - or not yet - is to test it for yourself with your own benchmark or what have you.
And it does show a promising direction going forward: usable (to some) local models becoming efficient enough to run on consumer hardware.
[1] released mid-2025
[2] take with salt - only tests personal usability
+ Note that some benchmarks do show Qwen3.5-35B-A3B matching Sonnet 4.5 (released later last year); but I treat those with the same skepticism you do, clearly ;)
> The hardware to actually run them is the bottleneck, and it is a HUGE financial/practical bottleneck.
That's unsurprising, seeing as inference for agentic coding is extremely context- and token-intensive compared to general chat. Especially if you want it to be fast enough for a real-time response, as opposed to just running coding tasks overnight in a batch and checking the results as they arrive. Maybe we should go back to viewing "coding" as a batch task, where you submit a "job" to be queued for the big iron and wait for the results.
A machine with 128GB of unified system RAM will run reasonable-fidelity quantizations (4-bit or more).
If you ever want to answer this type of question yourself, you can look at the size of the model files. Loading a model usually uses an amount of RAM around the size it occupies on disk, plus a few gigabytes for the context window.
Qwen3.5-122B-A10B is 120GB. Quantized to 4 bits it is ~70GB. You can run a 70GB model in 80GB of VRAM or 128GB of unified normal RAM.
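The rule of thumb from the parent comment (RAM needed ≈ size on disk, and bits-per-parameter drive the size on disk) can be written down directly; the ~10% overhead factor for quantization scales/metadata is my own rough assumption:

```python
def quantized_size_gb(n_params_billion, bits, overhead=1.10):
    """Approximate on-disk size: params * bits/8, plus ~10% for scales/metadata."""
    return n_params_billion * bits / 8 * overhead

print(round(quantized_size_gb(122, 4)))  # ~67 GB -- close to the ~70 GB quoted
print(round(quantized_size_gb(122, 8)))  # ~134 GB at 8 bits
```

Add a few GB on top for the context window, per the rule above, and you land right around the 80 GB VRAM / 128 GB unified RAM figures.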
Systems with that capability cost a few thousand USD to purchase new.
If you are willing to sacrifice some performance, you can take advantage of the model being a mixture-of-experts and use disk space to get by with less RAM/VRAM, but inference speed will suffer.
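Why the speed suffers: an MoE only touches its active parameters per token (the "A10B" in the name), but if those bytes have to come off disk instead of RAM, read bandwidth caps your token rate. A worst-case sketch (the SSD bandwidth and cache behavior are assumptions):

```python
active_params_billion = 10     # "A10B": ~10B active parameters per token
bits = 4                       # 4-bit quantization
ssd_gb_per_s = 2.0             # hypothetical sustained SSD read speed

gb_touched_per_token = active_params_billion * bits / 8  # ~5 GB per token
worst_case_s_per_token = gb_touched_per_token / ssd_gb_per_s

print(worst_case_s_per_token)  # 2.5 s/token if every expert misses the cache
```

In practice frequently-used experts stay cached in RAM, so real throughput sits well above this floor, but it shows why disk offload is a last resort.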