Technically, I think this is meant to develop Coalton, which is also statically typed and incredibly effective as a language for agents. All those ergonomic benefits that humans enjoy also allow AIs to develop lisp systems quite rapidly and robustly.
At the core, they're really very simple [1]. Run LLM API calls in a loop with some tools.
From there, you can get much fancier with any aspect of it that interests you. Here's one in Bash [2] that is fully extensible at runtime through dynamic discovery of plugins/hooks.
Inference, in and of itself, can't be completely unprofitable. Unless you're purely talking about Anthropic?
But
> If you want LLMs to continue to be offered we have to get to a point where the providers are taking in more money than they are spending hosting them
Suggests you just mean in general, as a category, every provider is taking a loss. That seems implausible. Every provider on OpenRouter is giving away inference at a loss? For what purpose?
For the same reason that Amazon operated at a loss for two decades and Uber operated at a loss for a decade and a half. The problem is the free money hose isn't running anymore.
A week of downtime every decade I think still works out to a higher uptime than I've been getting from parts of GitHub lately. So I'd consider that a win.
Where did you see a haiku comparison? Haiku 4.5 was my daily driver for a month or so before Opus 4.5 dropped and would be unreasonably happy if a local model can give me similar capability
These are of course all public benchmarks though - I'd expect there to be some memorization/overfitting happening. The proprietary models usually have a bit of an advantage in real-world tasks in my experience.
Artificial Analysis hasn't posted their independent analysis of Qwen3.6 35B A3B yet, but Alibaba's benchmarks paint it as being on par with Qwen3.5 27B (or better in some cases).
Even Qwen3.5 35B A3B benchmarks roughly on par with Haiku 4.5, so Qwen3.6 should be a noticeable step up.
No, these benchmarks are not perfect, but short of trying it yourself, this is the best we've got.
Compared to the frontier coding models like Opus 4.7 and GPT 5.4, Qwen3.6 35B A3B is not going to feel smart at all, but for something that can run quickly at home... it is impressive how far this stuff has come.
No… seriously. Every model release is accused. Including Opus, GPT-5.4, whatever. And yes, including smaller models that are not the top in every benchmark.
I would almost be tempted to call it benchmaxed if that term weren’t such a joke at this point. It is a deeply unserious term these days.
Gemma 4 is worse than its benchmarks show in terms of agentic workflows. The Qwen3.x models are much better; not benchmaxed. I have tested this extensively for my own workflows. Google really needs to release Gemma 4.1 ASAP. I really hope they’re not planning to just wait another calendar year like they did for Gemma 3 -> 4 with no intermediate updates.
Yea I have to try it and see. When Claude released Remote Control I hoped on right away and it was crap, it kept disconnecting. Tailscale + SSH was much better.
It only takes 20 minutes and $200 to hack a closed source one too though. LLMs are ludicrously good at using reverse engineering tools and having source available to inspect just makes it slightly more convenient.
Very true, but that is still a meaningfully higher cost at scale. If, as people are postulating post-Mythos, security comes down to which side spends more tokens, it is a valid strategy to impose asymmetric costs on the attacker.
I hate the feeling of playing roulette with my account every time I use their tools.
Since they refuse to actually provide definitive rules or policies, I have fully moved off their models and actively encourage all the other devs I know to do the same.
reply