More

wild_egg · 2026-04-26T01:53:06 1777168386

Technically, I think this is meant to develop Coalton, which is also statically typed and incredibly effective as a language for agents. All those ergonomic benefits that humans enjoy also allow AIs to develop lisp systems quite rapidly and robustly.

wild_egg · 2026-04-23T19:51:36 1776973896

At the core, they're really very simple [1]. Run LLM API calls in a loop with some tools.

From there, you can get much fancier with any aspect of it that interests you. Here's one in Bash [2] that is fully extensible at runtime through dynamic discovery of plugins/hooks.

[1] https://ampcode.com/notes/how-to-build-an-agent

[2] https://github.com/wedow/harness

wild_egg · 2026-04-21T13:17:56 1776777476

Inference, in and of itself, can't be completely unprofitable. Unless you're purely talking about Anthropic?

But

> If you want LLMs to continue to be offered we have to get to a point where the providers are taking in more money than they are spending hosting them

Suggests you just mean in general, as a category, every provider is taking a loss. That seems implausible. Every provider on OpenRouter is giving away inference at a loss? For what purpose?

bandrami · 2026-04-22T02:25:51 1776824751

For the same reason that Amazon operated at a loss for two decades and Uber operated at a loss for a decade and a half. The problem is the free money hose isn't running anymore.

wild_egg · 2026-04-18T17:39:27 1776533967

A week of downtime every decade I think still works out to a higher uptime than I've been getting from parts of GitHub lately. So I'd consider that a win.

wild_egg · 2026-04-16T15:03:19 1776351799

Where did you see a haiku comparison? Haiku 4.5 was my daily driver for a month or so before Opus 4.5 dropped and would be unreasonably happy if a local model can give me similar capability

daemonologist · 2026-04-16T15:39:13 1776353953

I didn't see a direct comparison, but there's some overlap in the published benchmarks:

                           │ Qwen 3.6 35B-A3B │ Haiku 4.5               
   ────────────────────────┼──────────────────┼──────────────────────── 
    SWE-Bench Verified     │ 73.4             │ 66.6                    
   ────────────────────────┼──────────────────┼──────────────────────── 
    SWE-Bench Multilingual │ 67.2             │ 64.7                    
   ────────────────────────┼──────────────────┼──────────────────────── 
    SWE-Bench Pro          │ 49.5             │ 39.45                   
   ────────────────────────┼──────────────────┼──────────────────────── 
    Terminal Bench 2.0     │ 51.5             │ 61.2 (Warp), 27.5 (CC)  
   ────────────────────────┼──────────────────┼──────────────────────── 
    LiveCodeBench          │ 80.4             │ 41.92

These are of course all public benchmarks though - I'd expect there to be some memorization/overfitting happening. The proprietary models usually have a bit of an advantage in real-world tasks in my experience.

coder543 · 2026-04-16T15:34:27 1776353667

Artificial Analysis hasn't posted their independent analysis of Qwen3.6 35B A3B yet, but Alibaba's benchmarks paint it as being on par with Qwen3.5 27B (or better in some cases).

Even Qwen3.5 35B A3B benchmarks roughly on par with Haiku 4.5, so Qwen3.6 should be a noticeable step up.

https://artificialanalysis.ai/models?models=gpt-oss-120b%2Cg...

No, these benchmarks are not perfect, but short of trying it yourself, this is the best we've got.

Compared to the frontier coding models like Opus 4.7 and GPT 5.4, Qwen3.6 35B A3B is not going to feel smart at all, but for something that can run quickly at home... it is impressive how far this stuff has come.

naasking · 2026-04-17T04:01:10 1776398470

Qwen models commonly get accused of benchmaxxing though. Just something to keep in mind when weighing the standard benchmarks.

coder543 · 2026-04-17T11:41:18 1776426078

Every model release gets accused of that, including the flagship models.

naasking · 2026-04-17T12:36:48 1776429408

Less so for Gemma-4 because it falls behind Qwen on benchmarks. Tests for benchmaxxing are also strongly suggestive: https://x.com/bnjmn_marie/status/2041540879165403527

coder543 · 2026-04-17T12:40:31 1776429631

No… seriously. Every model release is accused. Including Opus, GPT-5.4, whatever. And yes, including smaller models that are not the top in every benchmark.

My own experiences with Gemma 4 have been quite mediocre: https://www.reddit.com/r/LocalLLaMA/comments/1sn3izh/comment...

I would almost be tempted to call it benchmaxed if that term weren’t such a joke at this point. It is a deeply unserious term these days.

Gemma 4 is worse than its benchmarks show in terms of agentic workflows. The Qwen3.x models are much better; not benchmaxed. I have tested this extensively for my own workflows. Google really needs to release Gemma 4.1 ASAP. I really hope they’re not planning to just wait another calendar year like they did for Gemma 3 -> 4 with no intermediate updates.

And the lead author on the paper replied to that tweet to say that the scores would need to be greater than 80 to show actual contamination: https://x.com/MiZawalski/status/2043990236317851944?s=20

deaux · 2026-04-16T18:10:58 1776363058

I find Gemma 4 26B A4B better than Haiku 4.5 and that's smaller than this one.

wild_egg · 2026-04-16T01:05:56 1776301556

That's really not remotely the same thing

ratsimihah · 2026-04-16T11:39:32 1776339572

Yea I have to try it and see. When Claude released Remote Control I hoped on right away and it was crap, it kept disconnecting. Tailscale + SSH was much better.

wild_egg · 2026-04-15T15:51:54 1776268314

Haven't the SQLite tests always been closed? Getting access to them is a major reason for financially supporting them

wild_egg · 2026-04-15T15:50:06 1776268206

It only takes 20 minutes and $200 to hack a closed source one too though. LLMs are ludicrously good at using reverse engineering tools and having source available to inspect just makes it slightly more convenient.

keeda · 2026-04-15T18:13:20 1776276800

Very true, but that is still a meaningfully higher cost at scale. If, as people are postulating post-Mythos, security comes down to which side spends more tokens, it is a valid strategy to impose asymmetric costs on the attacker.

NetMageSCW · 2026-04-16T19:14:40 1776366880

A little harder when you don’t have the source or the binaries.

wild_egg · 2026-04-15T15:47:29 1776268049

That's exactly the message I got from the video

wild_egg · 2026-04-15T00:07:30 1776211650

I hate the feeling of playing roulette with my account every time I use their tools.

Since they refuse to actually provide definitive rules or policies, I have fully moved off their models and actively encourage all the other devs I know to do the same.