Hacker News | lambda's comments

This is my biggest problem with the promises of agentic coding (well, there are an awful lot of problems, but this is the biggest one from an immediate practical perspective).

On the one hand, reviewing and micromanaging everything it does is tedious and unrewarding. Unlike reviewing a colleague's code, you're never going to teach it anything; maybe you'll get some skills out of it if you find something that comes up often enough that it's worth writing a skill for. And this only gets you, at best, a slight speedup over writing it yourself, as you have to stay engaged and think about everything that's going on.

Or you can just let it grind away agentically and only test the final output. This gets you those huge gains at first, but it can easily start accumulating more and more cruft, bad design decisions, and hacks on top of hacks. And you increasingly don't know what it's doing or why; you're losing even the ability to figure it out, because you're not exercising that skill.

You're just building yourself a huge pile of technical debt. You might delete your prod database without realizing it. You might end up with an auth system that doesn't actually check anything, so someone can just set an admin username in a cookie to log in. Or whatever; you have no idea, and even if the model gets it right 95% of the time, do you want to be periodically rolling a d20 and losing everything when you roll a 1?
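
To make that cookie example concrete, here's a hypothetical sketch of the kind of "auth" that can slip through when nobody reads the code; the route and cookie name are made up, but the shape of the bug is the point:

    # Hypothetical sketch of the broken pattern: trusting a client-set cookie.
    from flask import Flask, request, abort

    app = Flask(__name__)

    @app.route("/admin")
    def admin_panel():
        # Looks like an auth check, but the client controls this cookie,
        # so anyone can send "username=admin" and walk right in.
        username = request.cookies.get("username", "")
        if username != "admin":
            abort(403)
        return "secret admin stuff"

    # A real check would verify something the server controls, e.g. a signed
    # session or a token looked up server-side, not a bare username.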


That sounds high for a Strix Halo with a dense 27B model. Are you talking about prompt eval (prefill, which can happen in parallel) or generation (decode) when you quote tokens per second? Usually if people quote only one number they're quoting generation speed, and I would be surprised if you got that for generation on a Strix Halo.
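
For what it's worth, llama.cpp's server reports both numbers separately in its timings, so it's easy to quote each. A rough sketch in Python; I'm going from memory on the endpoint and field names, so double-check them against your version:

    # Rough sketch: ask a local llama.cpp server for both speeds.
    # Assumes the default server on localhost:8080; the "timings" field names
    # are as I remember them from llama.cpp, verify against your build.
    import requests

    resp = requests.post(
        "http://localhost:8080/completion",
        json={"prompt": "Write a haiku about GPUs.", "n_predict": 128},
    ).json()

    t = resp["timings"]
    print(f"prompt eval: {t['prompt_per_second']:.1f} tok/s over {t['prompt_n']} tokens")
    print(f"generation:  {t['predicted_per_second']:.1f} tok/s over {t['predicted_n']} tokens")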

Some people would rather not hand over all of their ability to think to a single SaaS company that arbitrarily bans people, changes token limits, tweaks harnesses and prompts in ways that cause it to consume too many tokens, or too few to complete the task, etc.

I don't use any non-FLOSS dev tools; why would I suddenly pay for a subscription to a single SaaS provider with a proprietary client that acts in opaque and user hostile ways?


I think we're seeing very clearly that the problem with the Cloud (as usual) is that it locks you into a service that only functions as long as the Cloud provides it.

But further, as we're seeing with Claude, your workflow, or backend, or both, aren't going anywhere if you're building on local models. They don't suddenly become dumb, stop responding, claim censorship, etc. Things are non-deterministic enough already that exposing yourself to the business decisions of cloud providers is just a risk-reward nightmare.

So yeah, privacy, but also knowing you don't have to constantly upgrade to another model because a provider forces it when whatever you're running is perfectly suitable; that's an untold amount of value. Imagine the early npm ecosystem, but driven by AI model FOMO.


Their latest, Qwen3.6 35B-A3B, is quite capable, and fast and small enough that I don't really feel constrained running it locally. Some of the others that I've run that seem reasonably good, like Gemma 4 31B and Qwen3.5 122B-A10B, just feel a bit too slow, OOM my system too often, or run up against cache limits and spend a lot of time re-processing history. But the latest Qwen3.6 is both quite strong and lightweight enough that it feels usable on consumer hardware.

It's a lighthearted, fun, visual benchmark that's not part of the standard benchmarks; and at least traditionally, it was not something that the labs trained on so it was something of a measure of how well the intelligence of the model generalized. Part of the idea of LLMs is that they pick up general knowledge and reasoning ability, beyond any tasks that they are specifically trained for, from the vast quantity of data that they are trained on.

Of course, a while back there was a Gemini release that I believe specifically called out its ability to produce SVGs, for illustration and diagramming purposes. So it's no longer necessarily the case that the labs aren't training on generating SVGs, and in fact, there's a good chance that even if they're not doing so explicitly, the RLVR process might be generating tasks like that as there is more and more focus on frontend and design in the LLM space. So while they might not be specifically training for a pelican riding a bicycle, they may well be training on SVG diagram quality.


Gah, the writing on this is so painful to read, it feels like this was most likely written by an LLM.

The writing style is so unclear that it's hard to figure out one of the key points: it mentions that Gemini doesn't use a distinct user-agent for its grounding. It doesn't mention whether it actually hit the endpoint during the test, though it kind of implies that with "Silence from Google is not evidence of no fetch." Uh, if there are no requests coming in live, that means no fetch; it's using a cache of your site.

It makes a difference whether it fetches a page live or uses a cached copy from a previous crawl; that tells you something about how up-to-date the answers will be for people asking Gemini questions about your website. But I guess the LLM writing this article just wanted to make things sound punchy and impressive, not actually communicate useful information.

Anyhow, LLM marketing spam from an LLM marketing spam company. Bleh.


I haven't seen an LLM write this poorly yet (at least not passed off as good writing). This seems more like a person who used AI to organize things, but then didn't want it to seem like it was written by AI, so they rewrote it themselves. I think the problem here is just a genuinely unskilled author, and likely not a native English speaker, judging by some of the awkward phrasing.

I did use AI to organize my ideas, but I didn't think it was that bad. I'll revise it and make it easier to read.

Anyway, in my test I saw zero requests from any Google UA after multiple Gemini and AI mode prompts that should have triggered grounding, so the working interpretation is that Gemini served from its own index/cache rather than doing a live provider-side fetch. The original phrasing was fuzzier than it should have been.
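
If anyone wants to repeat the check, it really is just scanning the access log for Google-ish user agents around the time of the prompts. A rough sketch; the log path and the UA substrings are assumptions, so adjust them for your server and for whatever crawler names Google documents when you test:

    # Minimal sketch: count hits from Google-ish user agents in an access log.
    # LOG path and GOOGLE_UAS substrings are placeholders, not authoritative.
    import re
    from collections import Counter

    LOG = "/var/log/nginx/access.log"  # hypothetical path
    GOOGLE_UAS = ["Googlebot", "GoogleOther", "Google-Extended", "APIs-Google"]

    counts = Counter()
    with open(LOG, encoding="utf-8", errors="replace") as f:
        for line in f:
            # In the common combined log format, the user agent is the last quoted field.
            m = re.search(r'"([^"]*)"\s*$', line)
            ua = m.group(1) if m else ""
            for token in GOOGLE_UAS:
                if token in ua:
                    counts[token] += 1

    print(counts or "no Google UA hits found")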


If this weren't on HN I wouldn't have given this more than a few seconds of reading before switching away. Some examples of phrasing that triggers me:

> attributing hits was a grep, not a guess

> values below are copied from the probe’s log file, not paraphrased

> a User-agent: Claude-User disallow is the live control

> Only Claude-User is the user-initiated retrieval signal

I could go on and on but I won't. Phrasing aside, the text is too structured with many sections and subsections when the intent was clearly more narrative. "I was curious about X and did Y and I am going to tell you about it."

Signals that suggest a human who cares would be: use of the first-person; demonstrated curiosity, humility, and uncertainty; inline hyperlinks; and any kind of personality or opinion.

"Idiolect" is both subtle and distinct: the choice of vocabulary, grammar, phrasing and colloquial metaphors will vary in kind and frequency for everyone like an intellectual signature. You can sometimes tell if someone has been reading too much of a particular author recently just because of the way the author's choice of vocabulary bleeds into their own speech patterns. Sometimes it's a permanent influence.

I wonder if reading so much LLM stuff lately has affected my idiolect and that I write (or worse, think) more machine-like than before...


> I wonder if reading so much LLM stuff lately has affected my idiolect and that I write (or worse, think) more machine-like than before...

Totally off topic ofc, but I always get triggered by the claim that LLMs are "machine-like". I'm aware it's a total pet peeve and a lil irrational, but "machine-like" would imply to me that they're thinking like a machine, which in turn implies machine intelligence, which in turn implies they're doing something they aren't.

I'm not trying to undersell their capabilities. Used well, they're able to do a lot of things. But the way they achieve it is by mimicking human dialogue and rhetorical processes. That's, in my opinion, anything but machine intelligence. I struggle to find an applicable word for it, though.


Sometimes when we point at the moon, people prefer to discuss the finger at length.

Don't worry.


If you point six index fingers and a bifurcated thumb at the moon, then many people will worry.

I had to quit after a couple of paragraphs; I can't read such AI slop anymore :(

The article has now been semi-protected to prevent vandalism by anonymous users.

Or just because mistakes are part of the distribution that it's trained on? Usually the averaging effect of LLMs and top-k selection provides some pressure against this, but occasionally some mistake like this might rise up in probability just enough to make the cutoff and get hit by chance.

I wouldn't really ascribe it to any "attempt to seem more human" when "nondeterministic machine trained on lots of dirty data" is right there.
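
As a toy illustration of "makes the cutoff": with top-k sampling the wrong token only has to land inside the top k once, and then an unlucky draw does the rest. The tokens and probabilities here are invented, purely to show the mechanism:

    # Toy illustration of top-k sampling letting a rare mistake through.
    # The tokens and probabilities are made up for the example.
    import random

    probs = {"reveal": 0.90, "revel": 0.04, "show": 0.03, "display": 0.03}
    k = 3

    top_k = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top_k)
    renormed = [(tok, p / total) for tok, p in top_k]

    # "revel" survives the cutoff, so roughly 4% of samples will pick it.
    tokens, weights = zip(*renormed)
    print(random.choices(tokens, weights=weights, k=10))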


Sure, but if that were the case, why has it gotten worse recently? I would expect it to be a result of cost optimization or tradeoffs in the model. I suppose it could be an indicator of the exhaustion of high-quality training data or a model architecture limitation. But this specific example, revel vs reveal, is almost like going back to GPT-2-era Reddit errors.

I also don’t want to pretend there is no incentive for AI to seem more human by including the occasional easily recognized error.


Or just the models are getting bigger and better at representing the long tail of the distribution. Previously errors like this would get averaged away more often; now they are capable of modelling more variation, and so are picking up on more of these kinds of errors.

That makes sense, but what is the solution?

There is a certain amount of this that is just the randomness of an LLM. You really want to ask most questions like this several times.

That said, I have several local models running on my laptop that I've asked this question 10-20 times each while testing out different parameters, and they've answered it consistently correctly.


I've run several local models that get this right. Qwen 3.5 122B-A10B gets this right, as does Gemma 4 31B. These are local models I'm running on my laptop GPU (Strix Halo, 128 GiB of unified RAM).

And I've been using this as a common test when changing various parameters, so I've run it several times; these models get it consistently right. Amazing that Opus 4.7 whiffs it; these models are a couple of orders of magnitude smaller, at least if the rumors about the size of Opus are true.


Does Gemma 4 31B run full res on Strix or are you running a quantized one? How much context can you get?

I'm running an 8 bit quant right now, mostly for speed as memory bandwidth is the limiting factor and 8 bit quants generally lose very little compared to the full res, but also to save RAM.

I'm still working on tweaking the settings; I'm hitting OOM fairly often right now. It turns out that the sliding window attention context is huge, and llama.cpp wants to keep lots of context snapshots.
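
The back-of-the-envelope math makes it obvious where the RAM goes: the KV cache grows with layers × kv heads × head dim × context × bytes per element, times however many snapshots get kept. The model dimensions below are placeholders (I don't have the real config in front of me); the growth is the point:

    # Back-of-the-envelope KV cache size. All model dimensions here are
    # placeholders, not the real Gemma 4 31B config.
    def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elt=2):
        # 2x for keys and values
        return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt / 2**30

    size = kv_cache_gib(n_layers=60, n_kv_heads=16, head_dim=128, ctx_len=131_072)
    print(f"one full-length cache: ~{size:.1f} GiB")   # ~60 GiB with these made-up numbers
    print(f"four snapshots kept around: ~{4 * size:.1f} GiB")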


I had a whole bunch of trouble getting Gemma 4 working properly. Mostly because there aren't many people running it yet, so there aren't many docs on how to set it up correctly.

It is a fantastic model when it works, though! Good luck :)

