Hacker News | new | past | comments | ask | show | jobs | submit | cadamsdotcom's comments

Sorry to ask a dumb question, but, why not move the tests into the repo?

Monorepos have many benefits; chief among them, being able to commit atomically reduces the incidental complexity that comes from drift.

It’s good enough for Google and Facebook!


My suggestion is to write a “human summary” of what you asked the agent and what it found, and maybe, just maybe, supply the code it generated. But mostly, recommend they not review it; instead, the reviewer gives the PR to their own agent to do a reimplementation.

Since code is cheap now, why not replace reviewing with reimplementing?


Thanks for that little rabbit hole!

Learned about a bunch of new emoji!


Nice work. But it only goes halfway.

It should loop the LLM’s results back on itself repeatedly, behind the scenes, until its writing is free of signs of slop. After your quality gates pass and the result is presented, it’d be cool to see a visualization of each of the agent’s drafts that the user can page through, to watch how the writing was gradually improved by the model!

No need to keep a human in the writing-improvement loop. Just show it when it’s slop free.


There is a one-shot rewrite function now, but you're right: even when asked to avoid some of the patterns, the LLM will stubbornly repeat them. It's a bit more reliable with smaller fragments of text.

I am saying: keep reflecting its attempts back on itself, over and over again, dozens of times if needed. We’ve seen it - any aligned model wants only to achieve its goal. But it does need to see all of its past attempts, and where and why each attempt got a failing grade. That’s just a standard conversation history.

It might spit back the same thing the first round. But after the first time it received the exact same feedback for saying the same thing, the model will realize it’s in a deterministic sandbox and try something different. You need to give it all of the conversation including its past attempts as context. If it tries the exact same wording that’s okay, it’s just one more invisible round of back-and-forth. The model is going to rediscover how to work with the harness every time, but that’s not your users’ problem because you’ve hidden that wrinkly bit behind the automation - they just see “model did 10 drafts and here’s the result - would you like to view the result or page through the drafts?”

What I am describing is exactly what a human would do; it is just automated, so getting to a good result becomes insanely faster.
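The loop described above is simple to automate. Here’s a minimal sketch; `generate` and `grade` are hypothetical stand-ins for your LLM call and your slop-detecting quality gate, not real APIs:

```python
def refine(prompt, generate, grade, max_rounds=10):
    """Loop the model's drafts back on itself until the grader passes one.

    The full history -- every past draft plus the feedback it earned --
    is resent each round, so the model can see why earlier attempts failed.
    """
    history = [{"role": "user", "content": prompt}]
    drafts = []
    for _ in range(max_rounds):
        draft = generate(history)
        drafts.append(draft)
        passed, feedback = grade(draft)
        if passed:
            return draft, drafts
        # Feed the failing draft and its feedback back into the context.
        history.append({"role": "assistant", "content": draft})
        history.append({"role": "user", "content": feedback})
    return drafts[-1], drafts  # best effort after max_rounds
```

The user only ever sees the final draft (plus, optionally, the list of intermediate drafts to page through); the back-and-forth stays hidden behind the automation.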


> he’s making .. mistakes

Claude and other LLMs do not have a gender; they are not a “he”. Your LLM is a pile of weights, prompts, and a harness; anthropomorphising like this is getting in the way.

You’re experiencing what happens when you sample repeatedly from a distribution. Given enough samples, the probability of eventually hitting a bad session approaches 100%.

Just clear the context, roll back, and go again. This is part of the job.
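To make the inevitability concrete: if each session independently goes bad with some probability p, the chance of at least one bad session in n sessions is 1 - (1 - p)^n, which tends to 1 as n grows. A tiny illustration (the 5% failure rate is an arbitrary assumption):

```python
def p_at_least_one_bad(p, n):
    """Probability of at least one bad session in n independent sessions,
    where each session goes bad with probability p."""
    return 1 - (1 - p) ** n

# With a 5% per-session failure rate, ~100 sessions make a bad one
# near-certain: p_at_least_one_bad(0.05, 100) is already above 0.99.
```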


Why be so upset at someone using pronouns with an LLM?

You are being downvoted but I actually agree with your statement.

Maybe like, don’t do that?

Moving the folder you’re in out from under yourself is okay if you know you did it - but if you don’t, you’re gonna get confused :) And so is an agent!


I'm very used to tools that keep track of connections to documents based on internal IDs, not folder structure. Relying on paths alone seems primitively brittle.

It seems even more absurd that it was so hard to get Codex to fix this for me. I managed to get it to solve the problem, but not before it got itself into a crazy loop of restarting the app, wanting to quit, quitting even when I canceled the quit dialog, and restarting over and over. After a reboot, my machine had sorted out the missing references for most of the projects, but wow.


In the gaps between the tops of the lines and the bottoms of the other lines ;)

Why not use Claude to write yourself a daily briefing, and deliver it to your email or someplace you’ll receive it?

That's what I ended up doing: I made my daily briefing a cron job that runs via "claude -p". I wired it up to make a podcast, with MCP tools I built: one creates an MP3 with OpenAI, another uploads it to one of my sites with an updated RSS feed, so I can listen in the AntennaPod podcast app each morning.

Nice. Even better would be having your agent write code for the deterministic bits and telling the agent it should “invoke the script called blah” to do uploads (or whatever you want to have happen deterministically).

Yep, I agree! My MCP tools are local compiled Go binaries, and the tool that uploads my podcast is actually a local Go CLI that Claude calls. Claude's main role / intelligence is in evaluating which of the morning's HN & Lobsters news is most relevant to me specifically, and writing the podcast script. I'm all for deterministic tools, and it saves on tokens too.

One advantage of splitting it into MCP tools, though: one day I ran out of pre-paid OpenAI TTS credit, and Claude was smart enough to try Mistral TTS instead. I could have built that fallback deterministically too, but it wasn't something I'd thought of yet.

I once had a friend tell me they'd got their AI to tell them the weather every morning... and the thought of that poor AI, web researching Weather APIs & writing a new python script to call the API every morning, instead of just doing the research once and making it a binary (or even just a curl line)... drove me crazy. All that wasted time and compute. Some people just like to watch tokens burn.


Raw AI output is dangerous to just use. And yet we do, because that’s 2026’s state of the art.

It’s like your raw thoughts - you wouldn’t act on them, you’d pass them through many filters you’ve designed over the course of your life.

This is a tricky harness engineering problem - but it’s solvable. We need deterministic shells around these things.

Don’t use raw AI output. Paste it back with feedback, or build tools and scripts that automate that self-reflection loop. Don’t ask it for financial advice; instead have it build - and then populate - financial models. Request that it use symbolic modeling to reason about problems (this nudge was all it took for Gemini to ace the “walk 50m to the car wash” question.) Ask it to contemplate its essay in the context of Wikipedia’s “signs of AI writing” article and clean it up a bit. Have it build you a tool that automates the “clean this essay up” step for you, so you only see cleaned up essays.

We all refine our work until it’s ready - the culture of AI use needs to mirror that.


Whoever designed that test forgot that kinetic energy is proportional to velocity squared.

A collision at 50 km/h carries roughly 1% of the energy of one at 500 km/h.
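The arithmetic: kinetic energy is ½mv², so a tenfold difference in speed is a hundredfold difference in energy, regardless of mass. A quick check (the 1000 kg mass is an arbitrary placeholder):

```python
def kinetic_energy(mass_kg, v_ms):
    """Kinetic energy in joules: 0.5 * m * v^2."""
    return 0.5 * mass_kg * v_ms ** 2

KMH_TO_MS = 1 / 3.6  # convert km/h to m/s

# Energy ratio between a 50 km/h and a 500 km/h collision.
# The mass cancels, leaving (50/500)^2, i.e. about 1%.
ratio = (kinetic_energy(1000, 50 * KMH_TO_MS)
         / kinetic_energy(1000, 500 * KMH_TO_MS))
```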


Apparently incapable of reading the description on a video.
