Hacker News | new | past | comments | ask | show | jobs | submit | pitched's comments

Securing agents in real time and testing them for drift in CI are pretty different use-cases…

This post is an AI-generated ad, isn’t it? It’s getting too hard to tell!


You’re right that I mixed runtime enforcement with CI drift/regression testing. Different layer, different job.

I meant it as complementary, not equivalent. CrabTrap for runtime control, EvalView for deterministic testing/diffing. My bad on making it sound like a drive-by promo.


What do you think about slow rollouts for new features? Like, we think this new push notification system will be loved, but let’s ship to only 1% of users in case there’s a horrible unforeseen consequence like occasional 10-minute delays. Dashboard goes upside down -> revert, then work through the logs to figure out what the hell went wrong.
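
The 1% bucketing part is easy to make deterministic by hashing user IDs, so the same users stay in (or out of) the rollout on every request and a revert needs no per-user state. A minimal sketch — the feature name and percentages are made up:

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Deterministically bucket a user into a staged rollout.

    Hashing (feature + user_id) gives a stable per-user, per-feature
    bucket in [0, 1); users below the threshold see the new code.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map hash prefix to [0, 1]
    return bucket < percent / 100.0

# Ship the hypothetical new push pipeline to ~1% of users:
enabled = [u for u in (f"user{i}" for i in range(10_000))
           if in_rollout(u, "push_v2", 1.0)]
```

Ramping up is then just raising `percent`; everyone already enabled stays enabled.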

What do you think of things you purchased changing over time into something you didn't purchase?

That's literally any software subscription ever.

So you're perfectly okay with repeatedly paying for a shit product, getting shat on by the company in the form of being tested for feedback, and "maybe" getting a better product in the future. Mind you, that "better" isn't necessarily better for you but more explicitly better for the company you're paying.

Sounds like someone who doesn't care about being a sheep. Or maybe someone whose salary depends on having sheep.


I am not getting shat on by them working to improve the product. Have you got a screw loose?

I think you are making far too sweeping statements. I think most people here would agree that if Anthropic drops Claude Code from the Pro plan after people have paid with the understanding that it is part of the package, that would be wrong, and they deserve to lose business over it. However, there are plenty of situations where A/B testing is entirely benign, and I would not have any problem with a company doing that testing without getting consent first. Not every form of A/B testing is done just for the gain of the company doing it.

Well, duh, that's precisely what's wrong with subscriptions.

That the product might improve?

In my experience, improvements are so rare compared to regressions, they might as well not exist.

> is the mechanism you'd build if you wanted to shape what a billion users read without them noticing.

A pretty large accusation at the end. That no specific word swaps were given as examples beyond the first one makes it feel more like clickbait than a real concern, though.


I’ve been keeping them open in tmux and using either send-keys or the paste buffer for communication. Using print mode and always resuming the last session means you can’t have parallel systems going.
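
For reference, the send-keys path is easy to script; a small wrapper, assuming tmux is installed and using a hypothetical session/pane name:

```python
import subprocess

def tmux_send(target: str, text: str, run: bool = True) -> list[str]:
    """Type `text` into a tmux pane and press Enter.

    `target` is a tmux target spec like "agent1:0.0"
    (session:window.pane). With run=False this just builds the
    command without invoking tmux, handy for dry runs.
    """
    cmd = ["tmux", "send-keys", "-t", target, text, "Enter"]
    if run:
        subprocess.run(cmd, check=True)
    return cmd

# e.g. tmux_send("agent1:0.0", "continue with the next task")
```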

I just switched fully into Codex today, off of Claude. The higher usage limits were one factor but I’m also working towards a custom harness that better integrates into the orchestrator. So the Claude TOS was also getting in the way.

For a business with ten or more engineers/people using AI, it might still make sense to set this up. For an individual, though, I can’t imagine you’d make it through to positive ROI before the hardware ages out.

It's hard to tell for sure because the local inference engines/frameworks we have today are not really that capable. We have barely started exploring the implications of SSD offload, saving KV-caches to storage for reuse, setting up distributed inference in multi-GPU setups or over the network, making use of specialty hardware such as NPUs etc. All of these can reuse fairly ordinary, run-of-the-mill hardware.

Since you need at least a few H100-class cards, I guess you need at least a few tens of coders to justify the cost.

I see the 512GB Mac Studios aren’t for sale anymore, but that was a much cheaper path.

For a 30B model, you want at least 20GB of VRAM and a 24GB MBP can’t quite allocate that much of it to VRAM. So you’d want at least a 32GB MBP.
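
The gap comes from macOS capping GPU-wired memory at a fraction of unified memory by default (the cap varies by machine, roughly 65-75%, and can be raised with the `iogpu.wired_limit_mb` sysctl). A rough sanity check, assuming a 75% default cap for illustration:

```python
def default_vram_cap_gb(unified_gb: float, frac: float = 0.75) -> float:
    """Approximate default GPU-wired memory cap on Apple Silicon.

    The real fraction varies by machine and OS version; 0.75 here
    is an assumption for illustration, not a measured value.
    """
    return unified_gb * frac

# 24GB machine: ~18GB usable as VRAM, short of the ~20GB a 30B quant wants
# 32GB machine: ~24GB, comfortable
```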

I have 24GB of VRAM available and haven't yet found a decent model or combination. The last one I tried was Qwen with Continue; I guess I need to spend more time on this.

Is there any model that practically compares to Sonnet 4.6 in code and vision and runs on home-grade (12GB-24GB) cards?

I'm currently running a custom Gemma4 26b MoE model on my 24GB M2... super fast, and it beat DeepSeek, ChatGPT, and Gemini in 3 different puzzle/code challenges I tested it on. The issue now is the low context: I can only do 2048 tokens with my VRAM. The gap is slowly closing on the frontier models.

It's a MoE model so I'd assume a cheaper MBP would simply result in some experts staying on CPU? And those would still have a sizeable fraction of the unified memory bandwidth available.

I haven’t tried this myself yet, but you would still need enough non-VRAM RAM available to the CPU to offload to, right? This is a fully novice question; I have never tried it.

You're correct. If you don't have enough RAM for the model, it can still run but most of it will run on the CPU and be continuously reloaded from the SSD (through mmap).

A medium MoE like 35B can still achieve usable speeds in that setup, mind you, depending on what you're doing.


I want to bump this more than just a +1 by recommending everyone try out OpenCode. It can still run on a Codex subscription so you aren’t in fully unfamiliar territory but unlocks a lot of options.

The Codex TUI harness is also open source and you can use open models with it, so you can stay in even more familiar territory.

pi-coding-agent (pi.dev) is also great. I've been using it with Gemma 4 and Qwen 3.6.

Running an open model like Kimi constantly for an entire month will cost around $100-200, roughly equal to a pro-tier subscription. This is not my estimate, so I’m more than open to hearing refutations. Kimi isn’t at all Opus-level intelligent, but the models are roughly evenly sized from the guesses I’ve seen. So I don’t think it’s the infra being subsidized as much as it’s the training.

Kimi costs $0.30/$1.72 (input/output, per million tokens) on OpenRouter; $200 at those rates gives you way more than you would get out of a $200 Claude subscription. There are also various subscription plans you can use to spend even less.
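
Taking those OpenRouter rates at face value, the monthly math is easy to sanity-check; the usage numbers below are hypothetical:

```python
def monthly_cost(input_mtok: float, output_mtok: float,
                 in_rate: float = 0.30, out_rate: float = 1.72) -> float:
    """API cost in dollars; rates are $/1M tokens as quoted upthread."""
    return input_mtok * in_rate + output_mtok * out_rate

# A heavy agentic month, say 500M input + 30M output tokens:
# 500 * 0.30 + 30 * 1.72 = 150 + 51.6 -> about $201.60
```

Agentic coding is input-heavy (context re-sent every turn), which is why the low input rate dominates the comparison against a flat subscription.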

I’m using Composer 2, Cursor’s model they built on top of Kimi, and it’s great. Not Opus level, but I’m finding many things don’t need Opus level.

It's all I use at work and I've yet to find anything it can't handle. Then again, I'm a principal engineer and I already have designs in mind, so I'm giving it careful instruction and checking its work every time.

How do you get anything sensible out of Kimi?

Now that Anthropic have started hiding the chain-of-thought tokens, it will be a lot harder for them.

Anthropic and OpenAI never showed the true chain of thought tokens. Ironically, that's something you only get from local models.
