You’re right that I mixed runtime enforcement with CI drift/regression testing. Different layer, different job.
I meant it as complementary, not equivalent. CrabTrap for runtime control, EvalView for deterministic testing/diffing. My bad on making it sound like a drive-by promo.
What do you think about slow rollouts for new features? Like, we think this new push notification system will be loved, but let's ship to only 1% of users in case there's a horrible unforeseen consequence, like occasional 10-minute delays. Dashboard goes upside down -> revert, then work through the logs to figure out what the hell went wrong.
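The usual way to build that kind of canary gate is a stable hash of the user ID against a rollout percentage, so raising the percentage only adds users and never reshuffles anyone. A minimal sketch (function and feature names are hypothetical):

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Deterministically bucket a user into a feature rollout.

    Hashing (feature, user_id) gives each user a stable bucket in
    [0, 100); a user is enabled when their bucket falls below
    `percent`, so bumping 1% -> 5% keeps the original 1% enabled.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000 / 100.0
    return bucket < percent

# Ship the new push system to ~1% of users; revert by setting percent to 0.
enabled = in_rollout("user-42", "push-notifications-v2", 1.0)
```

Reverting is then just flipping the percentage back to 0 while you dig through the logs.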
So you're perfectly okay with repeatedly paying for a shit product, getting shat on by the company in the form of being tested for feedback, and "maybe" getting a better product in the future. Mind you, that "better" isn't necessarily better for you but more explicitly better for the company you're paying.
Sounds like someone who doesn't care about being a sheep. Or maybe someone whose salary depends on having sheep.
I think you are making far too sweeping a statement. Most people here would probably agree that if Anthropic dropped Claude Code from the Pro plan after people had paid with the understanding that it was part of the package, that would be wrong, and they would deserve to lose business over it. However, there are plenty of situations where A/B testing is entirely benign, and I would have no problem with a company running that testing without getting consent first. Not every form of A/B testing is done solely for the gain of the company running it.
> is the mechanism you'd build if you wanted to shape what a billion users read without them noticing.
A pretty large accusation at the end. That no specific word swaps were given as examples beyond the first one makes it feel more like clickbait than a real concern, though.
I’ve been keeping them open in tmux and using either send_keys or paste buffer for communication. Using print mode and always resume last means you can’t have parallel systems going.
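For anyone curious, driving panes this way mostly comes down to shelling out to `tmux send-keys` with a `session:window.pane` target. A rough sketch of the idea (session and pane names are made up):

```python
import subprocess

def build_send_keys(target: str, text: str, press_enter: bool = True) -> list[str]:
    """Build the argv for `tmux send-keys` aimed at one pane.

    `target` uses tmux's session:window.pane syntax, e.g. "agents:0.1".
    The trailing "Enter" key is what actually submits the line.
    """
    cmd = ["tmux", "send-keys", "-t", target, text]
    if press_enter:
        cmd.append("Enter")
    return cmd

def tmux_send(target: str, text: str) -> None:
    """Send a line of input to a running pane (requires tmux)."""
    subprocess.run(build_send_keys(target, text), check=True)

# e.g. tmux_send("agents:0.1", "run the test suite")
```

Because each pane is addressed independently, this is also what makes parallel sessions workable, unlike print mode with always-resume-last.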
I just switched fully into Codex today, off of Claude. The higher usage limits were one factor but I’m also working towards a custom harness that better integrates into the orchestrator. So the Claude TOS was also getting in the way.
For a business with ten or more engineers/people-using-ai, it might still make sense to set this up. For an individual though, I can’t imagine you’d make it through to positive ROI before the hardware ages out.
It's hard to tell for sure because the local inference engines/frameworks we have today are not really that capable. We have barely started exploring the implications of SSD offload, saving KV-caches to storage for reuse, setting up distributed inference in multi-GPU setups or over the network, making use of specialty hardware such as NPUs etc. All of these can reuse fairly ordinary, run-of-the-mill hardware.
I have 24GB VRAM available and haven't yet found a decent model or combination.
The last one I tried was Qwen with Continue; I guess I need to spend more time on this.
I'm currently running a custom Gemma4 26B MoE model on my 24GB M2... super fast, and it beat DeepSeek, ChatGPT, and Gemini in 3 different puzzles/code challenges I tested it on. The issue now is the low context... I can only do 2048 tokens with my VRAM... the gap is slowly closing on the frontier models.
It's a MoE model so I'd assume a cheaper MBP would simply result in some experts staying on CPU? And those would still have a sizeable fraction of the unified memory bandwidth available.
I haven't tried this myself yet, but you would still need enough non-VRAM system RAM for the CPU to offload to, right? This is a fully novice question; I have never tried it.
You're correct. If you don't have enough RAM for the model, it can still run but most of it will run on the CPU and be continuously reloaded from the SSD (through mmap).
A medium MoE like 35B can still achieve usable speeds in that setup, mind you, depending on what you're doing.
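Back-of-envelope, decode speed in that setup is roughly memory bandwidth divided by the bytes touched per token, and a MoE only touches its active experts, which is why it stays usable on CPU. A rough sketch with assumed numbers (the parameter counts, quantization, and bandwidths below are illustrative, not measured):

```python
def tokens_per_sec(active_params_b: float, bytes_per_param: float,
                   bandwidth_gb_s: float) -> float:
    """Crude decode-speed ceiling: each generated token must stream
    every *active* parameter through memory once, so throughput is
    bandwidth divided by active bytes per token."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical ~35B MoE with ~3B active params at 4-bit (~0.5 bytes/param):
gpu_ceiling = tokens_per_sec(3.0, 0.5, 1000)  # ~1000 GB/s VRAM -> ~667 tok/s
cpu_ceiling = tokens_per_sec(3.0, 0.5, 80)    # ~80 GB/s DDR5   -> ~53 tok/s
```

Even the CPU-bound ceiling is well above reading speed, which matches the "usable speeds" claim; a dense 35B would touch >10x the bytes per token and would not be.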
I want to bump this more than just a +1 by recommending everyone try out OpenCode. It can still run on a Codex subscription so you aren’t in fully unfamiliar territory but unlocks a lot of options.
Running an open model like Kimi constantly for an entire month will cost around $100-200, roughly equal to a pro-tier subscription. This is not my estimate, so I'm more than open to hearing refutations. Kimi isn't anywhere near Opus-level intelligent, but the models are roughly evenly sized from the guesses I've seen. So I don't think it's the infra being subsidized so much as the training.
Kimi costs $0.30/$1.72 per million input/output tokens on OpenRouter; $200 at those rates gets you way more than you would get out of a $200 Claude subscription. There are also various subscription plans you can use to spend even less.
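The arithmetic behind that comparison, using the quoted $0.30 input / $1.72 output pricing (the 3:1 input:output token split below is my assumption, not from the thread):

```python
def monthly_tokens_for_budget(budget_usd: float, in_price_per_m: float,
                              out_price_per_m: float,
                              in_out_ratio: float = 3.0) -> float:
    """Total tokens a monthly budget buys, assuming a fixed
    input:output token ratio (agentic use skews heavily to input)."""
    in_frac = in_out_ratio / (in_out_ratio + 1)
    # Blended cost per 1M tokens for this input/output mix.
    blended = in_frac * in_price_per_m + (1 - in_frac) * out_price_per_m
    return budget_usd / blended * 1e6

total = monthly_tokens_for_budget(200, 0.30, 1.72)  # ~305M tokens/month
```

At a heavier output mix the blended rate rises toward $1.72/M and the total drops accordingly, but it stays in the hundreds of millions of tokens either way.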
It's all I use at work and I've yet to find anything it can't handle. Then again, I'm a principal engineer and I already have designs in mind, so I'm giving it careful instruction and checking its work every time.
This post is an AI-generated ad, isn’t it? It’s getting too hard to tell!