More

davebren · 2026-04-18T16:34:52 1776530092

You copied two human coded native apps into a vibe coded react app? If the vibe coding is so good why wouldn't you keep the native apps and vibe code on top of them instead of spending a bunch of money to reach feature parity with a worse version?

davebren · 2026-04-17T21:42:55 1776462175

Remember, every product they release expands the scope of their non-compete clause, and they like their lawsuits.

davebren · 2026-04-17T19:28:21 1776454101

No it prevents businesses from selling directly from their site at a discount, and eliminates any incentive consumers have to purchase a product outside of amazon. It's one of the ways they became a monopoly, in addition to selling at a loss until all the small businesses were forced to close.

davebren · 2026-04-17T15:48:45 1776440925

You can have legitimate use cases where it's a core functionality of the application to store it, so the user obviously knows it's being collected and agrees by using it.

foresto · 2026-04-18T01:33:03 1776475983

Storage accessible only to the user is usually not what we mean when we say data collection.

davebren · 2026-04-18T16:29:27 1776529767

If data is collected and stored on a server, that is what we mean when we say data collection.

davebren · 2026-04-12T00:35:41 1775954141

This exploiting of benchmarks isn't that interesting to me since it would be obvious. The main way I assume they're gaming the benchmarks is by creating training data that closely matches the test data, even for ARC where the test data is secret.

jmalicki · 2026-04-12T00:41:01 1775954461

They said they used things like submitted a `conftest.py` - e.g. what would be considered very blatant cheating, not just overfitting/benchmaxxing. Did you read the AI slop in the post?

This is basically a paper about security exploits for the benchmarks. This isn't benchmark hacking like having hand coded hot paths for a microbenchmarks, this is hacking like modifying the benchmark computation code itself at runtime.

davebren · 2026-04-12T00:49:08 1775954948

I get it, but why would anyone trust what these companies say about their model performance anyway. Everyone can see for themselves how well they complete whatever tasks they're interested in.

davebren · 2026-04-11T19:52:40 1775937160

As cooked as we were pre-LLMs knowing that security exploits are relatively easy to learn about online and use, yet things keep chugging along.

dominicq · 2026-04-11T20:04:21 1775937861

This would just speed up the discovery -> patch cycle, at least until such time that all the low hanging fruit (=represented in training data) is patched.

Though another possibility would be that since LLMs generate so much code, the LLM vulnerability discovery would just keep chugging along and we'd simply settle for the same amount of potential vulns, same relative vulnerability-exploit-patch dynamics, though higher in absolute numbers.

davebren · 2026-04-11T19:48:34 1775936914

> If smaller models can find these things, that doesn’t mean mythos is worse than we thought. It means all models are more capable.

It means "it's so dangerous we can't release it" was a blatant lie since anthropic would have already known this.

pertymcpert · 2026-04-12T08:34:27 1775982867

No one seems to have actually read the system card all the way through.

The reason they didn't publish it was that it's orders of magnitude more successful at writing exploits vs Opus 4.6, which only managed it something like 2% of the time.

bryantwolf · 2026-04-11T22:06:08 1775945168

Sure, I think it’s reasonable to tell Anthropic the barn door is already open.

Though, like, I guess I expect that when this comes out, all the opus traffics will move over. It does appear to be much more capable, just jury is out about how much more capable

davebren · 2026-04-11T19:44:49 1775936689

It should at least get the same coverage anthropic got then, if not more.

davebren · 2026-04-11T19:43:16 1775936596

Seems perfectly comparable to anthropic's method, they just wrapped the same kind of prompt in a for loop.

davebren · 2026-04-10T19:33:29 1775849609

You've got to admit that crying wolf about how dangerous their new model is for the hundredth time right when the biggest story about the company was a leak that made them and their internal vibe-coding look totally incompetent is a bit suspect.

skybrian · 2026-04-10T19:58:21 1775851101

Your cynicism doesn't prove that it's fake, though.

davebren · 2026-04-10T20:43:20 1775853800

You got the causality backwards, the cynicism is because it's most likely to be fake.

skybrian · 2026-04-10T22:43:28 1775861008

A lesson of the parable about "crying wolf" is that cynicism based on previous events doesn't prove that the next event is fake. The people who ignored the warning may have thought it "most likely," but they were wrong.

nothinkjustai · 2026-04-11T01:10:40 1775869840

Never use previous actions to predict future actions, genius advice.

skybrian · 2026-04-11T14:55:29 1775919329

It’s more about when your priors are so strong that it’s not worth paying attention to a new report. Clearly not in this case.

davebren · 2026-04-11T20:05:00 1775937900

The point of the parable is not about the problem of induction but about how lying erodes trust.