You copied two human coded native apps into a vibe coded react app? If the vibe coding is so good why wouldn't you keep the native apps and vibe code on top of them instead of spending a bunch of money to reach feature parity with a worse version?
No it prevents businesses from selling directly from their site at a discount, and eliminates any incentive consumers have to purchase a product outside of amazon. It's one of the ways they became a monopoly, in addition to selling at a loss until all the small businesses were forced to close.
You can have legitimate use cases where it's a core functionality of the application to store it, so the user obviously knows it's being collected and agrees by using it.
This exploiting of benchmarks isn't that interesting to me since it would be obvious. The main way I assume they're gaming the benchmarks is by creating training data that closely matches the test data, even for ARC where the test data is secret.
They said they used things like submitted a `conftest.py` - e.g. what would be considered very blatant cheating, not just overfitting/benchmaxxing. Did you read the AI slop in the post?
This is basically a paper about security exploits for the benchmarks. This isn't benchmark hacking like having hand coded hot paths for a microbenchmarks, this is hacking like modifying the benchmark computation code itself at runtime.
I get it, but why would anyone trust what these companies say about their model performance anyway. Everyone can see for themselves how well they complete whatever tasks they're interested in.
This would just speed up the discovery -> patch cycle, at least until such time that all the low hanging fruit (=represented in training data) is patched.
Though another possibility would be that since LLMs generate so much code, the LLM vulnerability discovery would just keep chugging along and we'd simply settle for the same amount of potential vulns, same relative vulnerability-exploit-patch dynamics, though higher in absolute numbers.
No one seems to have actually read the system card all the way through.
The reason they didn't publish it was that it's orders of magnitude more successful at writing exploits vs Opus 4.6, which only managed it something like 2% of the time.
Sure, I think it’s reasonable to tell Anthropic the barn door is already open.
Though, like, I guess I expect that when this comes out, all the opus traffics will move over. It does appear to be much more capable, just jury is out about how much more capable
You've got to admit that crying wolf about how dangerous their new model is for the hundredth time right when the biggest story about the company was a leak that made them and their internal vibe-coding look totally incompetent is a bit suspect.
A lesson of the parable about "crying wolf" is that cynicism based on previous events doesn't prove that the next event is fake. The people who ignored the warning may have thought it "most likely," but they were wrong.
reply