My anecdata is that it heavily depends on how much of the relevant code and instructions it can fit in the context window.
A small app, or a task that touches one clear smaller subsection of a larger codebase, or a refactor that applies the same pattern independently to many different spots in a large codebase - the coding agents do extremely well, better than the median engineer I think.
Basically "do something really hard on this one section of code, whose contract of how it interacts with other code is clear, documented, and respected" is an ideal case for these tools.
As soon as the codebase is large and there are gotchas, edge cases where one area of the code affects the other, or old requirements - things get treacherous. It will forget something was implemented somewhere else and write a duplicate version, it will hallucinate what the API shapes are, it will assume how a data field is used downstream based on its name and write something incorrect.
IMO you can still work around this and move net-faster, especially with good test coverage, but you certainly have to pay attention. Larger codebases also work better when you started them with CC from the beginning, because its older code is more likely to actually work the way the model expects/hallucinates.
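One cheap guardrail for the "assumes how a data field is used based on its name" failure mode (a minimal sketch with made-up names, not any real API): a contract test that pins the exact data shape, so an agent that hallucinates a field fails loudly instead of shipping something subtly wrong.

```python
# Hypothetical example: pin a data contract with a test so an
# agent-written change that guesses a field name breaks immediately.

def normalize_user(record: dict) -> dict:
    # The real upstream field here is "email_address"; an agent that
    # guesses "email" from context would hit a KeyError under test.
    return {"id": record["id"], "email": record["email_address"].lower()}

def test_normalize_user_contract():
    # A representative record, with the exact expected output asserted.
    sample = {"id": 7, "email_address": "Ada@Example.com", "extra": "ignored"}
    assert normalize_user(sample) == {"id": 7, "email": "ada@example.com"}

test_normalize_user_contract()
```

The point isn't the specific function; it's that tests encoding the real shapes are what let you move net-faster with an agent that sometimes guesses.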
In a well-designed system, you can point an agent at a module of that system and it's perfectly capable of dealing with it. Humans also have a limited context window, and divide and conquer is always how we've dealt with it. The same approach works for agents.
The consumer surplus is quite high. Even with the regressions in this postmortem, performance was above that of the models last fall, when I was gladly paying for my subscription and thought it was net saving me time.
That said, there is now much better competition with Codex, so there's only so much rope they have now.
Is this cash or compute? Elon has one of the world's biggest compute clusters spun up, and little inference demand to speak of.
Trading billions worth of idle compute, in exchange for a high-strike call option on the #3 player in the most-promising-vertical for AI, plus (presumably) some access to their data, starts to sound like not a bad trade. Especially if you're pre-committed to betting your entire rocket company on winning in AI, and you're currently in sixth or seventh place.
It's true he could write off xAI today and the company could still fetch a trillion-dollar valuation. But I was more referring to his stated intentions - between his stated plans, his actions in taking SpaceX from a profitable company to one spending basically all its revenue (plus a rumored large chunk of what's raised via its IPO) on AI, and his tendency to make bet-the-farm bets on Tesla, I think it's fair to say he's committed to betting all of SpaceX on xAI.
I heard he made a deal with a company to use his clusters. Is there good data on demand for Grok? Seems like relatively little chatter at least, in spite of tremendous investment.
He had a very close, decades-long friendship with the most notorious sex-trafficker-of-children-to-rich-creeps in modern history. And that infamous pedophile died in a federal prison under Trump's control, with a strange gap in the CCTV footage. And Trump's handling of the entire Epstein Files saga makes it clear that Trump is described extensively in those files and he desperately wants to conceal it. What could be in there that he would use the entire Justice Department to try and redact? Trump is shameless about things that are legal even if they're salacious (like sleeping with porn star Stormy Daniels), so you have to wonder: what could Jeffrey Epstein's good friend be trying to conceal?
Also, he owned the Miss Universe org (including Miss USA and Miss Teen USA) for decades, and he was known to walk into the dressing rooms of teen contestants as young as 15 while they were undressed. [0]
Also, he bragged about molesting women, and a court of law found that he sexually assaulted E Jean Carroll.
I haven't proven the case that Trump had sex with a minor, but there's way more than enough probable cause to believe it's more likely than not.
Imagine there's a camera continuously recording a cookie jar. A child eats all of the cookies and then deletes the footage from the time they ate the cookies. A parent returns to find their child covered in crumbs, loudly proclaiming they haven't eaten a cookie in years, actively interfering with the parent's investigation, and trying to distract from it by throwing a brick through the window of an Iranian family down the street.
Are any of the facts in this hypothetical "evidence"? With the knowledge of the truth (that the kid ate the cookies), it's clear these are all relevant pieces of evidence. If we take knowledge of the truth out of the equation, would these facts still be evidence? Unambiguously they would.
Definitionally both circumstantial and direct evidence are forms of evidence. No modifier is necessary.
And incidentally you can be convicted in a court of law purely on circumstantial evidence, and that's the place in society where we have the highest standard of proof. The evidence all being circumstantial is not a gotcha.
This isn't court. The evidence, such as it is, is all of the smoke which commonly motivates people to look for fire. The strongest and most comprehensive that I've seen is the argument that if Trump was not implicated in the Epstein files, he would be publishing them in free book form himself and forcing every media outlet to advertise it. Slight exaggeration, but I think truly only slight.
Not really relevant to the thread, but there are simple answers to the "eViDeNcE??" question. You may have already known this.
Has the availability of deepfake porn generation reduced the demand for deepfake porn featuring real people? When deepfake generators are capable of creating convincing imagery of flawless ideal fake humans, why do you suppose there’s so many real humans who report being non-consensual subjects of deepfake porn?
> Has the availability of deepfake porn generation reduced the demand for deepfake porn featuring real people?
yes
> When deepfake generators are capable of creating convincing imagery of flawless ideal fake humans, why do you suppose there’s so many real humans who report being non-consensual subjects of deepfake porn?
> Doesn't have to be. You can train it on normal pictures of children and nude images of adults.
You say this so casually, as though it were a normal thing to know, or as if a normal person would know it. Does that actually seem true where you live right now?
And how do you know that, anyway, Harsh? I mean, all those "unblocked" games you stole to give away and that you also put on Github, that's one thing. But this...
Come on, it's not hard to come up with this idea. And it's not even true; a model trained on clothed children and nude adults wouldn't know what children's genitals look like.
Yes, cost per successful task is rising - ie, we are all paying effectively more for AI.
And yet - Anthropic is still struggling to have enough capacity to serve demand - they are virtually sold out.
And yes, there are almost-as-good open models, on par with the closed models from 6 months ago (at worst), that are just a single OpenRouter API call away, and yet Anthropic is still selling out. So people are paying for the premium product anyway, for whatever reason - maybe the last bit of intelligence is worth it, maybe they like the harnesses/products around the models, maybe it's a brand/enterprise sales thing.
Put aside your feelings about the AI industry and imagine we are talking about thingamajigs. Prices for thingamajigs are going up. They are still selling out as fast as (or faster than) the company selling them can build factories. There are more cost-effective competitors already in the market, but thingamajigs are selling out anyway.
Would you, looking at the thingamajig industry, conclude the "jig is almost up"? That "the returns aren’t anywhere close to what investors expect" and that the impending IPO is all some desperate hail mary to save things before the collapse?
I don’t have feelings about the AI industry to put aside. I would not have sufficient information to assess whether thingamajigs are legitimately valuable or whether they are tulips. The only indicator I see is the last point about people using it in the short term despite having access to cost effective alternatives, which actually points to irrationality/FOMO more than legitimate value.
What we are looking at looks to me like it is rapidly becoming a commodity: it will become as essential to businesses as electricity and water, and it will be sold, marketed, and regulated more or less like a utility.
I agree, but also the model intelligence is quite spiky. There are areas of intelligence that I don't care at all about, except as proxies for general improvement (this includes knowledge-based benchmarks like Humanity's Last Exam, as well as proving math theorems, etc.). There are other areas of intelligence where I would gladly pay more, even 10X more, if it meant meaningful improvements: tool use, instruction following, judgement/"common sense", learning from experience, taste, etc. Some of these are seeing some progress; others seem inherently limited by the current LLM + chain-of-thought reasoning paradigm.
The models that we are paying to generate tokens are already not really just LLMs, as anyone studying language models ten years ago (or someone who describes them as "next token predictors") would understand them. Doing a bunch of reinforcement learning so that a model performs better at ssh'ing into my server and debugging my app is already realllly stretching the definition of "language pattern".
I think when we do get AI that can perform as well as a human at functionally all tasks, they will be multi-paradigm systems; some components will not resemble anything in any commercial system today, but one component will be recognizably LLM-like, and act as an essential communication layer.
Different users do seem to be encountering problems or not based on their behavior, but for a rapidly-evolving tool with new and unclear footguns, I wouldn't characterize that as user error.
For example, I don't pull in tons of third-party skills, preferring to have a small list of ones I write and update myself, but it's not at all obvious to me that pulling in a big list of third-party skills (like I know a lot of people do with superpowers, gstack, etc...) would cause quota or cache miss issues, and if that's causing problems, I'd call that more of a UX footgun than user error. Same with the 1M context window being a heavily-touted feature that's apparently not something you want to actually take advantage of...
I'm pretty optimistic that not only does this clean up a lot of vulns in old code, but applying this level of scrutiny becomes a mandatory part of the vibecoding-toolchain.
The biggest issue is legacy systems that are difficult to patch in practice.
I could see some of these corps now being able to issue more patches for old versions of software if they don't have to redirect their key devs onto prior code (which devs hate). As you say though, in practice it is hard to get those patches onto older devices.
I'm looking at you, Android phone makers with 18 months of updates.
Of course not, but there is infinitely more vulnerable software escaping Anthropic's scrutiny. And when AI-powered discovery becomes a necessity, that will lead to concentration of power in these kinds of companies.
Bruce Schneier made a comprehensive analysis of the pros and cons and the forces at play for adversaries and defenders [1].
I think it's safe to predict yet more money previously directed to us techies will find its way to the Anthropics of this world.
I imagine that some levels of patching would be improving as well, even as a separate endeavor. This is not to say that legacy systems could be completely rewritten.
If we have the source and it's easy to test, validate, and deploy an update - AI should make those easier to update.
I am thinking of situations where one of those isn't true - where testing a proposed update is expensive or complicated, or in systems that are hard to physically push updates to (think embedded systems), etc.
I feel like every new iteration of ways to find good content online - webrings, blogrolls, user upvoting/downvoting, giving everyone their own microblog to share interesting links, ML to learn your preferences from your behavior - worked really well at first, but then eroded significantly once people figured out how to game it.
The economic incentive is overwhelming to corrupt these signals, either directly (link sharing schemes, upvote rings, bots to like your content) or indirectly (shaping your content itself to have the shape of what will be promoted, regardless of its quality).
What you almost want is to use any of these ideas and hope for it to catch on widely enough in your small niche to be useful, but not so much that it becomes an optimization target.
Smolnet might be the answer. There really isn't a feasible mechanism for monetizing it. At worst, you could have some text ad embedded. No images. Minimal semantic markup (links, lists, quotes, code, generic text) in the case of gemini/gemtext.
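For reference, that whole gemtext vocabulary fits in a handful of line types (a made-up sample page, not any real capsule):

```
# A heading
Plain text needs no markup at all.
=> gemini://example.org/log.gmi A link, one per line
* A list item
> A quoted line
```

Plus a triple-backtick toggle for preformatted blocks, and that's the entire format - nowhere for trackers or ad units to hide.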
It's CNBC for Silicon Valley - a combination of good background noise, a broad survey of what people are talking about around the valley, and occasionally really great interviews.
They get a lot of guests to do interviews that they wouldn't do elsewhere, in part because they are unabashedly and unapologetically cheerleaders - pro-tech, pro-VC, pro-startup, pro-Big-Tech, etc. They don't grill you like an old-school journalist would about whatever the latest political controversy is, they ring a giant gong when their guest brings up a cool traction or fundraising number.
I would never use it as my only source of news for what's going on in tech, but with a lot of other tech journalism covering the downsides or problems with the industry, there is definitely a niche for them.
Just based on the number of very prominent guests they get to do interviews, they clearly have a lot of viewers in influential tech/vc circles, even if their total audience size isn’t huge.
That's true, but a lot of these people are also competitors. I can't imagine it'll be attractive going to the OpenAI media channel to talk about Gemini or Grok.