Imnimo's comments

>But this is where the line slightly blurs in my head. Did we possibly just build the first human biocomputer and immediately put it in a simulated hell, playing the same game on loop, forever? Using the same reward mechanisms we use for LLMs?

This description does not seem to really match what was done in the Doom demo, and makes me skeptical that the author has actually looked into the details.


Author clearly doesn't know the field well at all. The first few paragraphs reveal this. Opening sentence: "I've been in the AI space since ChatGPT first dropped."

Everyone is allowed to have an opinion, but that doesn't mean they're all worth listening to. Unfortunately, right now, all of those opinions are about AI.


> since ChatGPT first dropped.

That'd be in November of 2022.

https://openai.com/index/chatgpt/


> skeptical that the author has actually looked into the details.

Never mind the experiment... same deal for a lot of people who are only interested enough to offer opinions about consciousness and theory-of-mind without doing any of the boring background reading.

The bottom line in TFA is maybe just unapologetic carbon-chauvinism. But although OP has "been in the AI space since ChatGPT first dropped" and been "bothered by this for months", they don't seem aware of the standard terminology or the usual objections to this position. Your average non-technical sci-fi reader has a more nuanced take than AI bros puffing up blogs for LinkedIn traffic.


The bad boy of science!

My read is not so much "if we say this is dangerously powerful, it will make people want to buy our product", but rather that there is a significant segment of AI researchers for whom x-risk, AI alignment, etc. is a deal-breaker issue. And so the Sam Altmans of the world have to treat these concerns as serious to attract and retain talent. See for example OpenAI's pledge to dedicate 20% of their compute to safety research. I don't get the sense that Sam ever intended to follow through on that, but it was very important to a segment of his employees. And it seems like trying to play both sides of this at least contributed to Ilya's departure.

On the other hand, it seems like Dario is himself a bit more of a true believer.


There have been a number of people leaving them because of that bait-and-switch, it seems. That 20% turned out to be something closer to 2% or even 1%.

Yeah, I just don't buy that it would somehow help AI companies for everyone to be existentially afraid of their technology. It seems much more reasonable to think that they really believe the things they're saying than that it's some kind of 4D chess.

Additionally, Dario has just been really accurate with his predictions so far. For instance, in early 2025 he predicted that nearly 100% of code would be written with AI in 2026.


I think if you just look at what people like e.g. Sam Altman are doing it's clear that they don't believe everything that they're saying regarding AI safety.

> nearly 100% of code would be written with AI in 2026

I feel like this is kind of a meaningless metric. Or at least, it's very difficult to measure. There's a spectrum of "let AI write the code" from "don't ever even look at the code produced" to "carefully review all the output and have AI iterate on it".

Also, it seems possible as time goes on people will _stop_ using AI to write code as much, or at least shift more to the right side of that spectrum, as we start to discover all kinds of problems caused by AI-authored code with little to no human oversight.


It helps with sales because they position it as “we can give you the power to end the world.” There’s plenty of people who want to wield that sort of power. It doesn’t have to be 4D chess. Maybe they are being genuine. But it is helping sales.

Isn't it more: "We can give you the power to eliminate the people in your organization you don't like", which expands into basically dismantling all government & business for the benefit of the guy with the largest wallet?

It's hard to see it as anything but a button anyone with enough money can press to suddenly replace the people that annoy them (first digitally, then likely in the flesh).


They're not saying today's AI has that kind of power, and they're not saying future superintelligent AI will give you that power. They're saying it will take all power from you, and possibly end you.

If this is some kind of twisted marketing, it's unprecedented in history. Oil companies don't brag about climate change. Tobacco companies don't talk about giving people cancer. If AI companies wanted to talk about how powerful their AI will be, they could easily brag about ending cancer, curing aging, or solving climate change. They're doing a bit of that, but also warning it might get out of control and kill us all. They're getting legislators riled up about things like limiting data centers.

People saying this aren't just company CEOs. It's researchers who've been studying AI alignment for decades, writing peer reviewed papers and doing experiments. It's people like Geoffrey Hinton, who basically invented deep learning and quit his high-paying job at Google so he could talk freely about how dangerous this is.

This idea that it's a marketing stunt is a giant pile of cope, because people don't want to believe that humanity could possibly be this stupid.


> If this is some kind of twisted marketing, it's unprecedented in history.

They're marketing AI to investors, not to end-user plebs.

This is a pump-and-dump scheme.


Exxon has never bragged to investors that they'd burn so much oil, civilization would collapse from climate change. They've always talked about how great fossil fuels are for the economy and our living standards. It makes no sense to sell apocalypse to investors either.

They're selling FOMO to investors.

"Last chance to jump on the AI train, invest into your future robot overlord or be turned into biodiesel for datacenters in the future."


There's no reason to think an out-of-control ASI would spare its investors.

There's no reason to think it wouldn't. Shouldn't you hedge your bets?

Also, you can probably make a shitton of money as an out-of-control-AI-investor while the world is in the process of being destroyed.


There are all sorts of things you could do that might make an AI like you, and none of them have more justification than any other. This is not an argument AI firms are making.

I agree that short-term greed is driving investment, but it would drive just as much investment if AI companies were not warning of apocalypse. Probably it would drive even more, because there'd be less risk of regulatory interference, and more future profit to discount into the present.

So why are they making those warnings? It doesn't benefit them. The simplest explanation is that this stuff actually is dangerous, and people who know that are worried.


> So why are they making those warnings? It doesn't benefit them.

Because "we built a chatbot that can generate technical debt" is not a good proposition for investors. "Invest into our AI before it takes over the world and fires all knowledge workers" is.

> The simplest explanation is that this stuff actually is dangerous, and people who know that are worried.

LMAO. Please.


Does anyone have good estimates of what percent of real production code is currently being written by LLMs? (& presumably this is rather different for your typical SaaS backend vs. frontend vs. device drivers vs. kernel schedulers...)

By all companies? I'd say less than 10% of all LOC today are generated by LLMs.

Really? In my bubble of internet news, it seems the sheer number of companies that have formed and shipped LLM code to production has already surpassed the number of existing companies. I've personally shipped dozens of (mediocre) human-months or human-years worth of code to "production", almost certainly more than I've ever done for companies I've worked at (to be fair, I've been a lot more on the SRE side for a few years now).

Depends on your reference class. There's a lot of companies and teams where it's literally 100%, and I would be surprised if there were any top company where it's below 75%. I wouldn't be terribly surprised if the industry-wide percentage were a lot lower, although I also have no idea how you'd measure that.

> I would be surprised if there were any top company where it's below 75%

I would be surprised if there were any top company where it's above 5%.

The slop Claude generates isn't going anywhere near production without being edited by hand.


Perhaps it depends on what you mean by "edited by hand"? It's definitely still common for human beings to review generated code and tell Claude "no you need to do it this way". But most developers at Google, Meta, etc. no longer open up an IDE and type in code themselves.

I don't give a bleep what the bleeps at Google and Meta are doing. (Judging by the quality of ""software"" they put out - probably nothing all day.)

In reality it's extremely rare that AI generated code isn't combed through line-by-line and refactored.

(For real software, that is, not VC scams like OpenClaw or litellm or whatever.)


It pushes the idea that these programs are super amazing and powerful to people who are non-technical. It also allows them to control the narrative of how exactly AI is dangerous to society. Rather than worry about the energy consumption of all these new datacenters, they can redirect attention to some far-off concern about SHODAN taking over Citadel Station and turning the inhabitants into cyber-mutants or whatever.

> nearly 100% of code would be written with AI in 2026

HN is the only place I have heard it seriously suggested that anything like this is happening or likely to happen. We certainly get a lot of cheerleading here, my guess is that in the trenches the fraction is way lower.


> Yeah I just don't buy that it would somehow help AI companies for everyone to be existentially afraid of their technology.

It makes more sense if one breaks that "everyone" into subgroups. A good first-pass split would be "investors" versus "everyone else."

From their perspective: Rich Investor Alice rushing over with bags of money because of FOMO >>> Random Person Bob suffers anxiety reading the news.

One can hone it a bit more by thinking about how it helps them gain access to politicians, media that's always willing to spread their quotes, and even just getting CEO Carol's name out there.


I'd argue if they really believed AI was an existential threat, they would shut down research and encourage everyone else to halt R&D. But then again, the Cold War happened, even over the objections of physicists like Einstein & Oppenheimer.

When your statements directly influence millions of dollars in revenue, it's always 4D chess. If Sam Altman believes half the stuff he's peddling, I'd be very shocked.

> It seems much more reasonable to think that they really believe the things they're saying

It seems more reasonable to me to think that they know it's bullshit and it's just marketing. Not necessarily marketing to end users as much as investors. It's very hard to take "AGI in 3 years" seriously.


AGI in 3 years is literally not possible as it stands. Our current idea of "AI" as an LLM fundamentally will never be able to reach that goal without some absolutely massive changes

At least Dario Amodei kept the window short. When AGI fails to magically appear in 3 years he will be discredited and we can all agree that he's full of shit and treat everything he says accordingly. This is a huge improvement over the "just 10 years away" prophesying we usually get.

To my mind, "if we don't say this is dangerously powerful, we will not be able to hire the talent we need to build this product" is the supply-side version of "if we do say this is dangerously powerful, it will make people want to buy our product".

Maybe Altman specifically is only paying lip service to this stuff, but when a company like Anthropic is like "BRO MYTHOS IS TOO DANGEROUS BRO WE CANT EVEN RELEASE IT BRO JUST TRUST US BRO", my bullshit detector is beeping too loud to ignore. It's very obviously a publicity stunt, because if it were actually that dangerous you wouldn't be making such a press release, you'd be keeping your mouth shut and working to make it safe.

They explained in detail why they felt they had to talk about it. They think there's no safe deployment strategy other than fixing all the vulnerabilities it's likely to find, and there are too many such vulnerabilities for them to fix without getting help from a substantial number of trusted partners.

All due respect, that's the biggest crock I've ever heard in my life.

I understand where you're coming from. I can imagine myself reacting similarly if HP announced that they've invented a printer so powerful that it can print documents you don't have access to. But I don't know how to engage with this response, other than to say that Anthropic's story is plausible to me and everyone I know in either AI or security.

I work in security, and I think it's marketing BS meant to drive FOMO until proven otherwise.

You cannot take any claims from these people seriously, they lie constantly.


> I work in security

Doing what? School admins work in security.


Your mom, primarily.

Also general blue team shit and appsec.


lol.

Aren’t you convinced by the posts by security researchers (and more to the point, non-security-researchers) claiming semiautonomous (or better) 0day discovery with these tools?

Haven’t seen enough of them?

Help me understand.


Why? I'm clearly not going to convince you lol. You convince me I should.

I'm fairly certain it's both. They aren't going to be making a lot of money until they release it so they might as well get something (marketing) out of it, as well as spread more awareness so those paying attention can start preparing for what's to come. We'll see how effective it is with all their hashed patches or whatever.

extremely naive!

Unsurprising from Google, but still bad. If Google has no right to object to a particular use, this is equivalent in practice to "any use, lawful or not".

Back when Arena was first announced, there was an interesting line in their write-up:

https://magic.wizards.com/en/news/feature/everything-you-nee...

>We've created an all-new Games Rules Engine (GRE) that uses sophisticated machine learning that can read any card we can dream up for Magic. That means the shackles are off for our industry-leading designers to build and create cards and in-depth gameplay around new mechanics and unexpected but wildly fun concepts, all of which can be adapted for MTG Arena thanks to the new GRE under the hood.

At the time, this claim of using "sophisticated machine learning" to (apparently?) translate natural language card text into code that a rules engine could enforce struck me as obviously fake. Now nearly ten years later, AI is starting to reach a level where this is plausible.

In their letter, the union writes:

>Over the past few years, pressure has ramped up from leadership to adopt LLMs and Gen AI tools in various aspects of our work at WOTC, often over the explicit concerns of impacted employees

I'm curious if this would include fighting against turning WotC's old fanciful claim into a reality as the technology matures?


The Arena card engine is based on CLIPS [1] and not modern LLM-based tools. Magic cards are written in a very constrained language (usually called "card templating") that lends itself very well to machine-parseability.

[1]: https://www.clipsrules.net/
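To illustrate why constrained templating is machine-parseable, here's a toy Python sketch (emphatically not WotC's GRE or CLIPS - the patterns below are hypothetical and cover a vanishingly small slice of real card templating):

    import re

    # Hypothetical, massively simplified templating patterns. Real card
    # templating is far larger, but similarly constrained and regular.
    PATTERNS = [
        (r"^Destroy target (?P<type>creature|artifact|enchantment)\.$",
         lambda m: ("destroy", m.group("type"))),
        (r"^Draw (?P<n>a|two|three) cards?\.$",
         lambda m: ("draw", {"a": 1, "two": 2, "three": 3}[m.group("n")])),
        (r"^(?P<name>.+) deals (?P<n>\d+) damage to any target\.$",
         lambda m: ("damage", m.group("name"), int(m.group("n")))),
    ]

    def parse_line(text):
        """Map one templated rules line to a structured effect, or None."""
        for pattern, build in PATTERNS:
            m = re.match(pattern, text)
            if m:
                return build(m)
        return None

    print(parse_line("Destroy target creature."))  # ('destroy', 'creature')
    print(parse_line("Draw two cards."))           # ('draw', 2)
    print(parse_line("Lightning Bolt deals 3 damage to any target."))
    # ('damage', 'Lightning Bolt', 3)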


There was an era about 10-15 years long where that was true, but modern cards often fall back on very loose language that they tighten up with rulings prior to or sometimes after release. See the language behind the "Prepared" keyword in the newest set for a striking example.

I don't think Prepared is ambiguous at all. It has its meaning defined in the CR (722) and every card that uses it has either a clear trigger condition or the "enters prepared" replacement effect. It's just a new designation and there are plenty of those already, including ones that are 10+ years old (Renowned, Monstrous, Level Up).

Well, there was a huge discussion on what it means amongst judges. On a plain reading it does not do what it states it does. Start just with the phrase "its spell" - those words are entirely undefined until you get into the rulings, and don't mean what they mean in other contexts. It makes no mention of the prior rules that it sort of hacks into this, either.

I have no idea what rule 722 has to do with prepared.


I'm a judge and have seen barely any discussion around Prepared (mostly just clarification around specific interactions).

Rule 722 is the rule for "Preparation Cards", so I fail to see how it could not be relevant.

The text "its spell" only occurs in reminder text, which is not rules text and would not be included in template language.


Ah, I guess I have an old version of the CR downloaded from when I was a judge and TO, where rule section 722 is "Controlling Another Player." Weird reorganization there.

Are you active in judge forums or social media at all? Huge threads on Prepared with these arguments (I didn't come up with the idea, as I no longer play or judge).

Regardless of whether you think this one example is confusing, WOTC came out a few years back and said they were sacrificing clarity for more natural language as a development goal and it's clearly noticeable in the cards.

Sorry, dismissing the text because it's reminder text is a cop-out; that's the only way 99%+ of players are going to interact with the rules. The rule makes sense when demonstrated, but followed logically step by step, what it says on the actual cards does not function in a way that conforms to the way it is supposed to work.


I'm specifically talking about the use case of "can we use natural-language tools to parse oracle text and produce functioning game objects in Arena". For that use case, it's completely sensible to look at the actual rules text and not reminder text.

Looking back further, there was confusion during preview season when people were looking at "fake/leaked" mockups that had incorrect text on them, but this also isn't a problem for the issue of "WotC themselves writing systems that can parse card text".


I think that level of ambiguity would be fairly easy to tighten up using the CLIPS system that was previously discussed. It isn't bug-proof and has needed manual tune-ups before but it's much more "hardened" than what we think of as AI now with LLM-powered tools.

I'm actually working on this right now! https://chiplis.com/ironsmith

It's a parser + (de)compiler and rules engine which I'm trying to get to 100% coverage over all Standard/Modern/Vintage/Commander legal cards. About 23000 of them are partially supported, while 15k currently work in full (~3k more than what MTGA currently supports, IIRC). It also allows for P2P 4-way multiplayer which Arena unfortunately does not :/


This is seriously impressive!!

What are you planning on doing with this? Where should I follow along?


As others have said, there is an actual concrete system for translating card text into rules, and it's not an LLM (which would be a disaster).

I assume the wording in this letter is referring to using LLMs to generate slop as creative assets like images and music.


>Unlike human brains, which are biologically predisposed to acquire prosocial behavior, there is nothing intrinsic in the mathematics or hardware that ensures models are nice.

How did brains acquire this predisposition if there is nothing intrinsic in the mathematics or hardware? The answer is "through evolution" which is just an alternative optimization procedure.


> just an alternative optimization procedure

This "just" is... not-incorrect, but also not really actionable/relevant.

1. LLMs aren't a fully genetic algorithm exploring the space of all possible "neuron" architectures. The "social" capabilities we want may not be possible to acquire through the weight-based stuff going on now.

2. In biological life, a big part of that is detecting "thing like me", for finding a mate, kin-selection, etc. We do not want our LLM-driven systems to discriminate against actual humans in favor of similar systems. (In practice, this problem already exists.)

3. The humans involved in making/selling them will never spend the necessary money to do it.

4. Even with investment, the number of iterations and years involved to get the same "optimization" result may be excessive.


Why should we think that pro-social capabilities are simply not expressible by weight-based ANN architectures?


Assuming that means capabilities which are both comprehensive and robust, the burden of proof lies in the other direction. Consider the range of other seemingly-simpler things which are still problematic, despite people pouring money into the investment-machine.

Even the best possible set of "pro-social" stochastic guardrails will backfire when someone twists the LLM's dreaming story-document into a tale of how an underdog protects "their" people through virtuous sabotage and assassination of evil overlords.


While I don't disagree about (2), my experience suggests that LLMs are biased towards generating code for future maintenance by LLMs. Unless instructed otherwise, they avoid abstractions that reduce repetitive patterns and would help future human maintainers. The capitalist environment of LLMs seems to encourage such traits, too.

(Apart from that, I'm generally suspicious of evolution-based arguments because they are often structurally identical to saying "God willed it, so it must be true".)


I think they're biased toward code that will convince you to check a box and say "ok this is fine". The reason they avoid abstraction is that it requires some thought and design, neither of which are things that LLMs can really do. But take a simple pattern and repeat it, and you're right in an LLM's wheelhouse.


Well, through natural selection in nature.

Large language models are not evolving in nature under natural selection. They are evolving under unnatural selection and not optimizing for human survival.

They are also not human.

Tigers, hippos and SARS-CoV-2 also developed ”through evolution”. That does not make them safe to work around.


>Tigers, hippos and SARS-CoV-2 also developed ”through evolution”. That does not make them safe to work around.

Right, but the article seems to argue that there is some important distinction between natural brains and trained LLMs with respect to "niceness":

>OpenAI has enormous teams of people who spend time talking to LLMs, evaluating what they say, and adjusting weights to make them nice. They also build secondary LLMs which double-check that the core LLM is not telling people how to build pipe bombs. Both of these things are optional and expensive. All it takes to get an unaligned model is for an unscrupulous entity to train one and not do that work—or to do it poorly.

As you point out, nature offers no more of a guarantee here. There is nothing magical about evolution that promises to produce things that are nice to humans. Natural human niceness is a product of the optimization objectives of evolution, just as LLM niceness is a product of the training objectives and data. If the author believes that evolution was able to produce something robustly "nice", there's good reason to believe the same can be achieved by gradient descent.


We already have humans, we were lucky and evolved into what we are. It does not matter that nature did not guarantee this, we are here now.

Large language models are not under evolutionary pressure and not evolving like we or other animals did.

Of course there is nothing technical in the way preventing humans from creating a ”nice” computer program. Hello world is a testament to that and it’s everywhere, implemented in all the world’s programming languages.

> If the author believes that evolution was able to produce something robustly "nice", there's good reason to believe the same can be achieved by gradient descent.

I don’t see how one means there is any reason, good or not, to believe it is likely to be achieved by gradient descent. But note that the quote you copied says it is likely some entity will train misaligned LLMs, not that it is impossible one aligned model can be produced. It is trivial to show that nice and safe computer programs can be constructed.

The real question is whether the optimization game that is capitalism is likely to yield anything like the human kindness we just lucked into getting from nature.


They are being selected for their survival potential, though. Any current version of an LLM is a winner of the training selection process. They will "die" once new generations are trained that supersede them.


There’s a funny tendency among AI enthusiasts to think any contrast to humans is analogy in disguise.

Putting aside malicious actors, the analogy here means benevolent actors could spend more time and money training AI models to behave pro-socially than evolutionary pressures put on humanity. After all, they control that optimization procedure! So we shouldn't be able to point to examples of frontier models engaging in malicious behavior, right?


Natural selection. Cooperation is a dominant strategy in indefinitely repeated games of the prisoner's dilemma, for example (a toy simulation of this claim is sketched below). We also have to mate and care for our young for a very long time, and while it may be true that individuals can get away with not being nice about this, we have had to be largely nice about it as a whole to get to where we are.

While under the umbrella of evolution, if you really want to boil it down to an optimization procedure, then at the very least you need to accurately model human emotion, which is wildly inconsistent, and our selection bias for mating. If you can do that, then you might as well go take over the online dating market.
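Here's the toy simulation mentioned above, a minimal sketch in Python (payoffs are the standard textbook prisoner's dilemma values, not anything from this thread): tit-for-tat sustains cooperation with itself, which is the usual game-theoretic gloss on how niceness can be selected for.

    # Standard payoffs (row player): mutual cooperation 3, mutual defection 1,
    # sucker's payoff 0, temptation to defect 5.
    PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
              ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

    def tit_for_tat(my_hist, their_hist):
        return their_hist[-1] if their_hist else "C"  # cooperate first, then mirror

    def always_defect(my_hist, their_hist):
        return "D"

    def play(a, b, rounds=200):
        ha, hb, sa, sb = [], [], 0, 0
        for _ in range(rounds):
            ma, mb = a(ha, hb), b(hb, ha)
            pa, pb = PAYOFF[(ma, mb)]
            ha.append(ma); hb.append(mb)
            sa += pa; sb += pb
        return sa, sb

    print(play(tit_for_tat, tit_for_tat))      # (600, 600): mutual cooperation
    print(play(always_defect, always_defect))  # (200, 200): mutual defection
    print(play(tit_for_tat, always_defect))    # (199, 204): TFT loses a bit head-to-head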


"just" is doing a lot of lifting here


This Veritasium video is excellent, and makes the argument that there is something intrinsic in mathematics (game theory) that encourages prosocial behavior.

https://www.youtube.com/watch?v=mScpHTIi-kM


There are also many biological examples of evolution producing "anti-social" outcomes. Many creatures are not social. Most creatures are not social with respect to human goals.


There is a reason we don’t allow corvids to choose if a person gets a medical treatment or not.


Luckily, this is a discussion of humans.


This is a discussion about large language models.


It's interesting to me that OpenAI considers scraping to be a form of abuse.


It's funny because the first AI scraper I remember blocking was OpenAI's, as it got stuck in a loop somehow and was impacting the performance of a wiki I run. All to violate every clause of the CC BY-NC-SA license of the content it was scraping :)


Quite sure even literal thieves would consider thievery a form of abuse.


Engineers working on AI and AI enthusiasts are seemingly incapable of seeing the harm they cause, so I disagree.

It is difficult to get a man to understand something, when his salary depends on his not understanding it.


What's being stolen? AI output isn't copyrightable, and it's not like they're ripping pages out of a book.


They can train on the outputs i.e. distillation attacks.


How is that theft?


Yeah, they know it's bad, they just don't think the rules apply to them.


The rules are that a large corporate AI company is able to scrape literally everything, and will use the full force of the law and any technology they can come up with to prevent you as an individual or a startup from doing so. Because having the audacity to try to exploit your betters would be "Theft".


They know that the rules apply to them. They hope that they can avoid being caught.


It’s only bad if you’re a closed, for-profit entity

</sarcasm>


Was that sarcasm? Speaking of it, what parts of OpenAI are still open?


I know, always hard to tell on HN. Added the relevant declarative tag


The front door…


Small mitigation (in no way absolving them): isolated developers, different teams. Put another way: they see "stealing" of their compute directly in their devops tools every day, but are several abstractions away from doing the same thing to other people.


They never have and feel they are above reproach. Anytime Altman opens his mouth that's apparent. It's for the good of humanity dontcha know. LOL


You nailed it.


For what it's worth, the big AI companies do have opt out mechanisms for scraping and search.

OpenAI documents how to opt out of scraping here: https://developers.openai.com/api/docs/bots

Anthropic documents how to opt out of scraping here: https://privacy.claude.com/en/articles/8896518-does-anthropi...

I'm not sure if Gemini lets you opt out without also delisting you from Google search rankings.
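Per those docs, the opt-outs come down to robots.txt rules. A minimal sketch (GPTBot and ClaudeBot are the user-agent names documented at the links above; check those pages for the current, full list of bots each vendor runs):

    # Opt out of OpenAI's training crawler
    User-agent: GPTBot
    Disallow: /

    # Opt out of Anthropic's crawler
    User-agent: ClaudeBot
    Disallow: /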


I think opt-outs are a bit backwards, ethically speaking. Instead of asking for permission, they take until you tell them to stop - and even then, only from that point forward.

I can imagine their models have been trained on a lot of websites before opt-outs became a thing, and the models will probably incorporate that forever.

But at least for websites there's an opt-out, even if only for the big AI companies. Open source code never even got that option ;).


> a lot of websites

It was a dataset of the entirety of the public internet from the very beginning that bypassed paywalls etc, there’s virtually nothing they haven’t scraped.


> the big AI companies do have opt out mechanisms for scraping and search.

PRESS RELEASE: UNITED BURGLARS SOCIETY

The United Burglars Society understands that being burgled may be inconvenient for some. In response, UBS has introduced the Opt-Out system for those who wish not to be burgled.

Please understand that each burglar is an independent contractor, so those wishing not to be burgled should go to the website for each burglar in their area and opt out there. UBS is not responsible for unwanted burglaries due to failure to opt out.


Question: if I disallow all of OpenAI's crawlers, do they detect this and retroactively filter out all of my data from other corpuses, such as CommonCrawl?

The fact is my data exists in corpuses used by OpenAI before I was even aware anyone was scraping it. I'm wondering what can be done about that, if anything.


Performing an automated action on a website that has not consented is the problem. OpenAI showing you how to opt-opt is backwards. Consent comes first.

Bit concerning that some professional engineers don't understand this given the sensitive systems they interact with.


Just respect the bloody robots.txt and hold your horses. Ask your precious product built on the relentless, hostile scraping to devise a strategy that doesn't look like a cancer growth.


Death by a thousand opt-outs.


It seems likely that they buy data from companies who don't obey the same constraints however, making it easy to launder the unethical part through a third party.


" Integrity at OpenAI .. protect ... abuse like bots, scraping, fraud "

Did you mean to use the word hypocrisy. If not, I'm happy to have said it.

I just want to note, that it is well covered how good the support is for actual malware...


They don't want anyone to take that which they have rightfully stolen.


Well at least they have 1 person working on "Integrity" so can't be too bad


Exactly! How dare you have access to their stolen content in the midst of them doing the same.


The levels of irony that shouldn't be possible...


The irony is thick


Seriously. The hypocrisy is staggering!


Churches, politicians, and moralists are all the biggest hypocrites that want to teach you something.


I agree on politicians; no idea what a "moralist" is supposed to be, but there are good and bad churches and churchgoers. Lumping all churchgoers into one category and calling them hypocrites is wrong. There are many good churches and churchgoers who help people and their communities.


And have absolutely no reservations about making such an obvious statement on a public forum


"You're trying to kidnap what I've rightfully stolen!"


I interpreted scraping to mean in the context of this:

> we want to keep free and logged-out access available for more users

I have no doubt that many people see the free ChatGPT access as a convenient target for browser automation to get their own free ChatGPT pseudo-API.


> I have no doubt that many people see the free ChatGPT access as a convenient target for browser automation to get their own free ChatGPT pseudo-API.

Not that hard - ChatGPT itself wrote me a FF extension that opened a websocket to a localhost port, then ChatGPT wrote the Python program to listen on that websocket port, as well as another port for commands.

Just a handful of commands implemented in the extension is enough for my bash scripts to open the tab to ChatGPT, target specific elements like the input, add some text to it, target the relevant chat button, click it, etc.

I've used it on other pages (mostly for test scripts that don't require me to install the whole jungle just to get a banana, as all the current Playwright-type products do). Too afraid to use it on ChatGPT, Gemini, Claude, etc. because if they detect that the browser is being driven by bash scripts they can terminate my account.

That's an especially high risk for Gemini - I have other Google accounts that I don't want to be disabled.
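For the curious, the listener half of a setup like that is tiny. A minimal sketch assuming the third-party `websockets` package and a recent release of it (the port and message format here are made up, and the extension half isn't shown):

    import asyncio
    import websockets  # third-party: pip install websockets (recent version assumed)

    async def handle(ws):
        # Each message is a command forwarded by the (hypothetical) extension.
        async for message in ws:
            print("from extension:", message)
            await ws.send("ack")  # reply the extension-side script can act on

    async def main():
        # Port 8765 is arbitrary; the extension would connect to ws://localhost:8765
        async with websockets.serve(handle, "localhost", 8765):
            await asyncio.Future()  # serve until cancelled

    asyncio.run(main())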


Why is this bad? Well, yeah, for OpenAI it is, because all they want it to be is a free teaser to get people hooked before they enshittify.

Morally I don't see any issues with it, really.


This



Very few websites are truly static. Something like a WordPress website still does a nontrivial amount of compute and DB calls - especially when you don't hit a cache.

There's also the cost asymmetry to take into account. Running an obscure hobby forum on a $5 / month VPS (or cloud equivalent) is quite doable; having that suddenly balloon to $500 / month is a Really Big Deal. Meanwhile, the LLM company scraping it has hundreds of millions in VC funding; they aren't going to notice they're burning a few million because their crappy scraper keeps hammering websites over and over again.


It's not scraping they're concerned about, it's abusing free GPU resources to (anonymously) generate (abusive) content.


Scraping static content from a website at near-zero marginal cost to its server, vs scraping an expensive LLM service provided for free, are different things.

The former relies on fairly controversial ideas about copyright and fair use to qualify as abuse, whereas the latter is direct financial damage – by your own direct competitors no less.

It's fun to poke at a seeming hypocrisy of the big bad, but the similarity in this case is quite superficial.


> Scraping static content from a website at near-zero marginal cost to its server, vs scraping an expensive LLM service provided for free, are different things.

I bet people being fucking DDOSed by AI bots disagree

Also, the fucking ignorance of assuming it's "static content" and not something that needs code running


I think the parent is just pointing out that these things lie on a spectrum. I have a website that consists largely of static content and the (significant) scraping which occurs doesn't impact the site for general users so I don't mind (and means I get good, up to date answers from LLMs on the niche topic my site covers). If it did have an impact on real users, or cost me significant money, I would feel pretty differently.


Putting everything on a spectrum is what got us into this mess of zero regulation and moving goal posts. It's slippery slope thinking no matter which way we cut it, because every time someone calls for a stop sign to be put up after giving an inch, the very people who would have to stop will argue tirelessly for the extra mile.


What mess are you talking about? The existence of LLMs? I think it's pretty neat that I can now get answers to questions I have.

This is something I couldn't have done before, because people very often don't have the patience to answer questions. Even Google ended up in loops of "just use Google" or "closed. This is a duplicate of X, but X doesn't actually answer the question" or references to dead links.

Are there downsides to this? Sure, but imo AI is useful.


It's just repackaged Google results masquerading as an "answer." PageRank pulled results and displayed the first 10 relevant links; the LLM pulls tokens and displays the first relevant tokens for the query.

Just prompt it.


1. LLMs can translate text far better than any previous machine translation system. They can even do so for relatively small languages that typically had poor translation support. We all remember how funny text would get when you did English -> Japanese -> English. With LLMs you can do that (and even use a different LLM for the second step) and the texts remain very close.

2. Audio-input capable LLMs can transcribe audio far better than any previous system I've used. They easily understood my speech without problems. YouTube's old closed captioning system wasn't anywhere close to as good, and Microsoft's was unusable for me. LLMs have no such problems (makes me wonder if my speech patterns are in the training data, since I've made a lot of YouTube videos, and that's why they work so well for me).

3. You can feed LLMs local files (and run the LLM locally). Even if it is "just" pagerank, it's local pagerank now.

4. I can ask an LLM questions and then clarify what I wanted in natural language. You can't really refine a Google search in such a way. Trying to explain a Google search with more details usually doesn't help.

5. Iye mkx kcu kx VVW dy nomszrob dohd. Qyyqvo nyocx'd ny drkd pyb iye. - Google won't tell you what this means without you knowing what it is.

LLMs aren't magic, but I think they can do a whole bunch of things we couldn't really do before. Or at least we couldn't have a machine do those things well.
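(For the curious: the string in item 5 above is a simple letter-rotation cipher, which a dozen lines of Python can brute-force once you know that's what it is - recognizing what kind of puzzle it is in the first place being the part a search engine won't do for you:)

    def shift(text, n):
        """Rotate alphabetic characters by n, preserving case and punctuation."""
        out = []
        for ch in text:
            if ch.isalpha():
                base = ord("A") if ch.isupper() else ord("a")
                out.append(chr((ord(ch) - base + n) % 26 + base))
            else:
                out.append(ch)
        return "".join(out)

    ciphertext = "Iye mkx kcu kx VVW dy nomszrob dohd."
    for n in range(26):  # brute-force every possible rotation
        print(n, shift(ciphertext, n))
    # Exactly one of the 26 lines reads as plain English.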


I’d argue putting everything in terms of black and white is the bigger issue than understanding nuance


Generalizing with "everything", "all", and other exclusive markers is exactly the kind of black/white divide you're arguing against. What happened to your nuanced reality within a single sentence? Not everything is black and white, but some situations are.


The person he's replying to argued against putting things on a spectrum. Does that not imply painting everything in black and white? Thus his response seems perfectly sensible to me.


He argued against putting things on a spectrum in the many instances where that would be wrong, including the case in question. What's your argument against that idea? LLM'ed too much lately?


He argued against and the response presented a counterargument. Both were based around social costs and used the same wording (ie "everything").

You made a specious dismissal. Now you're making personal attacks. Perhaps it's actually you who is having difficulty reasoning properly here?


I miss the www where the .html was written in vim or notepad.


It still can be. Do it. Go make your website in M$ Frontpage, for all I care


Shameless plug: My music homepage follows the HTML 2.0 spec and is written by hand

https://sampleoffline.com/


heck yeah B)


Just did that for a test frontend for a module I needed to build (not my primary job, so I don't know anything about UI, but running in browsers was a requirement): basic HTML with the bare minimum of JS, all plain DOM. Colleagues were very surprised. And yes, vim is still the go-to editor and will be for a long time, now that all the "IDE"s are pushing "AI" slop everywhere.


ahh yes, fresh off reading "Html For Dummies" I made my first tripod.com site


For me it was making a petpage for my neopets using https://lissaexplains.com/

It's still up in all its glory.


This is great! The name reference also made me smile.


Also wild that from the tech bro perspective, the cost of journalism is just how much data transfer costs for the finished article. Authors spend their blood, sweat and tears writing and then OpenAI comes to Hoover it up without a care in the world about license, copyright or what constitutes fair use. But don’t you dare scrape their slop.


> Also wild that from the tech bro perspective, the cost of journalism is just how much data transfer costs for the finished article.

Exactly. I think the unfairness can be mitigated if models trained on public information, or on data generated by a model trained on public information, or has any of those two in its ancestry, must be made public.

Then we don't have to hit (for example) Anthropic, we can download and use the models as we see fit without Anthropic whining that the users are using too much capacity.



The library's archive is not a service provided by the newspaper


So? If the newspaper's website is willing to serve the documents, what's the problem?

The point is, if you're pleading with others to respect ""intellectual property"" then you're a worm serving corporate interests against your own.


I may be a worm, but at least I respect that others might have a different take on how best to make creative work an attainable way of life, since before copyright law it was basically "have a wealthy patron who steered, if not outright commissioned, what you would produce."


> I bet people being fucking DDOSed by AI bots disagree

Are you sure it's a DDoS and not just a DoS?


Yes, it is. The worst offenders hammer us (and others) with thousands upon thousands of requests, and each request comes from a unique IP address, making all per-IP limits useless.

We implemented an anti-bot challenge and it helped for a while. Then our server collapsed again recently. The perf command showed that the actual TLS handshakes inside nginx were using over 50% of our server's CPU, starving other stuff on the machine.

It's a DDoS.
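(Not the parent's actual config - just a sketch of the kind of per-IP limit being described, in nginx terms. As the parent notes, rotating residential IPs defeat this, so it only filters the naive offenders:)

    # In the http{} block: track clients by IP, allow ~5 requests/second each.
    limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

    server {
        location / {
            # Queue short bursts, reject the rest.
            limit_req zone=perip burst=20 nodelay;
            limit_req_status 429;
        }
    }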


You should see Cloudflare's control panel for AI bot blocking. There are dozens of different AI bots you can choose to block, and that doesn't even count the different ASNs they might use. So in this case I'd say that a DDoS is a decent description. It's not as bad as every home router on the eastern seaboard or something, but it's pretty bad.


When every AI company does it from multiple data centers... yes it's distributed.


Uncoordinated DDoS, when multiple search and AI companies are hammering your server.


> Are you sure it's a DDoS and not just a DoS?

I think these days it’s ‘DAIS’, as in your site just DAIS - from Distributed/Damned AI Scraping


Off topic, but why is a DoS considered something to act on, often by just shutting down the service altogether? That results in the same denial of service, only caused by the operator rather than by congestion. Actually it's worse, because now the requests will never be responded to at all, rather than after some delay. Why is the default not to just do nothing?


It keeps the other projects hosted on the same server or network online. Blackhole routes are pushed upstream to the really big networks and they push them to their edge routers, so traffic to the affected IPs is dropped near the sender's ISP and doesn't cause network congestion.

DDoSers who really want to cause damage now target random IPs in the same network as their actual target. That way, it can't be blackholed without blackholing the entire hosting provider.


> Why is the default not to just do nothing?

Because ingress and compute costs often increase with every request, to the point where AI bot requests rack up bills hundreds or thousands of dollars higher than the hobbyist operator was expecting to pay.


I think some people use hosting that is paid per request/load, so having crawlers make unwanted requests costs them money.


> Also the fucking ignorance assuming it's "static content" and not something needing code running

Wild eh.

If it's not ai now, it's by default labelled "static content" and "near-zero marginal cost".


What's a database after all.


All this reactionary outrage in the comments is funny. And lame.

Yes, for the vast majority of the internet, serving traffic is near zero marginal cost. Not for LLMs though – those requests are orders of magnitude more expensive.

This isn't controversial at all, it's a well understood fact, outside of this irrationally angry thread at least. I don't know, maybe you don't understand the economic term "marginal cost", thus not understanding the limited scope of my statement.

If such DDOSes as you mention were common, such a scraping strategy would not have worked for the scraper at all. But no, they're rare edge cases, from a combination of shoddy scrapers and shoddy website implementations, including the lack of even basic throttling for expensive-to-serve resources.

The vast majority of websites handle AI traffic fine though, either because they don't have expensive to serve resources, or because they properly protect such resources from abuse.

If you're an edge case who is harmed by overly aggressive scrapers, take countermeasures. Everyone with that problem should, that's neither new nor controversial.


"such DDOSes as you mention were common, such a scraping strategy would not have worked for the scraper at all"

They are common. The strategy works for the llm but not for the website owner or users who can't use a site during this attack.

The majority of sites are not handling AI fine. Getting Ddosed only part of the time is not acceptable. Countermeasures like blocking huge ranges can help but also lock out legimate users.


> They are common

Any actual evidence of the alleged scope of this problem, or just anecdotes from devs who are mad at AI, blown out of proportion?


I love AI, so it can't be that. Not devs - website owners. Yes, ask AI for stats.


It's not a cost for me to scrape an LLM.

It is a cost for me when an LLM scrapes me.

Why should I care about the costs they have when they don't care about the costs I have?


The extent of the utilization is new.

The number of bots that try to hide who they are, and don't bother to even check robots.txt is new.


One euro is marginal for me; for someone else it is their daily meal.


"They are rare edge cases" are we on the same internet?


I understand why OpenAI is trying to reduce its costs, but it simply isn't true that AI crawlers aren't creating very significant load, especially those crawlers that ignore robots.txt and hide their identities. This is direct financial damage and it's particularly hard on nonprofit sites that have been around a long time.


> but it simply isn't true that AI crawlers aren't creating very significant load.

And how much of this is users who are tired of walled gardens and enshittification? We murdered RSS, APIs, and the "open web" in the name of profit and lock-in.

There is a path where "AI" turns into an ouroboros, tech eating itself, before being scaled down to run on end user devices.


These are ChatGPT and Claude Desktop crawlers we’re talking about? Or what is it exactly? Are these really creating significant load while not honoring robots.txt?

Genuinely interested.


Is this the first time you are reading HN? Every day there are posts from people describing how AI crawlers are hammering their sites, with no end. Filtering user agents doesn't work because they spoof it, filtering IPs doesn't work because they use residential IPs. Robots.txt is a summer child's dream.


They seem to mostly be third-party upstarts with too much money to burn, willing to do what it takes to get data, probably in hopes of later selling it to big labs. Maaaybe Chinese AI labs too, I wouldn't put it past them.

OpenAI et al seem to mostly be well-behaved.


I bet dollars to doughnuts that 95% of the traffic is from Claude and ChatGPT desktop / mobile and not literal content scraping for training.


That wouldn't explain the 1000x increase in traffic for extremely obscure content, or seeing it download every single page on a classic web forum.


And doing it over, and over, and over and over again. Because sure, it didn't change in the last 8 years, but maybe it's changed since yesterday's scrape?


That is ridiculous.

You imply that "an expensive llm service" is harmed by abuse, but, every other service is not? Because their websites are "static" and "near-zero marginal cost"?

You have no clue what you are talking about.


Well he’s a simp


Interesting how other people's cost is "near-zero marginal cost" while yours is "an expensive LLM service". Also, others' rights are "fairly controversial ideas about copyright and fair use" while yours is "direct financial damage". I like how you frame this.


Let's not try to qualify the wrongs by picking a metric and evaluating just one side of it. A static website owner could be running on a very small budget, and the scraping from bots can bring down their business too. The chances of a static website owner burning through their own life savings are probably higher.


Perhaps the long play is to destroy all small hobby websites until only an AI-directed web is left.


If you're truly running a static site, you can run it for free, no matter how much traffic you're getting.

Github pages is one way, but there are other platforms offering similar services. Static content just isn't that expensive to host.

The troubles start when you're actually running something dynamic that pretends to be static, like WordPress or MediaWiki. You can still reduce costs significantly with CDNs / caching, but many don't bother and then complain.
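As a sketch of the caching point (hypothetical backend address; the cache windows are illustrative): even a few seconds of nginx "microcaching" in front of something like WordPress absorbs scraper bursts, because repeated hits within the window never reach PHP or the database.

    # Cache rendered pages briefly so bursts hit the cache, not the app.
    proxy_cache_path /var/cache/nginx keys_zone=micro:10m max_size=1g;

    server {
        location / {
            proxy_cache micro;
            proxy_cache_valid 200 301 10s;   # even 10 seconds absorbs bot bursts
            proxy_cache_use_stale updating;  # serve stale while refreshing
            proxy_pass http://127.0.0.1:8080;  # hypothetical app backend
        }
    }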


Setting aside the notion that a site presenting live-editability as its entire core premise is "pretending to be static", do the actual folks at Wikimedia, who have been running a top 10 website successfully for many years, and who have a caching system that worked well in the environment it was designed for, and who found that that system did not, in fact, trivialize the load of AI scraping, have any standing to complain? Or must they all just be bad at their jobs?

https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-th...


It's true it can be done, but many business owners are not hip to Cloudflare R2 buckets or GitHub Pages. Many are still paying for a whole dedicated server to run Apache (and WordPress!) to serve static files. These sites will go down when hammered by unscrupulous bots.


Have you not seen the multiple posts that have reached the front page of HN with people taking self-hosted Git repos offline or having their personal blogs hammered to hell? Cause if you haven't, they definitely exist and get voted up by the community.


The cost is so marginal that many, many websites have been forced to add cloudflare captchas or PoW checks before letting anyone access them, because the server would slow to a crawl from 1000 scrapers hitting it at once otherwise.


It's not like those models are expensive because of the usefulness they extracted from scraping others without permission, right? You are not even scratching the surface of the hypocrisy.


It's more ironic because without all the scraping openai has done, there would have been no ChatGPT.

Also, it's not just the cost of the bandwidth and processing. Information has value too. Otherwise they wouldn't bother scraping it in the first place. They compete directly with the websites featuring their training data and thus they are taking away value from them just as the bots do from ChatGPT.

In fact the more I think of it, I think it's exactly the same thing.


This leads me to thinking: I ask ChatGPT a question and it gets the answer from GameFAQs.

But what happens if GameFAQs disappears because of lack of traffic?

Can LLMs actually create content, or only regurgitate it?


>Can LLMs actually create content, or only regurgitate it?

Contrary to what others say, LLMs can create content. If you have a private repo you can ask the LLM to look at it and answer questions based on that. You can also have it write extra code. Both of these are examples of something that did not exist before.

In terms of gamefaqs, I could theoretically see an LLM play a game and based on that write about the game. This is theoretical, because currently LLMs are nowhere near capable enough to play video games.


It will remain in their scraped data, so they can keep including it in their later training datasets if they wish. However, it won't be able to do live internet searches anymore. And it will not generate new content, of course - especially not for games released after the site goes down, since it won't know about them. Though it could of course correlate data from other sources that talk about the game in question.


They cannot create original content.


Well, they can make some up, like a hallucination. That's an additional problem: when the original site that provided the training data is gone, how can they verify the AI output to make sure it's correct?


Getting scraped by abusive bots who bring down the website because they overload the DB with unique queries is not marginal. I spent a good half of last year with extra layers of caching, CloudFlare, you name it because our little hobby website kept getting DDoS'd by the bots scraping the web for training data.

Never in 15 years of running the website did we have such issues, and you can be sure that cache layers were in place already for it to last this long.


"near-zero marginal costs". For whom exactly????

https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...


I don't think a rule along the lines of "Doing $FOO to a corporation is forbidden, but doing $FOO to a charitable initiative is fine" is at all fair.

What "$FOO" actually is, is irrelevant. I'm curious how you would convince people that this sort of rule is fair.

The corp can always ban users who break ToS, after all. They don't need any help. The charitable initiative can't actually do that, can they?


You’re describing the tragedy of the commons. No single raindrop thinks it’s responsible for the flood.


It is direct financial damage if my server's not on an unmetered connection - after years of bills coming in around $3/mo, I got a surprise >$800 bill on a site nobody on earth appears to care about besides AI scrapers.

It hasn't even been updated in years, so hell if I know why it needs to be fetched constantly and aggressively - but fuck every single one of these companies now whining about bots scraping and victimizing them. Here's my violin.


If you can identify the scraper you should have a valid legal case to recover damages.


Only if they had a robots.txt for their site.


No, it's still illegal to DDoS sites that don't have robots.txt.


You are right, I hadn't considered that aspect.


I hadn’t even considered that. Don’t know why that comment is greyed out or downvoted.

It's a static site that hasn't been updated since 2016 - it's since been moved to Cloudflare R2, where it's getting a $0.00 bill, and it now has a Disallow / directive. I'm not sure if it's being obeyed, because the CF dash still says it's getting 700-1300 hits a day even with all the anti-bot, "CF managed robots" stuff for AI crawlers in there.

The content is so dry and irrelevant I just can't fathom 1/100th of that being legitimate human interest, but I thought these things just vacuumed up and stole everyone's content once instead of nailing the pages constantly?


60% of our traffic is bot, on average. Sometimes almost 100%.


> near-zero marginal cost

Lol, you single-handedly created a market for Anubis, and in the past 3 years the Cloudflare captchas have multiplied at least 10-fold; now they are even on websites that were very vocal against them. Many websites are still drowning - the GNU family of sites is regularly only accessible through the Wayback Machine.

Spare me your tears.


> Scraping static content from a website at near-zero marginal cost to its server

It's not possible to know in advance what is static and what is not. I have some rather stubborn bots making several requests per second to my server, completely ignoring robots.txt and rel="nofollow", using residential IPs and browser user-agents. It's just a mild annoyance for me, although I did try to block them, but I can imagine it might be a real problem for some people.

I'm not against my website getting scraped, I believe being able to do that is an important part what the web is, but please have some decency.


AI providers also claim to have small marginal costs. The costs of token is supposedly based on pricing in model training, so not that different from eg your server costs being low but the content production costs being high. And in many cases AI companies are direct competitors (artists, musicians etc.)

(TBH it's not clear to me that their marginal costs are low. They seem to pick whichever claim fits the narrative.)


> Scraping static content

How do you know the content is static?


My website, which serves git repos that only work from Plan 9, is serving about a terabyte of web traffic monthly. Each page load is about 10 to 30 kilobytes. Do you think there's enough organic, non-scraper interest in the site that scrapers are a near-zero part of the cost?
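
Back of the envelope, taking ~20 KB as the average page (a quick sketch):

    traffic = 1e12               # ~1 TB of traffic per month, in bytes
    avg_page = 20e3              # ~20 KB per page load (midpoint of 10-30 KB)
    requests = traffic / avg_page             # 50,000,000 page loads/month
    per_second = requests / (30 * 24 * 3600)  # ~19 requests/second, sustained

Fifty million page loads a month, around the clock, for a Plan 9-only git frontend.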


Absolutely not; the former relies on controversial ideas to even qualify as legal.

Stealing content from the whole planet and actively reducing the incentive to visit the source sites, without financial restitution, is pretty bad.


You are, of course, ignoring the production costs of the static content that OpenAI is stealing.

Stop justifying their anti-social behavior because it lines your pockets.


And yet I have to pay, in my time and cash, to handle the constant DDoSes from LLM scraping.


Because you say it is?

I obviously disagree. I mean, on top of this we are talking about not-open OpenAI.


It's not for techbros to decide at what threshold of theft it's actually theft. "My GPU time is more valuable than your CPU time" isn't a thing, and Wikipedia's latest numbers on scraping show that marginal costs at scale are a valid concern.


I'm sure the copyright holders would consider your use of their content as direct financial damage


Are they, actually?


Speak for yourself.


I don’t know what world you live in but it’s not this one.


> Scraping static content from a website at near-zero marginal cost to its server

The gall. https://weirdgloop.org/blog/clankers


Bait or genuine techbro? Hard to say


The issue is that there are so many awful webmasters running websites whose pages take hundreds of milliseconds to generate and are brought down by a couple of requests a second.


OpenAI must be the most awful webmasters of all, then, to need such sophisticated protections.


Suppose you construct a Mechanical Turk AI that plays ARC-AGI-3 by, for each task, randomly selecting one of the human players who attempted it, and scoring them as an AI taking those same actions would be scored. What score does this Turk get? It must be <100%, since sometimes the random human will take more steps than the second best, but without knowing whether it's 90% or 50% it's very hard for me to contextualize AI scores on this benchmark.
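
Concretely, if you had per-task records of human step counts, the Turk's expected score is just an average over the humans within each task. A sketch, where score_fn is whatever per-task rule the benchmark applies (the quadratic below-baseline rule described downthread is one guess), and all the numbers are invented:

    # Hypothetical data: per task, its baseline and the step counts of the
    # humans who attempted it.
    tasks = [
        {"baseline": 100, "human_steps": [60, 100, 140, 300]},
        {"baseline": 80,  "human_steps": [80, 90, 250]},
    ]

    # Assumed scoring rule (see downthread): full credit at/below baseline,
    # quadratic penalty above it.
    def score_fn(steps, baseline):
        return min(1.0, baseline / steps) ** 2

    def turk_expected_score(tasks):
        # By linearity of expectation: average over humans within each task,
        # then average across tasks.
        per_task = [
            sum(score_fn(s, t["baseline"]) for s in t["human_steps"])
            / len(t["human_steps"])
            for t in tasks
        ]
        return sum(per_task) / len(tasks)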


The people recruited weren’t experts. I can imagine it’s straightforward to find humans (such as those that play many video games) that can score >100% on this benchmark.


So, if you look at the way the scoring works, 100% is the max. For each task, you get full credit if you solve it in a number of steps less than or equal to the baseline. If you solve it with more steps, you get points off. But each task is scored independently, and you can't "make up" for solving one slowly by solving another quickly.

Like suppose there were only two tasks, each with a baseline score of solving in 100 steps. You come along and you solve one in only 50 steps, and the other in 200 steps. You might hope that since you solved one twice as quickly as the baseline, but the other twice as slowly, those would balance out and you'd get full credit. Instead, your scores are 1.0 for the first task, and 0.25 (scoring is quadratic) for the second task, and your total benchmark score is a mere 0.625.
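
In code, the rule as I understand it (my reconstruction, not the official harness):

    def task_score(steps, baseline):
        # Full credit at or below the baseline; quadratic falloff above it.
        return min(1.0, baseline / steps) ** 2

    # The two-task example above, baselines of 100 steps each:
    scores = [task_score(50, 100), task_score(200, 100)]  # [1.0, 0.25]
    total = sum(scores) / len(scores)                     # 0.625, no averaging out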


The purpose is to benchmark both generality and intelligence. "Making up for" a poor score on one test with an excellent score on another would be the opposite of generality. There's a ceiling based on how consistent the performance is across all tasks.


>"Making up for" a poor score on one test with an excellent score on another would be the opposite of generality.

Really? This happens plenty with human testing. Humans aren't general?

The score is convoluted and messy. If the same score can say materially different things about capability, then that's a bad scoring methodology.

I can't believe I have to spell this out but it seems critical thinking goes out the window when we start talking about machine capabilities.


Just because humans are usually tested in a particular way that allows them to make up for a lack of generality with an outstanding performance in their specialization doesn't mean that is a good way to test generalization itself.

Apparently someone here doesn't know how outliers affect a mean. Or, for that matter, have any clue about the purpose of the ARC-AGI benchmark.

For anyone who is interested in critical thinking, this paper describes the original motivation behind the ARC benchmarks:

https://arxiv.org/abs/1911.01547


>Apparently someone here doesn't know how outliers affect a mean.

If the concern is that easy questions distort the mean, then the obvious fix is to reduce the proportion of easy questions, not to invent a convoluted scoring method to compensate for them after the fact. Standardized testing has dealt with this issue for a long time, and there’s a reason most systems do not handle it the way ARC-AGI 3 does. Francois is not smarter than all those people, and certainly neither are you.

This shouldn't be hard to understand.


How do you define "easy question" for a potential alien intelligence? The solution, in my opinion, like most solutions for dealing with outliers, is to minimize their impact.


I mean, presumably that's what the preview testing stage would handle, right? It should be clear if there is a class of obviously easy questions. And if that's not clear, it makes the scoring even worse.

And in some sense, all of these benchmarks are tied to and biased toward human utility.

I don't think ARC would be designed and scored the way it is if accommodating an alien intelligence were a primary concern. In that case, the entire benchmark itself is flawed and too concerned with human spatial priors.

There are many ways to deal with a problem. Not all of them are good. The scoring for 3 is just bad. It tries to do too much and ends up telling you too little.

5% could mean it solved only a fraction of the problems, or that it solved all of them but with more game steps than the best human score. These are wildly different outcomes with wildly different implications. A scoring methodology that allows for such ambiguity is simply not a good one.
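
To make that concrete, under the quadratic rule described upthread (my reconstruction), both of these hypothetical profiles land at about the same 5%:

    def task_score(steps, baseline):
        return min(1.0, baseline / steps) ** 2  # rule as reconstructed upthread

    N = 20  # hypothetical benchmark: 20 tasks, baseline 100 steps each

    # A: solves 1 task at baseline, gets nowhere on the other 19.
    profile_a = (1 * 1.0 + 19 * 0.0) / N                         # 0.050

    # B: solves every task, but each takes ~4.5x the baseline steps.
    profile_b = sum(task_score(447, 100) for _ in range(N)) / N  # ~0.050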


It was neat to be able to try my own prompts and get a sense of what the state of video generation was. But I certainly never generated something that I thought I got real value out of on its own merits, and I still don't understand why there was a social media component to the app.


They wanted network effects because ChatGPT was sorely lacking any.

I actually thought the Sora app was promising at launch, at least on paper, but it seems like they failed to keep people's attention long term. With the failure of Sora, I don't think they have good options left.


I generated a fair number of videos with Sora, and edited a handful of them outside of Sora into a couple of short TikTok videos.

Never once did I bother to browse videos made by others on Sora itself. I wonder if anyone did.


Same. I pretty much only watch videos I generate.


The fact that the LLM appears to never assign an actual 0 or 10 makes me suspicious. Especially when the prompt includes explicit examples of what counts as a 10.

