
Using verified V4 pricing compared to Anthropic Claude:

vs Haiku 4.5: 3.3x cheaper input, 10x cheaper output

vs Sonnet 4.6: 10x cheaper input, 30x cheaper output

vs Opus 4.7: 17x cheaper input, 50x cheaper output

Mind-blowingly cheaper by comparison.


Thanks for the bit of nostalgia today OP. I remember the first time that I saw that browser screen. Pure discovery back in those early days of the web. I can still hear the dial-up modem crackling...

Yes, that crackling! God, that shriek. You hear it and it just pulls you right back. Like some kind of rhythm that told you something was happening, something exciting was about to occur.

For us bridge-generation kids, that sound is probably etched like vinyl. Quiet room, 2 AM, and then that thrum, shriek, and hiss. I genuinely missed it when modems became obsolete, whatever replaced them. It was sad. It was the audio reminder, the signal hanging in the air, of the literal lifeline out of your analog bedroom and into a cosmos filled with electricity, buzzing with knowledge and light.

For me, half the experience of that era was purely sensory. The clunky physical sounds of the machine doing the heavy lifting to connect you... the clunky graphics... the need to wait... the gradual adjustment to the pace of it all. Those "reduce speed" effects were a gentle introduction to the threshold moment that it was, somehow just the right gentleness for taking you on such an epic journey.

I have labored a lot to recapture that feeling. Across many projects. Idk why exactly, but there was something so hopeful and exciting about the internet at that time. And I know it's worth remembering. Like a precious flame you have to protect from the rain, I guess. Check out this one: https://win9-5.com/desktop.html

Just a small set of experiments to see if I can grab that feeling. The modem sound evokes the vibe. Browsing the modern web with it is a little strange, if you can pull off that "in the gallery, watching the walls between the paintings" kind of mind trick and not focus too much on the web content itself (which is designed to always suck you in, even framed retro like this).


Incredible. "Hi honey, what did you do at work today? Casually discovered the edge of the galaxy. How are you?"

I think they have been looking for the edge for years, and the discovery came gradually over time. So I don't think "casually" fits there, and "today" doesn't make it any better.

Insert obligatory "this is the way" Mando scene. Indeed!

How does it compare to Opus 4.7? I've been immersed in 4.7 all week participating in the Anthropic Opus 4.7 hackathon, and it's pretty impressive, even if it's ravenous from a token perspective compared to 4.6.

The thing is, it doesn't need to beat 4.7. It just needs to do somewhat well against it.

This is free... as in you can download it, run it on your systems and finetune it to be the way you want it to be.


> you can download it, run it on your systems

In theory, sure, but as others have pointed out, you need to spend half a million on GPUs just to get enough VRAM to fit a single instance of the model. And you’d better make sure your use case makes full 24/7 use of all that rapidly-depreciating hardware you just spent all your money on, otherwise your actual cost per token will be much higher than you think.

In practice you will get better value from just buying tokens from a third party whose business is hosting open weight models as efficiently as possible and who make full use of their hardware. Even with the small margin they charge on top you will still come out ahead.


There are a lot of companies who would gladly drop half a million on a GPU to have private inference that Anthropic or OpenAI can’t use to steal their data.

And that hardware wouldn’t run just one instance; the models are highly parallelizable. It would likely support 10-15 users at once, and if a company oversubscribes 10:1, that supports ~100 seats. Amortized over a couple of years, the costs are competitive.
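The seat math above can be sketched as a quick back-of-the-envelope. Every number here is an assumption taken from this thread (hardware price, concurrency, oversubscription ratio, amortization window), not a vendor quote:

```python
# Rough per-seat cost sketch. All inputs are thread assumptions for illustration.
hardware_cost = 500_000        # 8-GPU box, as discussed above
amortization_years = 3
concurrent_users = 12          # ~10-15 simultaneous batched requests
oversubscription = 10          # not every seat is prompting at once
seats = concurrent_users * oversubscription   # ~120 seats

cost_per_seat_month = hardware_cost / (amortization_years * 12 * seats)
print(f"{seats} seats, ~${cost_per_seat_month:.0f}/seat/month")  # ~$116/seat/month
```

At roughly $116 per seat per month (before power, networking, and ops labor), the claim that this is competitive with per-seat SaaS pricing is at least plausible.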


> There are a lot of companies who would gladly drop half a million on a GPU to have private inference that Anthropic or OpenAI can’t use to steal their data.

Obviously, and certainly companies do run their own models because they place some value on data sovereignty for regulatory or compliance or other reasons. (Although the framing that Anthropic or OpenAI might "steal their data" is a bit alarmist - plenty of companies, including some with _highly_ sensitive data, have contracts with Anthropic or OpenAI that say they can't train future models on the data they send them and are perfectly happy to send data to Claude. You may think they're stupid to do that, but that's just your opinion.)

> the models are highly parallelizable. It would likely support 10-15 users at once.

Yes, I know that; I understand LLM internals pretty well. One instance of the model in the sense of one set of weights loaded across X number of GPUs; of course you can then run batch inference on those weights, up to the limits of GPU bandwidth and compute.

But are those 100 users you have on your own GPUs using the GPUs evenly across the 24 hours of the day, or are they only using them during 9-5 in some timezone? If it's the latter, you're leaving your expensive hardware idle for 2/3 of the day, and the third-party providers hosting open weight models will still beat you on costs, even without getting into other factors like them having bought their GPUs cheaper than you did. Do the math if you don't believe me.
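The utilization argument can be made concrete with a toy calculation. All figures are placeholder assumptions (the $500k box from upthread, a 3-year amortization window), not real pricing:

```python
# Toy comparison: cost per useful hour for an owned box at 24/7 vs. 9-5 usage.
# Every number here is a placeholder assumption for illustration.
hardware_cost = 500_000             # upfront, amortized over 3 years
total_hours = 3 * 365 * 24          # hours in the amortization window
owned_cost_per_hour = hardware_cost / total_hours

business_hours_fraction = 8 / 24    # 9-5 usage only
# Cost per *useful* hour triples when the box idles 2/3 of the day:
effective_cost = owned_cost_per_hour / business_hours_fraction
print(f"${owned_cost_per_hour:.2f}/useful-hour at 24/7 use, "
      f"${effective_cost:.2f}/useful-hour at 9-5 use")
```

That 3x multiplier on cost per useful hour is the gap a well-utilized third-party host gets to undercut you with.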


There's stuff like SOC controls and enterprise contracts with enforceable penalties if clauses are breached. ZDR is a thing.

The most significant value of open source models comes from being able to fine-tune. With a good dataset and a limited scope, a fine-tune can be crazily worth it.


Sure, but that’s an incredibly short term viewpoint.

Do you think a lot of people have “systems” to run a 1.6T model?

To me, the important thing isn't that I can run it, it's that I can pay someone else to run it. I'm finding Opus 4.7 seems to be weirdly broken compared to 4.6; it just doesn't understand my code and breaks it whenever I ask it to do anything.

Now, at the moment, I can still use 4.6, but eventually Anthropic are going to remove it, and when it's gone, it will be gone forever. I'm planning on trying DeepSeek v4, because even if it's not quite as good, I know it will be available forever; I'll always be able to find someone to run it.


Yep, it's wild how little emphasis there is on control and replicability in these posts.

Already, these models are useful for a myriad of use cases. It's really not that important whether a model can 1-shot a particular problem or draw a cuter pelican on a bike. Past a certain degree of quality, process and reliability are so much more important for anything other than completely hands-off usage, which isn't something you're really going to do in business.

The fact that my tool may be gone tomorrow (and this has actually happened before), with no guarantee of a proper substitute... that's a lot more of a concern than an extra point on some benchmark.


No, but businesses do. Being able to run quality LLMs without your business, or business's private information, being held at the mercy of another corp has a lot of value.

What type of system is needed to self host this? How much would it cost?

Depends on how many users you have and what "production grade" means for you, but ~$500k gets you an 8x B200 machine.

Depends on how fast you want it to be. I’m guessing a couple of $10k Mac Studio boxes could run it, but probably not fast enough to enjoy using it.

One GB200 NVL72 from Nvidia would do it. $2-3 million, or so. If you're a corporation, say Walmart or PayPal, that's not out of the question.

If you want to go budget corporate, 7x H200 is just barely going to run it, but all in, $300k ought to do it.


How many users can you serve with that?

For the H200, between 150-700. The GB200 gets you something like 2-10k users.

$20K worth of RTX 6000 Blackwell cards should let you run the Flash version of the model.

Not really - on-prem LLM hosting is extremely labor- and capital-intensive.

But can be, and is, done. I work for a bootstrapped startup that hosts a DeepSeek v3 retrain on our own GPUs. We are highly profitable. We're certainly not the only ones in the space, as I'm personally aware of several other startups hosting their own GLM or DeepSeek models.

Why a retrain? What are you using the model for?

Completely agree, not suggesting it needs to, just genuinely curious. Love that it can be run locally, though. Open source LLMs have been punching back pretty hard against proprietary ones in the cloud lately in terms of performance.

What's the hardware cost of running it?

Probably like 100 USD/hour

I was curious, and some [intrepid soul](https://wavespeed.ai/blog/posts/deepseek-v4-gpu-vram-require...) did an analysis. Assuming you do everything perfectly and take full advantage of the model's MoE sparsity, it would take:

- To run at full precision: "16–24 H100s", giving us ~$400-600k upfront, or $8-12/h from [us-east-1](https://intuitionlabs.ai/articles/h100-rental-prices-cloud-c...).

- To run with "heavy quantization" (16 bits -> 8): "8xH100", giving us $200K upfront and $4/h.

- To run truly "locally"--i.e. in a house instead of a data center--you'd need four 4090s, one of the most powerful consumer GPUs available. Even that would clock in around $15k for the cards alone and ~$0.22/h for the electricity (in the US).
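For a rough sense of where figures like these come from, weight memory is roughly parameter count times bytes per parameter, ignoring KV cache and activation overhead. A minimal sketch, using the article's claimed 671B figure (which replies below dispute) and the 1.6T figure mentioned elsewhere in the thread, with 80GB GPUs assumed:

```python
# Back-of-the-envelope VRAM estimate: weights only, no KV cache or activations.
def gpus_needed(params_billion, bytes_per_param, gpu_vram_gb=80):
    """Minimum GPU count, by VRAM alone, to hold the weights (80GB = H100-class)."""
    weight_gb = params_billion * bytes_per_param  # 1B params at 2 bytes = ~2 GB
    return weight_gb / gpu_vram_gb

print(gpus_needed(671, 2))    # BF16, the article's claimed size: ~16.8 GPUs
print(gpus_needed(671, 1))    # 8-bit quantized: ~8.4 GPUs
print(gpus_needed(1600, 2))   # a 1.6T-parameter model in BF16: ~40 GPUs
```

The first two numbers line up with the article's "16-24 H100s" and "8xH100" claims, which is about what you'd expect from a weights-only estimate padded for serving overhead.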

Truly an insane industry. This is a good reminder of why datacenter capex since 2023 has eclipsed the Manhattan Project, the Apollo program, and the US interstate system combined...


All these numbers are peanuts to a mid-sized company. A place I worked at used to spend a couple million just for a support contract on a NetApp.

10 years from now that hardware will be on eBay for any geek with a couple thousand dollars and enough power to run it.


That article is a total hallucination.

"671B total / 37B active"

"Full precision (BF16)"

And they claim they ran this non-existent model on vLLM and SGLang over a month and a half ago.

It's clickbait keyword slop filled in with V3 specs. Most of the web is slop like this now. Sigh.


"if you have to ask..."

... if you have 800 GB of VRAM free.

I remember reading that some new frameworks have been coming out that let Macs stream the weights of huge models live from fast SSDs and produce quality output, albeit slowly. Apart from that... good luck finding that much available VRAM, haha.

Tbh I was more productive with 4.6 than ever before and if AI progress locks in permanently at 4.6 tier, I’d be pretty happy

It is more than good enough and has effectively caught up with Opus 4.6 and GPT 5.4 according to the benchmarks.

It's about 2 months behind GPT 5.5 and Opus 4.7.

As long as it is cheap to run for the hosting providers and it is frontier level, it is a very competitive model and impressive against the others. I give it 2 years maximum before consumer hardware can run quantized 500B-800B models.

It should be obvious now why Anthropic really doesn't want you to run local models on your machine.


Vibes > Benchmarks. And it's all so task-specific. Gemini 3 has scored very well on benchmarks for a long time but is poor at agentic use cases. A lot of people are preferring Opus 4.6 to 4.7 for coding despite the benchmarks, much more than I've seen before (4.5->4.6, 4->4.5).

Doesn't mean DeepSeek v4 isn't great; benchmarks alone just aren't enough to tell.


Given the ability of Qwen3.6 27B, I think in 2 years consumers will be running models of this capability on current hardware.

What's going to change in 2 years that would allow users to run 500B-800B parameter models on consumer hardware?

I think it's just an estimate.

But the question remains

Free to try yourself. Paste your prompt --> https://promptqualityscore.com

You summed it up perfectly for the time. It was such a relief, there wasn't anything like it. And now we're here.

What could go wrong on a planetary scale? FFS, it's going to be agentic battle royale everywhere.

It all depends on what you do, aka your use case. If you're in the content creation business, which is part of my responsibilities, then yes, it has been massively helpful. For other roles, I can absolutely see no use case or benefit. Context matters, like with everything.

Many comparisons between 4.6 & 4.7 at https://tokens.billchambers.me/leaderboard My prompt used 40% more tokens with Opus 4.7.
