computerex · 2026-06-09T17:44:04 1781027044

The flip side is that benchmarks are gamed even by the top labs. Benchmark performance doesn't necessarily correlate with real-world performance.

aspenmartin · 2026-06-09T18:00:30 1781028030

Again, correct, but it overstates the issue. I can say labs don’t want this. It happened, arguably unintentionally, in Meta’s Llama 4 release: it went horribly, heads rolled, several billion dollars were paid for new talent, and the org that built Llama 4 was destroyed.

Evals come from a million places, and new evals and robust perturbations of existing evals abound. They test a variety of tasks in a variety of ways. All of them individually are flawed. Taken together, the aggregate signal is highly useful, as you more or less marginalize over a lot of different things. Not to mention that these companies have plenty of proprietary internal measurements: they build benchmarks themselves to probe their models, and they also have flywheel traffic and A/B tests.

You are right to call out benchmarks but to dismiss them or not take them seriously is a mistake.
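A rough sketch of the "aggregate signal" argument, under a strong simplifying assumption that is not from the thread: each benchmark's error is independent and equally sized. Under that assumption the noise in the averaged score shrinks with the square root of the number of benchmarks. The noise level of 5.0 points is a made-up illustration, and real benchmark errors are often correlated, so the real gain is smaller.

```python
import math

def aggregate_noise(per_benchmark_noise: float, n_benchmarks: int) -> float:
    """Standard error of the mean of n independent, equally noisy scores."""
    return per_benchmark_noise / math.sqrt(n_benchmarks)

# Hypothetical: every individual benchmark is off by about 5 points.
for n in (1, 4, 25, 100):
    print(n, aggregate_noise(5.0, n))
```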

taormina · 2026-06-09T18:11:39 1781028699

Listen, you can say “but benchmarks, the benchmarks!” all day long, but consumers know when we are being sold a lemon. If it can’t do the most basic of things at least as good as it used to, this is table stakes. Never mind that if you can’t do the basic stuff, how on earth can you be trusted with more?

aspenmartin · 2026-06-09T18:37:16 1781030236

And you can say “If it can’t do the most basic of things at least as good as it used to, this is table stakes” all day long while people point you to much better evidence to the contrary too. I’d rather be on the other side of that.

taormina · 2026-06-09T19:10:36 1781032236

Listen. I don’t care about evidence. I care about my lived experience with the product I paid for. I used the new product. It’s actively terrible. To the point of not being usable. We’re all anecdata, but what is “better evidence to the contrary”? The known, gameable benchmarks that they know they need to win at, so they train it to? It’s all he said, she said, which is the only reason we keep having this conversation.

aspenmartin · 2026-06-09T19:24:49 1781033089

Yeah, but it’s not, right? You or I or the myriad other institutions inside and outside of academia can probe these models with an evolving landscape of evaluation sets, even ones unavailable to the developers. It’s just ignorance to claim benchmarks are somehow useless or all being gamed. Choose your tools however you want, but just don’t call them somehow better than a myriad of more carefully constructed setups and scaled evaluations.

computerex · 2026-06-07T15:39:34 1780846774

Not true. Together AI, DeepInfra, and Fireworks AI offer a wide range of models like gpt-oss that are very capable and far cheaper than the models from the big three.

Der_Einzige · 2026-06-07T21:48:24 1780868904

I'm referring to Chinese open source models hosted on American clouds vs. Chinese clouds. You're talking about an old American-produced model that isn't capable of agentic work.

computerex · 2026-06-07T23:11:47 1780873907

You are actually referring to open-weight models, not open source. gpt-oss is an example of an open-weight model. It’s highly capable in agentic settings; people use it for coding all the time.

My greater point remains. Models like the Qwen variants, MiniMax, K2.5, and the GLM models are available from American providers like AWS at a much cheaper price than the API offerings from the big three LLM providers.

Your point about Chinese models being cheap only on Chinese hardware makes absolutely zero sense. You can check out a model catalog, for example Together AI’s Qwen 3.5 9B offering. It’s 25 cents per 1M tokens vs. the ridiculous $5 per 1M tokens for Haiku.
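To put those two quoted prices side by side, here is a small sketch of the arithmetic. The per-million-token prices are the ones quoted in this comment and are not independently verified; the 50M-token monthly volume is a hypothetical workload, and the sketch ignores any split between input and output token pricing.

```python
def cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost of a token volume at a flat per-million-token price."""
    return tokens / 1_000_000 * price_per_million_usd

QWEN_PRICE = 0.25   # USD per 1M tokens, as quoted in the comment
HAIKU_PRICE = 5.00  # USD per 1M tokens, as quoted in the comment

monthly_tokens = 50_000_000  # hypothetical workload

print(cost_usd(monthly_tokens, QWEN_PRICE))   # 12.5
print(cost_usd(monthly_tokens, HAIKU_PRICE))  # 250.0
print(HAIKU_PRICE / QWEN_PRICE)               # 20.0
```

At the quoted prices the gap is a factor of 20, whatever the volume.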

zozbot234 · 2026-06-07T23:55:17 1780876517

Not a great example: Qwen 9B is a tiny model that outputs barely coherent text in a casual chat, nowhere near comparable to Haiku. But the broader point stands.

computerex · 2026-06-08T14:51:30 1780930290

I am not sure whether you are testing Qwen 3.5 9B. I would also verify that you are running it correctly. Qwen 3.5 9B is actually a very capable coding model that can do agentic coding, although it’s obviously not as good as Opus.

You can look up the benchmarks on that model as well. Your experience does not align with mine.

cactusplant7374 · 2026-06-07T15:49:23 1780847363

Are they better? Are they better than GPT5.5?

computerex · 2026-06-07T16:30:52 1780849852

That depends on the use case. For a lot of business use cases they are good enough. They are certainly better than older models like gpt-4o.

computerex · 2026-05-21T06:23:58 1779344638

I feel like that’s already becoming true. I sometimes work on problems/projects where the AI agent is definitely more qualified than me to call the shots.

For example, this deep learning library is 100% AI-generated and far beyond my technical capabilities.

https://github.com/computerex/dlgo

RealityVoid · 2026-05-21T08:52:07 1779353527

I find AI a great scaffolding for improving understanding and mental models. BUT! It's all in how you use it.

sigmoid10 · 2026-05-21T10:01:47 1779357707

The real question is: Do you need to understand it fully for it to improve your life?

For example, if you're in fundamental science (or generally a fan of reductionism), it for sure would be nice to understand the universe instead of just having access to an AI that can comprehend it. But to the majority of the population it only matters that someone (or something) understands it enough to make it useful to others.

RealityVoid · 2026-05-21T19:56:49 1779393409

Understanding everything fully is futile. But there are many, many, many things that improve your life once you understand them. So I feel the question is... not useful, I would say. Yes, you need to search for things that would improve your life if you knew them. No, you cannot know them all beforehand. Yes, there are such things. There always are.

sigmoid10 · 2026-05-23T17:01:57 1779555717

They only improve your life if you actually work on something that you yourself are trying to improve. Most people are fine with the status quo, so if something like LLMs can take over the understanding of complex tasks, they won't even notice, except for the fact that more of these tasks will get done.

RealityVoid · 2026-05-26T22:04:42 1779833082

There are clearly things to understand more than just the immediate stuff you do for work. I think most people are thirsty for understanding, it's just... many times it's in other domains than you expect.

Schlagbohrer · 2026-05-21T12:49:15 1779367755

Reminds me of a Carl Sagan quote, that our society is built on science and technology yet few understand it.

felixg3 · 2026-05-21T17:06:18 1779383178

LLMs are a mirror of the user's input capabilities, like every other computer programme.

computerex · 2026-05-19T20:34:16 1779222856

I have never had a good experience with any Google models in coding. Particularly for hard coding problems, there is a night-and-day difference between Opus and Gemini in my experience.

computerex · 2026-05-12T19:13:44 1778613224

That’s a western perspective because we are spoiled and have no thought for sustainability.

Please take a look at poor countries of the world like Pakistan. They have a repair culture. They have vehicles from the ’80s out on the road doing daily driving work instead of being used as vintage showpieces. It’s a poor country, so this is a necessity. Nevertheless, seeing the repair culture there in contrast to the disposable culture of the Western world gives me pause.

xtracto · 2026-05-12T19:23:05 1778613785

This... I wonder why there isn't a market in Tijuana, Juarez, and other border towns for fixing broken electronics and similar appliances.

Here in Mexico there are plenty of "unofficial" laptop and mobile (Apple, Windows, Android) repair shops that even receive your device by DHL/UPS, fix it, and return it, because the labor costs are low enough to make it worth it. The only downside is that most of the spare parts are imported from the US.

carlosjobim · 2026-05-12T19:23:21 1778613801

In Western countries, the time of skilled repairmen is better spent repairing things which are much more important and expensive than consumer goods.

And a consumer usually has a much higher return from working in his specialized field to earn money and buy a new product, than spending time with difficult repairs of a broken product.

JumpCrisscross · 2026-05-12T19:25:47 1778613947

Yeah, this is entirely a function of labor costs. If you want your stuff repaired, ship it to a low-labor-cost economy or hire someone to whom it’s worth the time.

carlosjobim · 2026-05-12T21:06:23 1778619983

To add to that: labour should be expensive. Lower repairability of consumer goods is a side effect worth accepting for that benefit.

hx8 · 2026-05-12T20:41:18 1778618478

Just to take one step further, labor costs are largely a function of local real estate costs.

JumpCrisscross · 2026-05-12T21:36:16 1778621776

> labor costs are largely a function of local real estate costs

Difficult to determine causality in that system. All we can say is places with expensive labour tend to have expensive real estate. (The confounding variable, I imagine, is immigration.)

keybored · 2026-05-21T09:26:37 1779355597

The practice of planned obsolescence means that there is more to it than just that.

computerex · 2026-05-12T19:07:08 1778612828

The site crashes on my 2020 iPhone SE.

layer8 · 2026-05-12T19:25:55 1778613955

The product is for Android users.

computerex · 2026-05-13T19:34:39 1778700879

That's fine, wish the site worked so I didn't have to be told that.

computerex · 2026-05-11T18:17:14 1778523434

Okay, now I will be supporting Azure products and will try to bring them into my workplace over AWS/Google Cloud.

orochimaaru · 2026-05-11T18:27:13 1778524033

Why? Microsoft probably just hasn’t prioritized Nimbus participation over their other construction work. They probably haven’t yet set up the subsidiary structure or key-sharing agreements that would allow them to participate either.

Sooner or later they’ll participate. And then you would have moved your workload for no reason.

pnemonic · 2026-05-11T18:27:27 1778524047

I wouldn't be so sure. The departure of these guys only opens new room for less 'pro-ethics' corpos to replace them.

danudey · 2026-05-11T18:44:57 1778525097

The reason cited for this whole fiasco is that some of the Ministry of Defense's genocide work could be performed by servers in the EU, which could expose Microsoft to legal or regulatory issues.

It's not that Microsoft was against this, it's that Microsoft was against themselves getting in trouble for this with the EU.

j_maffe · 2026-05-11T21:08:44 1778533724

Well, they did put in their contracts with the Israeli government that their services can't be used for mass surveillance, which makes them slightly less evil than Google/Amazon.

computerex · 2026-04-27T18:58:07 1777316287

There is no way to make that cost model consistently profitable. If one prompt can mean hundreds or thousands of requests over hours, and you only pay for that one premium prompt, that can never be profitable.
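A minimal sketch of that argument, with made-up numbers: a flat price per prompt against a per-request serving cost. The 4-cent price and 1-cent cost are hypothetical and are not taken from any provider's pricing.

```python
def margin_per_prompt_cents(price_cents: int, requests: int, cost_per_request_cents: int) -> int:
    """Provider margin on one flat-priced prompt that fans out into many requests."""
    return price_cents - requests * cost_per_request_cents

def break_even_requests(price_cents: int, cost_per_request_cents: int) -> float:
    """Number of requests at which the flat price exactly covers the serving cost."""
    return price_cents / cost_per_request_cents

# Hypothetical: 4 cents per premium prompt, 1 cent to serve each request.
print(break_even_requests(4, 1))            # 4.0
print(margin_per_prompt_cents(4, 100, 1))   # -96
print(margin_per_prompt_cents(4, 1000, 1))  # -996
```

With these numbers, any prompt that triggers more than four requests loses money, and the loss grows linearly with how long the agent keeps working.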

ThunderSizzle · 2026-04-28T08:21:58 1777364518

They can engineer the harness to limit the amount of work it does. When pressing enter, it'd be nice to have a "budget" per prompt, much like the model multiplier. When the harness has used up the budget, it cleans up and cuts off the work.

But that would entail actual work and effort... and care for users' time and money.
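One way such a per-prompt budget could look, as a minimal sketch only. The names `run_with_budget` and `cleanup` are hypothetical and do not come from any real harness; each step is modelled as a callable that returns its own cost in whatever unit the budget uses (tokens, requests, cents).

```python
from typing import Callable, Iterable, Tuple

def run_with_budget(
    steps: Iterable[Callable[[], int]],
    budget: int,
    cleanup: Callable[[], None] = lambda: None,
) -> Tuple[int, int]:
    """Run steps until the budget is used up, then clean up and stop.

    Returns (steps_completed, budget_spent). The check happens before each
    step, so spending can overshoot the budget by at most one step's cost.
    """
    spent = 0
    completed = 0
    try:
        for step in steps:
            if spent >= budget:
                break  # budget exhausted: cut off remaining work
            spent += step()
            completed += 1
    finally:
        cleanup()  # always runs, even if a step raises
    return completed, spent

# Hypothetical agent loop: ten steps costing 300 units each, budget of 1000.
steps = [lambda: 300 for _ in range(10)]
result = run_with_budget(steps, budget=1000)
print(result)  # (4, 1200)
```

A stricter variant would need each step to report an estimated cost up front so the harness can refuse a step that would overshoot.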

computerex · 2026-04-22T01:31:28 1776821488

Model selection for day to day tasks based on vibes is not very scientific. Micromanaging the model doesn't seem like a great idea when doing real professional work with professional goals/deadlines/pressures.

selcuka · 2026-04-22T01:58:26 1776823106

> Micromanaging the model doesn't seem like a great idea when doing real professional work with professional goals/deadlines/pressures.

Remember that it's not only the cost per token, but also speed. Some tasks are done faster with simpler/less-thinking models, so it might actually make sense to micromanage the model when you have deadlines.

computerex · 2026-04-22T03:28:12 1776828492

If you're using the models to generate 99%-100% of the code, then it doesn't make sense to plug yourself into the loop as a bottleneck.

timr · 2026-04-22T01:39:11 1776821951

It’s deeply ironic that the folks who want to outsource as much thought to the model as possible are saying that my stance - use your brain to decide the right tool for the job - is tantamount to “vibes”.

computerex · 2026-04-22T03:25:36 1776828336

You are being deeply reductive, and that's against the spirit of Hacker News. The issue is that models are difficult to benchmark objectively. The benchmarks don't always align with real-world performance. It's not easy or clear-cut to determine which model will work best in a given situation. It boils down to loose experiences and anecdotes. Do you have objective criteria for model selection that you have shown to be effective with reproducible tests?

computerex · 2026-04-22T01:08:25 1776820105

What about the R&D costs of blowing up vehicle after vehicle?

jdross · 2026-04-22T02:00:56 1776823256

They have over 300 Falcon 9 launches in a row now, just in case you’re not caught up on the latest.

metabagel · 2026-04-22T02:16:05 1776824165

C'mon, you know they're talking about Starship.

inemesitaffia · 2026-04-22T02:40:22 1776825622

It's less than the yearly cost of ground stations (just under 1 million/year per installation)

5 million over 5 years capex+opex. Mostly opex

It's also a troll post