Hacker News | computerex's comments

The flip side is that benchmarks are gamed even by the top labs. Benchmark performance doesn't necessarily correlate with real world performance.

Again, correct, but it overstates the issue. I can say labs don’t want this. This arguably happened unintentionally in Meta's Llama 4 release: it went horribly, heads rolled, several billion dollars were paid for new talent, and the org that built Llama 4 was destroyed.

Evals come from a million places and new evals and robust perturbations of existing evals abound. They test a variety of tasks in a variety of ways. All of them individually are flawed. Taken together the aggregate signal is highly useful as you more or less marginalize over a lot of different things. Not to mention these companies have plenty of proprietary internal measurements, they build benchmarks themselves to probe their models and then also have flywheel traffic and A/B tests.

You are right to call out benchmarks but to dismiss them or not take them seriously is a mistake.


Listen, you can say “but benchmarks, the benchmarks!” all day long, but consumers know when we are being sold a lemon. If it can’t do the most basic of things at least as good as it used to, this is table stakes. Never mind that if you can’t do the basic stuff, how on earth can you be trusted with more?

And you can say “If it can’t do the most basic of things at least as good as it used to, this is table stakes” all day long while people point you to much better evidence to the contrary, too. I’d rather be on the other side of that.

Listen. I don’t care about evidence. I care about my lived experience with the product I paid for. I used the new product. It’s actively terrible, to the point of not being usable. We’re all anecdata, but what is “better evidence to the contrary”? The known, gameable benchmarks that they know they need to win at, so they train for them. It’s all he said, she said, which is the only reason we keep having this conversation.

Yeah, but it’s not, right? You or I or any of the myriad other institutions inside and outside academia can probe these models with an evolving landscape of evaluation sets, even ones unavailable to the developers. It’s just ignorance to claim benchmarks are somehow useless or all being gamed. Choose your tools however you want, but don’t call that somehow better than a myriad of more carefully constructed setups and scaled evaluations.

Not true. Together AI, DeepInfra, and Fireworks AI offer a wide range of models like gpt-oss that are very capable and far cheaper than the models from the big three.

I'm referring to Chinese open source models hosted on American clouds vs Chinese clouds. You're talking about an old American-produced model that isn't agentic-capable.

You are actually referring to open weight models, not open source. gpt-oss is an example of an open weight model. It’s highly capable in agentic settings; people use it for coding all the time.

My greater point remains. Models like the Qwen variants, MiniMax, K2.5, and the GLM models are available from American providers like AWS at a much cheaper price than the API offerings from the big three LLM providers.

Your point about Chinese models being cheap only on Chinese hardware makes absolutely zero sense. You can check a model catalog, for example Together AI’s Qwen 3.5 9B offering. It’s 25 cents per 1M tokens vs the ridiculous $5 per 1M tokens for Haiku.
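To put that gap in numbers, here is a rough sketch. The two prices are the ones quoted above; the monthly token volume is a made-up figure for illustration, and the single flat per-token rate is an assumption (real price sheets usually split input and output tokens).

```python
def cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost in USD of `tokens` tokens at a flat per-million-token price."""
    return tokens / 1_000_000 * price_per_million_usd


# Prices as quoted in the comment above, in USD per 1M tokens.
QWEN_PRICE = 0.25
HAIKU_PRICE = 5.00

# Hypothetical workload, chosen only for illustration.
monthly_tokens = 200_000_000

qwen_cost = cost_usd(monthly_tokens, QWEN_PRICE)
haiku_cost = cost_usd(monthly_tokens, HAIKU_PRICE)
price_ratio = HAIKU_PRICE / QWEN_PRICE

print(f"Qwen:  ${qwen_cost:,.2f}")
print(f"Haiku: ${haiku_cost:,.2f}")
print(f"Ratio: {price_ratio:.0f}x")
```

At those quoted rates the ratio is 20x regardless of volume; whether the cheaper model is good enough for the job is a separate question.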


Not a great example: Qwen 9B is a tiny model that outputs barely coherent text in a casual chat, nowhere near comparable to Haiku. But the broader point stands.

I am not sure you are testing Qwen 3.5 9B. I would also verify that you are running it correctly. Qwen 3.5 9B is actually a very capable coding model that can do agentic coding, although it’s obviously not as good as Opus.

You can look up the benchmarks on that model as well. Your experience does not align with mine.


Are they better? Are they better than GPT5.5?

That depends on the use case. For a lot of business use cases they are good enough. They are certainly better than older models like gpt-4o.

I feel like that’s already becoming true. I sometimes work on problems/projects where the AI agent is definitely more qualified than me to call the shots.

For example, this deep learning library is 100% AI-generated and far beyond my technical capabilities.

https://github.com/computerex/dlgo


I find AI a great scaffolding for improving understanding and mental models. BUT! It's all in how you use it.


The real question is: Do you need to understand it fully for it to improve your life?

For example, if you're in fundamental science (or generally a fan of reductionism), it for sure would be nice to understand the universe instead of just having access to an AI that can comprehend it. But to the majority of the population it only matters that someone (or something) understands it enough to make it useful to others.


Understanding everything fully is futile. But there are many, many things that improve your life once you understand them. So I feel the question is... not useful, I would say. Yes, you need to search for the things that would improve your life if you knew them. No, you cannot know them all beforehand. Yes, there are such things. There always are.


They only improve your life if you actually work on something that you yourself are trying to improve. Most people are fine with the status quo, so if something like LLMs can take over the understanding of complex tasks, they won't even notice, except for the fact that more of these tasks will get done.


There are clearly things to understand more than just the immediate stuff you do for work. I think most people are thirsty for understanding, it's just... many times it's in other domains than you expect.

Reminds me of a Carl Sagan quote, that our society is built on science and technology yet few understand it.


LLMs are a mirror of the user’s input capabilities, like every other computer programme.


I have never had a good experience with any Google models in coding. Particularly for coding hard stuff, there is a night and day difference between Opus and Gemini in my experience.


That’s a western perspective because we are spoiled and have no thought for sustainability.

Please take a look at poor countries of the world like Pakistan. They have a repair culture. They have vehicles from the ’80s out on the road doing daily driving work instead of being used as vintage show pieces. It’s a poor country, so this is a necessity. But nevertheless, seeing the repair culture there in contrast to the disposable culture in the western world makes me pause.


This... I wonder why there isn't a market in Tijuana, Juarez and other border towns for fixing broken electronics and similar appliances.

Here in Mexico there are plenty of "unofficial" laptop and mobile (Apple, Windows, Android) repair shops that will even receive your device by DHL/UPS, fix it and return it, because the labor costs are low enough to make it worth it. The only downside is that most of the spare parts are imported from the US.


In Western countries, the time of skilled repairmen is better spent repairing things which are much more important and expensive than consumer goods.

And a consumer usually has a much higher return from working in his specialized field to earn money and buy a new product, than spending time with difficult repairs of a broken product.


Yeah, this is entirely a function of labor costs. If you want your stuff repaired, ship it to a low-labor cost economy or hire someone to whom it’s worth the time.


To add to that: labour should be expensive. And lower repairability of consumer goods is a side effect that is worth dealing with for that benefit.


Just to take one step further, labor costs are largely a function of local real estate costs.


> labor costs are largely a function of local real estate costs

Difficult to determine causality in that system. All we can say is places with expensive labour tend to have expensive real estate. (The confounding variable, I imagine, is immigration.)


The practice of planned obsolescence means that there is more to it than just that.


The site crashes on my 2020 iPhone SE.


The product is for Android users.


That's fine, wish the site worked so I didn't have to be told that.


Okay, now I will be supporting Azure products and will try to bring them into my workplace over AWS/Google Cloud.


Why? Microsoft probably just hasn’t prioritized Nimbus participation over their other construction work. They probably haven’t yet set up the right subsidiary structure or key sharing agreements that would allow them to participate either.

Sooner or later they’ll participate. And then you would have moved your workload for no reason.


I wouldn't be so sure. The departure of these guys only opens new room for less 'pro-ethics' corpos to replace them.


The reason cited for this whole fiasco is that some of the Ministry of Defense's genocide work could be performed by servers in the EU, which could expose Microsoft to legal or regulatory issues.

It's not that Microsoft was against this, it's that Microsoft was against themselves getting in trouble for this with the EU.


Well, they did put in their contracts with the Israeli government that their services can't be used for mass surveillance, which makes them slightly less evil than Google/Amazon.


There is no way to make that cost model consistently profitable. If one prompt can mean hundreds or thousands of requests over hours, and you only pay for that one premium prompt, that can never be profitable.
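A back-of-the-envelope sketch of that argument. Every number here is hypothetical (nobody in this thread has the provider's real per-request cost); the point is only how the margin behaves as the request count grows.

```python
def margin_cents(price_per_prompt: int, requests: int, cost_per_request: int) -> int:
    """Provider's margin on one prompt, in cents (negative means a loss)."""
    return price_per_prompt - requests * cost_per_request


def break_even_requests(price_per_prompt: int, cost_per_request: int) -> float:
    """Number of backend requests at which one prompt stops being profitable."""
    return price_per_prompt / cost_per_request


# Hypothetical figures, in cents, for illustration only.
PRICE_PER_PROMPT = 10   # what the user pays for one premium prompt
COST_PER_REQUEST = 2    # what one model request costs the provider

print(break_even_requests(PRICE_PER_PROMPT, COST_PER_REQUEST))
for n in (1, 100, 1000):
    print(n, margin_cents(PRICE_PER_PROMPT, n, COST_PER_REQUEST))
```

With a flat price per prompt, the break-even point is a fixed, small number of requests, so a long agentic run is a loss on any per-request cost above zero.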


They can engineer the harness to limit how much it does. When pressing enter, it'd be nice to have a "budget" per prompt, much like the model multiplier. When the harness has used up the budget, it cleans up and cuts off the work.

But that would entail actual work and effort...and care for users' time and money.
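A minimal sketch of what such a per-prompt budget could look like. This is not how any real harness works; `PromptBudget`, `run_with_budget`, and the step/cleanup callables are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class PromptBudget:
    """Caps how many model requests a single user prompt may trigger."""
    max_requests: int
    used: int = 0

    def try_spend(self) -> bool:
        """Reserve one request; return False once the budget is exhausted."""
        if self.used >= self.max_requests:
            return False
        self.used += 1
        return True


def run_with_budget(
    steps: Iterable[Callable[[], str]],
    budget: PromptBudget,
    cleanup: Callable[[], None],
) -> List[str]:
    """Run agent steps until they finish or the budget runs out, then clean up."""
    results: List[str] = []
    try:
        for step in steps:
            if not budget.try_spend():
                break  # budget used up: cut off the remaining work
            results.append(step())
    finally:
        cleanup()  # always runs, whether we finished or were cut off
    return results


# Usage: ten planned steps, but the prompt is only allowed three requests.
log: List[str] = []
steps = [lambda i=i: f"request {i}" for i in range(10)]
done = run_with_budget(steps, PromptBudget(max_requests=3), lambda: log.append("cleaned up"))
print(done, log)
```

Counting requests is the simplest unit; the same shape would work with tokens or dollars as the budget.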


Model selection for day to day tasks based on vibes is not very scientific. Micromanaging the model doesn't seem like a great idea when doing real professional work with professional goals/deadlines/pressures.


> Micromanaging the model doesn't seem like a great idea when doing real professional work with professional goals/deadlines/pressures.

Remember that it's not only the cost per token, but also speed. Some tasks are done faster with simpler/less-thinking models, so it might actually make sense to micromanage the model when you have deadlines.


If you're using the models to generate 99%-100% of the code, then it doesn't make sense to plug yourself into the loop as a bottleneck.


It’s deeply ironic that the folks who want to outsource as much thought to the model as possible are saying that my stance - use your brain to decide the right tool for the job - is tantamount to “vibes”.


You are being deeply reductive, and that's against the spirit of Hacker News. The issue is that models are difficult to benchmark objectively. The benchmarks don't always align with real world performance. It's not easy or clear-cut to determine which model will work best in a given situation. It boils down to loose experiences and anecdotes. Do you have objective criteria for model selection that you have shown to be effective with reproducible tests?


What about the R&D costs of blowing up vehicle after vehicle?


They have over 300 Falcon 9 launches in a row now, just in case you’re not caught up on the latest.


C'mon, you know they're talking about Starship.


It's less than the yearly cost of ground stations (just under 1 million/year per installation)

5 million over 5 years capex+opex. Mostly opex

It's also a troll post

