Hacker News | thevinter's comments

Every time a new image gen comes out I keep saying that it won't get better just to be surprised again and again. Some of the examples are incredible (and incredibly scary. I feel like this is truly the point where understanding if something is AI becomes impossible)


So do you think there will be a better image model in a year?


I'll bite: no, I don't think so. If the examples are not cherry-picked, and by "image model" we mean just the ability to generate pictures, this looks like parity with human excellence; there isn't much room for further improvement. The images don't just look real, they look tasteful: the model is not just generating a credible image, it's generating one that shows the talent of a good photographer/designer/artist.


I'm honestly unsure what could be improved at this point.

Consistency? So it fails less often?

Based on the released images (especially the one "screenshot" of the Mac desktop), I feel like the best images from this model are so visually flawless that the only way to tell they're fake is by reasoning about the content of the image itself (e.g. "Apple never made a red iPhone 15, so this image is probably fake" or "Costco prices never end in .96, so this image is probably fake").


There is definitely room for improvement: https://gist.github.com/simonw/88eecc65698a725d8a9c1c918478a...

Especially when it comes to detailed outputs or non-standard prompts.

I do believe it will get even better - not sure it will happen within a year but I wouldn't be incredibly surprised if it did.


Yep. “Where’s Waldo” has been a classic challenge for generative models for a while because it requires understanding the entire concept (there’s only one Waldo), while also holding up to scrutiny when you examine any individual, ordinary figure.

I experimented with procedural generation of Waldo-style scavenger images with Flux models, with rather disappointing (if unsurprising) results.


That's a good example, actually.

If you asked me what I expected, given that this one has "thinking", it'd be that it would think to do something like generate the image without Waldo first, then insert Waldo somewhere into that image as an "edit".


I wonder if at this point you could just ask the agent to iteratively refine the image in smaller portions.


I've been impressed when testing this model today, but it still can't consistently adhere to the following prompt: make me an image of a pizza split into 10 equal slices with space in between them, to help teach fractions to a child.

It doesn't reliably give you 10 slices, even if you ask it to number them. None of the frontier models seem to be able to get this right.
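For comparison, ten equal slices with gaps are trivial to produce deterministically. A stdlib-only Python sketch (my own toy helper, nothing the commenter built) that emits an SVG of n equally sized wedges separated by a small angular gap:

```python
import math

def pizza_svg(n=10, r=90, gap_deg=4, cx=100, cy=100):
    """Emit an SVG string with n equal pizza slices; each wedge is
    shrunk by gap_deg degrees so the slices are visibly separated."""
    step = 360 / n                      # each slice spans 36 degrees for n=10
    paths = []
    for i in range(n):
        a0 = math.radians(i * step + gap_deg / 2)
        a1 = math.radians((i + 1) * step - gap_deg / 2)
        x0, y0 = cx + r * math.cos(a0), cy + r * math.sin(a0)
        x1, y1 = cx + r * math.cos(a1), cy + r * math.sin(a1)
        # move to centre, line to arc start, arc to arc end, close path
        paths.append(
            f'<path d="M{cx},{cy} L{x0:.1f},{y0:.1f} '
            f'A{r},{r} 0 0 1 {x1:.1f},{y1:.1f} Z" fill="peru"/>'
        )
    return ('<svg xmlns="http://www.w3.org/2000/svg" '
            'viewBox="0 0 200 200">' + "".join(paths) + "</svg>")

svg = pizza_svg()
print(svg.count("<path"))  # 10 -- exactly ten slices, every time
```

The point being: "10 equal slices" is a counting/geometry constraint that a few lines of code satisfy by construction, while diffusion models still only satisfy it probabilistically.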


Cost? Speed?


> I'm honestly unsure what could be improved at this point.

That's because you're focusing a little bit too much on visual fidelity. It's still relatively trivial to create a moderately complex prompt and have it fail miserably.

Even SOTA models only scored a 12 out of 15 on my benchmarks, and that was without me deliberately trying to "flex" to break the model.

Here's one I just came up with:

  A Mercator projection of Earth where the land/oceans are inverted. (aka land = ocean, and oceans = land)


Good point.

So I guess while "realism" (or believability) is really good now, prompt adherence has much room for improvement.

(Put it another way: realism has always been "solved" if the model gets to output whatever it wants, as long as it looks realistic. Now, though, a failure looks less like a malfunction and more like an inattentive human mistake or oversight, so even when the model gets it wrong it's hard to tell it's wrong without knowing what the prompt was.)


> it's hard to tell it's wrong without knowing what the prompt was.

Yeah, this is actually a huge point of frustration on Reddit, where lots of people post their "impressive" generative images but fail to disclose the prompts, so the audience can only evaluate realism/fidelity and not how faithfully the model actually followed the prompt.


It's a very simplistic and radical point of view that doesn't take into account the reality of the world we live in. It also doesn't take into account the intricacies of foreign politics and seems to assume that the gulf states are the only bad actors here. Finally "gulf states" is a catch-all so big that it's borderline funny. (What did Bahrain do?)


Thanks.

They make their money from oil, which we all buy.


You know being against slavery used to be considered radical, right?

If it’s so simplistic how about you explain why it’s so important for Saudi Arabia to perpetuate a genocide in order to acquire gold.

And how about you explain how it’s okay to be economically linked to that type of behavior by proxy. I’m assuming you have some kind of expertise in this subject matter as opposed to just vomiting up whatever the neoliberal talking head “experts” tell you to believe.


And no-one is preventing you from caring about those things. I build UIs with Claude a lot and I still spend a lot of the time thinking about the user experience and working with Claude to make an app as intuitive and easy to use as possible.


I do similar, but I dislike writing CSS because it's practically impossible to keep up with the standards. And because I dislike writing CSS I don't feel like writing HTML that much either.

Web Components were a bit too slow to take off, so the mental model of JSX has stuck with me, even if the ecosystem, with its hooks and various approaches to reactive state, is in many ways an inferior solution to a problem Smalltalk already solved back in the day.


Probably because the person wasn't interested in planning their vacation and wanted just to enjoy the end result?

Let's not assume different people find the same parts of the process enjoyable.


I lived for months with a 4GB roaming plan. Granted, I was not using it at home since I had a wifi connection, but I rarely came close to using all my data unless I was watching YT videos while traveling or something.

I share your sentiment and I agree we should be more mindful of people with metered/slow connections, but the last statement feels blown out of proportion.


I used to be able to get away with this by downloading music, podcasts and maps at home.

During the iOS 26 upgrade cycle, iOS deleted all my third-party map apps and then expired the locally downloaded apple maps. My phone also somehow lost my downloaded podcasts + music a few times, but, unlike losing three offline map applications, that didn't strand me in the middle of the woods with no cell coverage and no maps.

I agree that 4GB (or even 1GB) goes very far with a working phone OS though.


I arrived in a small airport at midnight. Served only by Uber. Since I use Lyft elsewhere, my phone had deleted the Uber app. It took 15 minutes to download that: crappy Wifi and some kind of 5G dead zone. Sometimes you really need to download the app.


I had a 200MB data plan until ~ 2018.

I had data turned off most of the time. At home and in the office I had WiFi. Loaded the map before I left home.

Most other places I was too busy doing whatever I was doing to use a phone. Since upgrading, I guess I can look products up in stores now. That's about it.


If you're highly tech literate, you can get by with 4GB or even 3GB.

What you cannot do, contrary to what someone posted in this thread, is get by on 2G. So an ounce of prevention is worth a pound of cure in this case.


Not using it at home likely discounts a lot of personal consumption. If you can get your fill at nights, less need to access the internet during the day.


I've had a 1GB/mo $5/mo plan from good2go for the last 2 years. I've never gone over it. But that's because I go from wifi to wifi all the time and I'm very careful when I'm on cell. That definitely doesn't work for most people!


Yes but isn't it a bit weird to be implying your customers are dogs?


Our customers are morons for using our products and dogs are personable but pretty stupid so yea, makes a lot of sense.


Idk some people love dogs a lot. Maybe more than people!


The average person generally seems less than neutral toward me.

Many people aren’t just openly hostile; they make a point to immediately let you know they aren’t here to help, they’re here to make everyone’s life less pleasant.

With people, there are many scenarios where if you’re out of line or disagree, that’s it. You’re done. They’ll never ever consider you worth any reasonable sort of treatment.

Dogs, by comparison, are angels.


Metaphor confuses, literally.


No. The idea is until it receives the chef’s kiss, it’s dog food.


I think in the analogy that we're the dogs.


Cursor came out 3 years ago. "Agentic" refactors have been a thing for 1.5 years. "Vibecoding" as a term was coined a year ago.

There are multiple companies that deploy to production daily. What are we even talking about?


Right but this agentic stuff was supposed to be the wave where we would finally actually see increased output, so we should probably be seeing it soon if it's real. Like, my dev team should definitely have the actual code they keep talking about their agents making, ready for me to put into production. As should my vendors. Any day now.


What is this nonsense?

You said that none of this was in production and then when people pointed out that it was obviously in production, you shifted the goal post to some other measure that you just imagined in your head.


Well, if it's in production, it's not at my company, any of my vendors, or for that matter any of the software I use in my private life; the pace of all of that is exactly what it was 2 years ago. When it shows up I'll form an opinion.


Let me amend that: one of my vendors has a new diffusion-based noise-reduction plugin that's pretty good, though the resource usage is still too high. I imagine that will come down as they improve it. And that's pretty cool. But it didn't come out any faster; it's just that it uses diffusion in the plugin itself. And Docker had a much bigger impact on the software we use at work than AI has had so far.

I was even trying to come up with a list of software I use in my personal life to see if any of that has started coming out faster, and I came up with:

KDE

Supercollider

Puredata

Mixxx

Renoise

CUDA and ROCm

none of which have had any kind of release acceleration that I know of (though obviously the hardware to use the last two has gotten mind-blowingly expensive, alas). I use maybe three apps on my phone and they aren't updating any more frequently than they used to.

I get that for whatever reason this bugs people, but I'm in a very tech job and have a very tech personal life (just not webdev in either case) and literally have not seen anything I deal with change other than needing to learn to scroll past the AI summary at the top of search results.


What do you expect, that it’s going to announce itself in a modal dialog when you run the software?

This isn’t like AI image generation where you’re going to convince yourself that you can tell the difference based on how you think it looks. Do you really think no one in the production chain of any of the software that you use picked up copilot in the last two years?

What signal are you hoping to receive that this is happening?


Well like I said in the sibling post to this one I'd expect really any of the software vendors in my professional or personal life to release either more rapidly or with a wider array of features than they were a few years ago, and that hasn't been my experience, at all.


The coding was never the slow part.


I'm certainly sympathetic to that argument, but if you scroll way back this thread started with the question of whether or not AI is transformative, and if it is neither faster nor better that would suggest "no".


Pi was probably the best ad for Claude Code I ever saw.

After my Max sub expired I decided to try Kimi on a more open harness, and it ended up being one of the worst (and most eye-opening) experiences I've had with the agentic world so far.

It was completely alienating and so much "not for me" that afterwards I went back and immediately renewed my Claude sub.

https://www.thevinter.com/blog/bad-vibes-from-pi


> I would say that the project actively expects you to be downloading them to fill any missing gaps you might have.

Where did you get this perspective from?

> I thought pi and its tools were supposed to be minimal and extensible. So why is a subagent extension bundling six agents I never asked for that I can’t disable or remove?

Why do you think a random subagents extension is under the same philosophy as pi?

Your blog post says little about pi proper; it's essentially concerned with issues you had with the ecosystem of extensions, often made by random people who may or may not get the philosophy. Why would that be up to pi to enforce?


Sharing extensions is very much the philosophy. Using them however is less so.

Pi ships with docs that include extensions and the agent looks there for inspiration if you ask it to build a custom extension.

Looking at what others publish is useful!


> if I start the agent in ./folder then anything outside of ./folder should be off limits unless I explicitly allow it, and the same goes for bash where everything not on an allowlist should be blocked by default.

Here's the problem with Claude Code: it acts like it's got security, but it's the equivalent of a "do not walk on grass" sign. There's no technical restrictions at play, and the agent can (maliciously or accidentally) bypass the "restrictions".

That's why Pi doesn't have restrictions by default. The logic is: no matter what agent you are using, you should be using it in a real sandbox (container, VM, whatever).


But the agent has to interact with the world: fetch docs, push code, fetch comments, etc. You can't sandbox everything. So you push that configuration to your sandbox, which is a worse UX than the harness just asking you at the right time what you'd like to do.


I too would like to know what a good UX looks like here but I have doubts that the permission prompts of Claude are the way to go right now.

Within days people become used to just hitting accept and allowlisting pretty much everything. The agents write lengthy commands into shell scripts or test runners that can themselves be destructive, but those get immediately allowlisted.


Well, you are imagining a worse UX, but it doesn't have to be. Pi doesn't include a sandboxing story at all (Claude provides an advisory but not mandatory one), but the sandbox doesn't have to be a simple static list of allowed domains/files. It's totally valid to make the "push code" tool in the sandbox send a trigger to code running outside of the sandbox, which then surfaces an interactive prompt to you as a user. That would give you the interactivity you want and be secure against accidentally or deliberately bypassing the sandbox.
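The trigger pattern described above can be sketched in a few lines of Python. This is a toy illustration, not Pi's or Claude Code's actual mechanism: a socketpair stands in for the container boundary, and the interactive prompt is replaced by a hard-coded policy so it runs unattended. All names are made up.

```python
import json
import socket
import threading

def host_approver(conn, decide):
    """Runs OUTSIDE the sandbox: receives one tool request and answers
    allow/deny. In real life `decide` would surface an interactive prompt."""
    req = json.loads(conn.recv(4096).decode())
    conn.sendall(json.dumps({"allow": decide(req)}).encode())

def sandboxed_tool(conn, action, detail):
    """Runs INSIDE the sandbox: cannot touch the network or the repo
    directly, it can only ask the host over the pre-opened channel."""
    conn.sendall(json.dumps({"action": action, "detail": detail}).encode())
    return json.loads(conn.recv(4096).decode())["allow"]

# Wire the two sides together; auto-approve pushes, deny everything else.
inside, outside = socket.socketpair()
policy = lambda req: req["action"] == "push_code"
t = threading.Thread(target=host_approver, args=(outside, policy))
t.start()
allowed = sandboxed_tool(inside, "push_code", "branch=main")
t.join()
print(allowed)  # True
```

The key property: even a malicious agent inside the sandbox can only send requests over the channel; it cannot reach around the approver the way it can ignore an advisory allowlist.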


So you have to set up that integration instead of letting the agent do it. I suppose the sandbox is more configurable, but do you need that? I thought the draw of pi was that you didn't do all that and let it fly, wheeee!

edit: You're not making it sound easy at all. I don't have to build anything with the other agents.


Certainly not. Pi is "minimalist", so the draw is that it's "easy" to set it up yourself. You can skip that and run it in yolo mode, and you can do that with Claude Code too. Heck, you could even use this hypothetical real-sandbox-with-interactive-prompts with Claude Code instead, once you build it.

Back to my original point: Claude Code gives you a false feeling of security, Pi gives you the accurate feeling of not having security.


I had a very similar experience. I have different preferences, but ultimately, my takeaway was that if I want to follow my own version of their philosophy, I should just create my own thing.

In the meantime, the codex/cc defaults are better for me.


Paraphrasing The Dude, that’s like, just your opinion, man.


> As it turns out, the opinions in question are that bash should be enabled by default with no restrictions, that the agent should have access to every file on your machine from the start, and that npm is the only package manager worth supporting.

Yep. This is why I've been going "Hell, no!" and will probably keep doing so.


Technically you're not allowed to use a Claude subscription account with Pi (according to Anthropic's policy). So yeah, Pi is the best anti-ad for Anthropic.


hypegrift


Are you intentionally keeping the benchmarks private?


Yes.

I am trying to figure out the best way to give the most information about how the AI models fail, without revealing information that could help them overfit to those specific tests.

I am planning to add some extra LLM calls, to summarize the failure reason, without revealing the test.


We're building an app that automatically generates machine/human-readable JSON by parsing semantic HTML tags; using a reverse proxy, we then serve that JSON to agents instead of the HTML.
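The parsing half of that can be sketched with the stdlib `html.parser` module. This is my own toy sketch, not the product's code; the tag set and the JSON shape are guesses:

```python
import json
from html.parser import HTMLParser

# Assumed set of "semantic" tags worth surfacing to an agent.
SEMANTIC = {"article", "nav", "section", "h1", "h2", "p"}

class SemanticExtractor(HTMLParser):
    """Collect the text found inside semantic tags into a flat list of
    {tag, text} records that an agent can consume as JSON."""
    def __init__(self):
        super().__init__()
        self.stack, self.records = [], []

    def handle_starttag(self, tag, attrs):
        if tag in SEMANTIC:
            self.stack.append({"tag": tag, "text": ""})

    def handle_data(self, data):
        if self.stack:
            self.stack[-1]["text"] += data.strip()

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1]["tag"] == tag:
            self.records.append(self.stack.pop())

def html_to_json(html: str) -> str:
    p = SemanticExtractor()
    p.feed(html)
    return json.dumps(p.records)

print(html_to_json("<article><h1>Hi</h1><p>Body text</p></article>"))
```

A reverse proxy would then run something like `html_to_json` on the upstream response and return it with `Content-Type: application/json` when the client identifies as an agent.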

