This specify-encode-fulfill loop is an effective way to get agents to write bug-free code.
In my version of this workflow I write the spec myself, then let the LLM do the rest.
This way: 1) I'm 100% sure the understanding/spec is good; 2) it's translated into an executable format, so the implementation can be verified; 3) the implementation comes with maximum-coverage tests, which steer the AI toward code that follows standards, fits into the existing codebase, and is very easy to refactor.
So far, this is the one and only advantage of using LLMs in my SWE practice. They glue together (human-written) specs with code, with confidence, in no time.
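To make the loop concrete, here's a minimal sketch of what I mean, not a definitive recipe. Everything in it is made up for illustration: the `slugify` helper, its spec, and the `run_spec` wrapper are hypothetical and don't come from any real project. The human writes the spec as plain statements, the spec gets encoded as tests, and the implementation only counts as done when the encoded spec passes.

```python
import re
import unittest

# Step 1 (specify, human-written). Hypothetical spec for a `slugify` helper:
#   - the output is lowercase
#   - every run of non-alphanumeric ASCII characters becomes a single "-"
#   - the output has no leading or trailing "-"
#   - input with no alphanumeric characters yields ""


# Step 3 (fulfill): the kind of implementation the LLM would be asked to write.
# It is defined before the tests only so the file reads top to bottom.
def slugify(text: str) -> str:
    lowered = text.lower()
    collapsed = re.sub(r"[^a-z0-9]+", "-", lowered)
    return collapsed.strip("-")


# Step 2 (encode): each line of the spec becomes one executable check.
class SlugifySpec(unittest.TestCase):
    def test_output_is_lowercase(self):
        self.assertEqual(slugify("Hello"), "hello")

    def test_runs_of_symbols_collapse_to_one_dash(self):
        self.assertEqual(slugify("a  ,, b"), "a-b")

    def test_no_leading_or_trailing_dash(self):
        self.assertEqual(slugify("  Hello, World!  "), "hello-world")

    def test_symbol_only_input_is_empty(self):
        self.assertEqual(slugify("  --  "), "")


def run_spec() -> bool:
    """Run the encoded spec and report whether the implementation fulfills it."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlugifySpec)
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()


if __name__ == "__main__":
    print("spec fulfilled" if run_spec() else "spec violated")
```

The point of the split is that only step 1 needs a human. Steps 2 and 3 are the glue work, and the test class is what keeps the generated code honest.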
I'm not an ML expert, but regarding code _quality_ I've seen no progress at all over the last couple of years. LLMs still write code by probabilistic calculation rather than by applying rigorous thinking and logic.
This is fine only as long as no one has to look under the hood. When you try to understand and fix code written by LLMs, you realize what a mess they produce. It's a codebase without any systematic thinking inside. Everything is ad hoc, wired together to pass the tests and to conform to some templates. No deliberate practice, no intelligence at all in the code.
This can't be a long term strategy for an entire industry.
We're going to need some mid-level representation of what software is trying to do. Formal specs? UML? Semi-formal specs in natural language? Design rules?
People hate updating such representations, but AIs don't have that problem.
We _know_ LLMs can't be _as_ good as they are promoted to be.
I've spent the last 6 months creating a production-grade app from scratch with Claude, without writing a single line of code myself. I reviewed the code and it looked good, almost completely following my templates, workflows, and skills.
Now I've started to make minor manual updates and I'm horrified. Claude had no idea why those templates and instructions were in place. It followed them blindly without grasping their spirit. The end result is like a very junior dev copy-pasting answers from Stack Overflow into the codebase. No consistency, chaotic application of different conventions, duplicated code, ghost code (does nothing), and perhaps more as I'm digging in.
The pros: The code works, all tests pass (43% code / 57% tests, 1:1.3 ratio), the UI looks good with visible glitches.
The cons: In the long run I'll have to rewrite most of the code to make it fit together and be easy to maintain.
The verdict: I wouldn't have started this project alone. Claude got me through to v0.1.0 / MVP, while I focused solely on the product: technologies, architecture, functionality, and usability. Now it's easier to refactor everything for v0.2.0 manually, without Claude.
So this might be our gut feeling: we know it's something good, but not as good as the stakeholders promote it to be. We know it helps in some ways, but it's a nightmare in others.
We are not anti-AI, just pragmatic: not the AI enthusiasts we are expected to be.
> No consistency, chaotic application of different conventions, duplicated code, ghost code (does nothing), and perhaps more as I'm digging in.
I didn’t understand this part. You said you reviewed the code and it was looking good, so how did the cruft creep in?
Were you reviewing every diff, or taking an occasional sample?
Reviewing code is a very different mindset from writing it yourself. You don't have all the context you would have built up had you written it, and it's much, much more difficult to think through all the cases. So I'm thinking: the individual changes all looked good in isolation, and they started borderline rubber-stamping the changes without stepping back to think about the larger context.
Looking at individual changes in isolation, it's harder to see that the code doesn't match other conventions, duplicates code, removes or disables paths without cleaning up, etc. I'll bet there's also some crazy spaghetti code in there, judging from my experience helping a co-worker clean up AI-generated code that they didn't understand.