NHacker Next
▲ Pitfalls of premature closure with LLM assisted coding (shayon.dev)
71 points by shayonj 3 days ago | 38 comments
stillpointlab 4 hours ago [-]
My experience is that AIs amplify what you put in them.

If you put in lazy problem definitions, provide the bare minimum context and review the code cursorily then the output is equally lackluster.

However, if you spend a good amount of time describing the problem, carefully construct a context that includes examples, documentation and relevant files and then review the code with care - you can get some very good code out of them. As I've used them more and more I've noticed that the LLM responds in a thankful way when I provide good context.

> Always ask for alternatives

> Trust but verify

I treat the AI as I would a promising junior engineer. And this article is right, you don't have to accept the first solution from either a junior engineer or the AI. I constantly question the AI's decisions, even when I think they are right. I just checked AI studio and the last message I sent to Gemini was "what is the reasoning behind using metadata in this case instead of a pricing column?" - the context being a db design discussion where the LLM suggested using an existing JSONB metadata column rather than using a new column. Sometimes I already agree with the approach, I just want to see the AI give an explanation.
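For illustration, the two shapes on the table looked roughly like this - a hypothetical SQLAlchemy sketch with made-up table and column names, not our actual schema:

```python
# Hypothetical sketch of the two options: pricing inside an existing JSONB
# metadata column vs. a dedicated typed column. Names are illustrative only.
from sqlalchemy import Column, Integer
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class PlanWithMetadata(Base):
    """Option A: tuck pricing into the existing JSONB metadata column."""
    __tablename__ = "plans_a"
    id = Column(Integer, primary_key=True)
    # e.g. {"price_cents": 1999, "currency": "USD", ...}
    metadata_ = Column("metadata", JSONB)

class PlanWithPriceColumn(Base):
    """Option B: a dedicated, typed, constrainable pricing column."""
    __tablename__ = "plans_b"
    id = Column(Integer, primary_key=True)
    price_cents = Column(Integer, nullable=False)
```

Asking the model to argue for A over B (or vice versa) is exactly the kind of reasoning check I mean.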

And on the trust front, often I let the AI coding agent write the code how it wants to write it rather than force it to write it exactly like I would, just like I would with a junior engineer. Sometimes it gets it right and I learn something. Sometimes it gets it wrong and I have to correct it. I would estimate 1 out of 10 changes has an obvious error/problem that requires me to intervene.

I think of it this way: I control the input and I verify the output.

0x457 2 hours ago [-]
> If you put in lazy problem definitions, provide the bare minimum context and review the code cursorily then the output is equally lackluster.

I thought so too, but sometimes I had better results with a one-sentence prompt (plus README.md) where it delivered the exact thing I wanted. I also had a very detailed prompt with multiple subtasks, all very detailed, plus README.md and AGENTS.md, and the results were very poor.

stillpointlab 2 hours ago [-]
This is true in my experience but it doesn't go against my larger point. Choosing the "goldilocks" context is a bit of an art, not too big not too small. It reminds me of a famous witty quote [1]: “I apologize for such a long letter - I didn't have time to write a short one.”

If you send too much info at once it does seem to confuse the agent, just like if you ask it to do too much all at once. That is yet another property it shares with a junior engineer. It is easy to overwhelm a new contributor to a project with too much information, especially if it isn't strictly relevant.

daveguy 2 hours ago [-]
Also, regardless of the prompt they only get ~80% accuracy on coding benchmarks. So even with the absolute perfect prompt incantation, you can expect them to fail about 1 out of 5 times.
deadbabe 2 hours ago [-]
The point of LLMs is to not spend a lot of effort. Zero-shot prompts are the ideal we have to work toward. There comes a point where you have to do so much work just to get a good output that LLMs cease to be more productive than just writing something out yourself.

If it cannot give you a good output with very little prompting, it’s a sign your problem probably isn’t something well known and it probably needs a human touch.

bluefirebrand 31 minutes ago [-]
> There comes a point where you have to do so much work just to get a good output, that LLMs cease to be more productive than just writing something out yourself.

I think this gets to the core of the problem with LLM workflows and why there are so many disagreements about effectiveness

Maybe I overestimate my skills or underestimate how long things would take, but I am constantly feeling like when I try to use AI it takes more time, not less

My suspicion is that if you could create a second version of me, give one copy an LLM and have the other solve the problem normally, the unassisted copy would finish first

But many people love these tools and feel more productive, so what gives? The problem is that it's impossible to really measure, because we don't have convenient parallel-universe clones to test against. It's all just vibes and made up numbers

stillpointlab 2 hours ago [-]
I could not disagree more. I don't think you are wrong, I just choose a different approach.

There seem to be multiple approaches to working with LLMs. My own personal experience has been that carefully explaining my request, providing specific and highly relevant context (and avoiding irrelevant and distracting context) has led to significant productivity on my side. That is, it may take me 15 minutes to prepare a really good prompt but the output can save me hours of work. Conversely, if I fire off a bunch of low-effort prompts I get poor results and I end up spending a lot of time in back-and-forth with the LLM and a lot of time fixing up its output.

deadbabe 2 hours ago [-]
If I think carefully about a problem for 15 minutes I generally can also save hours of work.
jbellis 6 hours ago [-]
I think this is correct, and I also think it holds for reviewing human-authored code: it's hard to do the job well without first having your own idea in your head of what the correct solution looks like [even if that idea is itself flawed].
danielbln 10 hours ago [-]
I put the examples he gave into Claude 4 (Sonnet), purely asking it to eval the code, and it pointed out every single issue with the code snippets (N+1 query, race condition, memory leak). The article doesn't mention which model was used, or how exactly it was used, or in which environment/IDE it was used.

The rest of the advice in there is sound, but without more specifics I don't know how actionable the section "The spectrum of AI-appropriate tasks" really is.

metalrain 6 hours ago [-]
It's not about "model quality". Most models can improve their code output when asked, but the problem is the lack of introspection by the user.

Basically the same problem as copy-paste coding, but an LLM can (sometimes) use your exact variable names and types, so it's easier to forget that you need to understand and check the code.

shayonj 9 hours ago [-]
My experience hasn't changed between models, given the core issue mentioned in the article. Primarily I have used Gemini and Claude 3.x and 4. Some GPT 4.1 here and there.

All via Cursor, some internal tools and Tines Workbench

soulofmischief 5 hours ago [-]
My experience changes just throughout the day on the same model, it seems pretty clear that during peak hours (lately most of the daytime) Anthropic is degrading their models in order to meet demand. Claude becomes a confident idiot and the difference is quite noticeable.
drewnick 3 hours ago [-]
I too have noticed variability, and it's impossible to know for sure, but late one Friday or Saturday night (PST) it seemed to be brilliant, several iterations in a row. Some of my best output has been in very short windows.
genewitch 5 hours ago [-]
this is on paid plans?
soulofmischief 3 hours ago [-]
This is through providers such as Cursor, but the consistency of this experience has put me off from directly subscribing to Anthropic since I'm already subscribed up to my eyeballs in various AI services.

Last I'd checked, Anthropic would not admit that they were degrading models for obvious scummy business reasons, but they are probably quantizing them, reducing beam search, lowering precision/sampling, etc., because the model goes from being superpowered to completely unusable, constantly dropping code and mangling files, getting caught in loops, doing the weirdest detours, and sometimes completely ignoring my instructions from just one message prior.

At first I wondered if Cursor was mishandling the context, and while they indeed aren't doing the best with context stuffing, the rest of the issues are not context-related.

lazy_moderator1 5 hours ago [-]
did it detect n+1 in the first one, race condition in the second one and memory leak in the third one?
danielbln 4 hours ago [-]
It did, yeah.
lubujackson 5 hours ago [-]
My experience with LLMs currently is that they can handle any level of abstraction and focus, but you have to discern the "layer" to isolate and resolve.

The next improvement may be something like "abstraction isolation" but for now I can vibe code a new feature which will produce something mediocre. Then I ask "is that the cleanest approach?" and it will improve it.

Then I might ask "is this performant?" Or "does this follow the structure used elsewhere?" Or "does this use existing data structures appropriately?" Etc.

Much like the blind men describing an elephant, they all might be right but collectively still be wrong. Newer, slower models are definitely better at this, but I think rather than throwing infinite context at problems, if they were designed with a more top-down architectural view and a checklist of competing concerns we might get a lot further in less time.

This seems to be how a lot of people are using them effectively right now - create an architecture, implement piecemeal.

vjerancrnjak 5 hours ago [-]
I’ve found it to never be able to produce the cleanest approach. I can spend 5 hours and get something very simple and I can even give it that as an example and it cannot extrapolate.

It can’t even write algorithms that rely on the fact that something is sorted. It needs intermediate glue that is not necessary, etc. massive noise.

Tried single allocation algorithms, can’t do that. Tried guiding to exploit invariants, can’t find single pass workflows.

The public training data is just bad and it can’t really understand what actually good is.
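A trivial example of the kind of routine I mean - a single pass over two already-sorted lists, exploiting the sort order, with no re-sort and no intermediate glue (illustrative Python, not taken from any real attempt):

```python
# Single-pass intersection of two already-sorted integer lists using two
# pointers; relies on the sortedness instead of re-sorting or hashing.
from typing import List

def sorted_intersection(a: List[int], b: List[int]) -> List[int]:
    out: List[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# sorted_intersection([1, 3, 5, 7], [3, 4, 5, 8]) -> [3, 5]
```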

bongodongobob 4 hours ago [-]
Define clean, what does that even mean? If you tell it to write efficient code for example, how would it know whether you're talking about caching, RAM, Big O, I/O etc unless you tell it?
nonethewiser 5 hours ago [-]
This has been my experience too. It's largely about breaking things down into smaller problems. LLMs just stop being effective when the scope gets too large.

Architecture documentation is helpful too, as you mentioned. It is basically a set of rules and intentions. It's kind of a compressed version of your codebase.

Of course, this means the programmer still has to do all the real work.

cryptonym 5 hours ago [-]
Sounds like coding with extra steps.
nonethewiser 4 hours ago [-]
What is the extra step? You have to do the upfront legwork either way.
bluefirebrand 3 hours ago [-]
In my experience when you have a problem that is small enough for an LLM to solve, you could just write the code directly. You don't have to produce a detailed spec first

If the LLM needs a detailed spec to solve the same problem then you're doing unnecessary work to produce the spec for the LLM first

th0ma5 3 hours ago [-]
This has been the problem with higher level natural language programming for years. I really wonder what people are doing if they don't see this core issue that precludes their use.
bluefirebrand 2 hours ago [-]
It makes me wonder if some people writing code just cannot think in terms of code?

I imagine it is very slow if you always have to think in a human language and then translate each step into programming language

When people describe being in flow state, I think what is happening is they are more or less thinking directly in the programming language they are writing. No translation step, just writing code

LLM workflows completely remove the ability to achieve that imo

mrklol 5 hours ago [-]
"But what if the real issue is an N+1 query pattern that the index would merely mask? What if the performance problem stems from inefficient data modeling that a different approach might solve more elegantly?"

In the best case you would have to feed every important piece of information into the context. These are my indexes, this is my function, these are my models. After that the model can find problematic code. So the main problem is to give your model all of the important information; that has to be fixed if the user isn't doing that part. (Obviously that doesn't mean the LLM will find the actual problem, but it can improve the results.)
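To make the quoted concern concrete, here is a rough sketch; the `db` helper, table names, and columns are invented for illustration:

```python
# Rough sketch of the N+1 shape the quoted passage worries about. `db` is a
# hypothetical wrapper exposing query(sql, *params) -> list[dict].
from typing import Any, Dict, List

def order_totals_n_plus_one(db: Any) -> Dict[int, float]:
    orders: List[dict] = db.query("SELECT id FROM orders")  # 1 query
    totals: Dict[int, float] = {}
    for o in orders:  # + N more queries, one per order
        items = db.query(
            "SELECT price FROM order_items WHERE order_id = %s", o["id"]
        )
        totals[o["id"]] = sum(i["price"] for i in items)
    return totals

def order_totals_one_query(db: Any) -> Dict[int, float]:
    # One aggregate query. An index on order_items.order_id would only make
    # the N+1 version above cheaper per round trip, not remove the round trips.
    rows = db.query(
        "SELECT o.id, COALESCE(SUM(i.price), 0) AS total "
        "FROM orders o LEFT JOIN order_items i ON i.order_id = o.id "
        "GROUP BY o.id"
    )
    return {r["id"]: r["total"] for r in rows}
```

Without the models and indexes in context, a model suggesting "add an index" can look reasonable while the first function is the real problem.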

nonethewiser 5 hours ago [-]
This largely seems like an alternative way of saying "you have to validate the results of an LLM." Is there any "premature closure" risk if you simply validate the results?

Premature closure is definitely a risk with LLMs but I think code is much less at risk because you can and SHOULD test it. But it's a bigger problem for things you can't validate.

I might start calling this "the original sin" with LLMs... not validating the output. There are many problems people have identified with using LLMs and perhaps all of them come back to not validating.

bwfan123 5 hours ago [-]
> I might start calling this "the original sin" with LLMs... not validating the output.

I would rephrase it - the original sin of LLMs is not "understanding" what they output. By "understanding" I mean the "why" of the output - starting from the original problem and reasoning through to the output solution, i.e., the causal process behind the output. What LLMs do is pattern match the most "plausible output" to the input. But the output is not born out of a process of causation - it is born out of pattern matching.

Humans can find meaning in the output of LLMs, but machines choke on it - which is why LLM code looks fine at first glance until someone tries to run it. Another way to put it: LLMs sound persuasive to humans, but at the core they are rote students who don't understand what they are saying.

nonethewiser 4 hours ago [-]
I agree that it's important to understand they are just predicting what you should expect to see, and that is a more fundamental truth about an LLM itself. But I'm thinking more in terms of using LLMs. That distinction doesn't really matter if the tests pass.
wizzwizz4 4 hours ago [-]
http://thecodelesscode.com/case/58
MontagFTB 6 hours ago [-]
This isn't new. Have we not seen this everywhere already? The example at the top of the article (in a completely different field, no less) just goes to show humans had this particular sin nailed well before AI came along.

Bloated software and unstable code bases abound. This is especially prevalent in legacy code whose maintenance is handed down from one developer to the next, where their understanding of the code base differs from their predecessor's. Combine that with pressures to ship now vs. getting it right, and you have the perfect recipe for an insidious form of technical debt.

suddenlybananas 10 hours ago [-]
I initially thought the layout of the sections was an odd and terrible poem.
tempodox 6 hours ago [-]
Now that you mention it, me too.
mock-possum 6 hours ago [-]
Oh wow me too - I kinda like it that way.

But if it’s meant to be a table of contents, it really should be styled like a list, rather than a block quote.

shayonj 9 hours ago [-]
haha! I didn't see it that way originally. Shall take it as a compliment and rework that ToC UI a bit :D.
shayonj 6 hours ago [-]
ok ok - i put in a little touch up :D