'In some iterations, coding agent put on a hat of security engineer. For instance - it created a hasMinimalEntropy function meant to "detect obviously fake keys with low character variety". I don't know why.'
Yes, you do know why: somewhere in its training data, that kind of check was linked to "quality" or "improvement". Remember what these things are at their core: really good auto-complete.
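For context, a "low character variety" check is usually nothing more than a distinct-character count. The post doesn't show the agent's actual code, so the sketch below is only a guess at what a hasMinimalEntropy function like that typically looks like (assuming a TypeScript codebase, which the camelCase name suggests):

```typescript
// Hypothetical reconstruction: the original post only names the function and its
// stated purpose; this body is an assumption about what such a check usually does.
function hasMinimalEntropy(key: string, minUniqueChars = 6): boolean {
  // Count distinct characters: "aaaaaaaaaaaa" or "12121212" scores low.
  const uniqueChars = new Set(key).size;
  return uniqueChars >= minUniqueChars;
}

// It "detects obviously fake keys", but says nothing about whether a key is real:
hasMinimalEntropy("aaaaaaaaaaaaaaaa");       // false -- caught
hasMinimalEntropy("sk-totally-fake-key-42"); // true  -- waved through
```

Real key validation would check the expected format and length, or ask the provider; counting unique characters mostly catches placeholder strings. It's exactly the kind of change that pads a "quality" changelog without changing anything real.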
'The prompt, in all its versions, always focuses on us improving the codebase quality. It was disappointing to see how that metric is perceived by AI agent.'
Really? It's disappointing to see how that metric is perceived by humans, and the AIs are trained on what humans wrote. If people can't agree on "codebase quality", especially the ones who write loudly about it on the internet, AI agents aren't going to agree either. A prompt that actually specified what _you_ consider an improvement would have gone much further: perhaps minimize 3rd-party deps, or minimize local utils that reimplement existing 3rd-party libs, or add stricter typechecks.
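To make that concrete, here is one hypothetical way such a prompt could be scoped; the specific bullets are my own examples of the direction, not anything from the original prompt:

```
Improve codebase quality. "Quality" here means, in priority order:
1. Replace local utility code that duplicates an already-installed dependency
   with calls to that dependency.
2. Drop 3rd-party dependencies that are used in one place and are trivially
   replaceable with standard-library code.
3. Tighten typechecking (no new `any`, no suppressed type errors).
Do NOT add new dependencies, new tests, or new metrics unless required by 1-3.
```

With an explicit definition and an explicit "do not" list, "more is better" stops being the path of least resistance.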
'The leading principle was to define a few vanity metrics and push for "more is better".'
Yeah, because this is probably the most common thing it saw in training. Programmers who actually improve codebase quality do it quietly, while the ones shouting on the internet (hence into the training data) about how their [bad] techniques [appear to] improve quality are also the ones picking vanity metrics and pushing for "more is better".
'I've prompted Claude Code to failure here'
Not really a failure: it did exactly what you asked and improved "codebase quality" as its training data defines it. If you _required_ a human engineer to do the same thing 200 times, you'd get similar results as they run out of real improvements and start scouring the web for anything anybody ever called an "improvement", which very definitely includes vanity metrics and "more is better" applied to test count and coverage. You just showed that these AIs aren't much more than their training data. It isn't actually thinking about quality; it's just barfing up things it has seen labeled "codebase quality improvements", regardless of whether those changes improve anything.