> [...] recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs) -- LLMs fine-tuned with incentives for step-by-step argumentation and self-verification
This was the obvious outcome of the study (don't get me wrong, obvious outcomes are still worth researching).
"LRMs" *are* just LLMs. There's no such thing as a reasoning model, it's just having an LLM write a better prompt than the human would and then sending it to the LLM again.
Despite what Amodei and Altman want Wall Street to believe, they did not suddenly unlock reasoning capabilities in LLMs by essentially running two different prompts in sequence to answer the user's question.
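Concretely, this is the shape of the pipeline I'm describing. A rough sketch only; `call_llm` and `answer_with_reasoning` are made-up names standing in for whatever completion API you use, not anyone's actual implementation:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a single LLM completion call."""
    raise NotImplementedError("swap in your LLM API of choice")


def answer_with_reasoning(question: str) -> str:
    # Pass 1: ask the model to expand the question into its own working notes
    # (the "better prompt than the human would write").
    notes = call_llm(
        "Think step by step about the following question and write out "
        f"your full reasoning before giving any answer:\n\n{question}"
    )
    # Pass 2: feed those notes back to the same model and ask for the answer.
    return call_llm(
        f"Question: {question}\n\nReasoning notes:\n{notes}\n\n"
        "Using the notes above, give the final answer only."
    )
```

Two calls to the same model, the second conditioned on the first's output. That's the whole trick.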
The truly amazing thing is that reasoning models show ANY improvement at all over non-reasoning models, given that they're the exact same thing.