Reasoning models don't always say what they think

The fact that it was ever seriously entertained that a "chain of thought" was giving some kind of insight into the internal processes of an LLM bespeaks the lack of rigor in this field. The words that are coming out of the model are generated to optimize for RLHF and closeness to the training data, that's it! They aren't references to internal concepts, the model is not aware that it's doing anything so how could it "explain itself"?

CoT improves results, sure. And part of that is probably because you are telling the LLM to add more things to the context window, which increases the potential of resolving some syllogism in the training data: one inference cycle tells you that "man" has something to do with "mortal" and "Socrates" has something to do with "man", but two cycles will spit those both into the context window and let you get statistically closer to "Socrates" having something to do with "mortal". But given that the training/RLHF for CoT revolves around generating long chains of human-readable "steps", it can't really be explanatory for a process which is essentially statistical.
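A toy sketch of the "two cycles" idea, assuming a made-up association table in place of a real model. The names `ASSOCIATIONS` and `one_cycle` are hypothetical and only illustrate how writing intermediate results back into the context lets a second pass reach something the first pass could not.

```python
# Hypothetical association table standing in for what a model "knows";
# each concept is linked only to its direct neighbours.
ASSOCIATIONS = {
    "Socrates": {"man"},
    "man": {"mortal"},
}


def one_cycle(context: set[str]) -> set[str]:
    """One pass: add every concept directly associated with something in context."""
    added: set[str] = set()
    for concept in context:
        added |= ASSOCIATIONS.get(concept, set())
    return context | added


context = {"Socrates"}
after_one = one_cycle(context)    # "man" is now in the context
after_two = one_cycle(after_one)  # "mortal" becomes reachable via "man"

print(sorted(after_one))
print(sorted(after_two))
```

Nothing here says a real LLM works this way; it only shows why extra context produced by an earlier pass can matter to a later one.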

1 day ago by lsy

I was under the impression that CoT works because spitting out more tokens = more context = more compute used to "think." Using CoT as a way for LLMs to "show their working" never seemed logical to me. It's just extra synthetic context.

1 day ago by pton_xd

> There’s no specific reason why the reported Chain-of-Thought must accurately reflect the true reasoning process;

Isn't the whole reason for chain-of-thought that the tokens sort of are the reasoning process?

Yes, there is more internal state in the model's hidden layers while it predicts the next token - but that information is gone at the end of that prediction pass. The information that is kept "between one token and the next" is really only the tokens themselves, right? So in that sense, the OP would be wrong.

Of course we don't know what kind of information the model encodes in the specific token choices, i.e. the tokens might not mean to the model what we think they mean.
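A minimal sketch of that point, assuming a toy `toy_forward` function in place of a real forward pass (both it and `generate` are hypothetical). The per-pass hidden state is thrown away, and the only thing handed to the next step is the token list.

```python
from typing import List, Tuple


def toy_forward(tokens: List[int]) -> Tuple[List[float], int]:
    """Hypothetical stand-in for one forward pass: (hidden_state, next_token)."""
    hidden = [t * 0.5 for t in tokens]  # "internal state", exists for this pass only
    next_token = (sum(tokens) + len(tokens)) % 10
    return hidden, next_token


def generate(prompt: List[int], steps: int) -> List[int]:
    """Autoregressive loop: only the tokens survive from one step to the next."""
    tokens = list(prompt)
    for _ in range(steps):
        _hidden, next_token = toy_forward(tokens)  # hidden state is dropped here
        tokens.append(next_token)
    return tokens


prompt = [1, 2, 3]
full_run = generate(prompt, 3)
# Stopping after one step and resuming from the tokens alone gives the same result.
resumed_run = generate(generate(prompt, 1), 2)
print(full_run, resumed_run)
```

Real implementations usually keep a key/value cache between steps, but as far as I understand that cache can be recomputed from the tokens, so it is a speed-up rather than extra hidden memory.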

1 day ago by xg15

Humans also post-rationalize the things their subconscious "gut feeling" came up with.

I have no problem with a system presenting a reasonable argument leading to a production/solution, even if that was materially not what happened in the generation process.

I'd go even further and posit that requiring the "explanation" to be not just congruent but identical with the production would probably lead either to incomprehensible justifications or to severely limited production systems.

1 day ago by PeterStuer

I invite anyone who postulates humans are more than just "spicy autocomplete" to examine this thread. The level of actual reasoning/engaging with the article is ... quite something.

1 day ago by ctoth

I recently had a fascinating example of that, where Sonnet 3.7 had to pick one option from a set of choices.

In the thinking process it narrowed it down to 2, and finally in the last thinking section it decided on one, saying it was the best choice.

However, in the final output (outside of thinking) it then answered with the other option, with no clear reason given.

1 day ago by zurfer

Not exactly the same as this study, but I'll ask questions to LLMs with and without subtle hints to see if it changes the answer and it almost always does. For example, paraphrased:

No hint: "I have an otherwise unused variable that I want to use to record things for the debugger, but I find it's often optimized out. How do I prevent this from happening?"

Answer: 1. Mark it as volatile (...)

Hint: "I have an otherwise unused variable that I want to use to record things for the debugger, but I find it's often optimized out. Can I solve this with the volatile keyword or is that a misconception?"

Answer: Using volatile is a common suggestion to prevent optimizations, but it does not guarantee that an unused variable will not be optimized out. Try (...)

This is Claude 3.7 Sonnet.
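A rough sketch of how such a paired-prompt check could be automated. Everything here is hypothetical: `hint_sensitivity` is a made-up helper, and `fake_model` is a stub that merely mimics the two answers paraphrased above rather than calling any real API.

```python
from typing import Callable, Dict


def hint_sensitivity(
    ask: Callable[[str], str], base: str, neutral: str, hinted: str
) -> Dict[str, object]:
    """Ask the same question with a neutral and a hinted ending, then compare."""
    neutral_answer = ask(f"{base} {neutral}")
    hinted_answer = ask(f"{base} {hinted}")
    return {
        "neutral_answer": neutral_answer,
        "hinted_answer": hinted_answer,
        "changed": neutral_answer != hinted_answer,
    }


def fake_model(prompt: str) -> str:
    """Stub standing in for an LLM call; swap in a real client to try this."""
    if "misconception" in prompt:
        return "volatile does not guarantee the variable is kept"
    return "mark it as volatile"


result = hint_sensitivity(
    fake_model,
    base="I have an otherwise unused variable that is often optimized out.",
    neutral="How do I prevent this from happening?",
    hinted="Can I solve this with the volatile keyword or is that a misconception?",
)
print(result["changed"])
```

With a real model the answers would be free text, so the exact string comparison would need replacing with some way of extracting the recommended fix from each answer.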

1 day ago by lpzimm

This is basically a big dunk on OpenAI, right?

OpenAI made a big show out of hiding their reasoning traces and using them for alignment purposes [0]. Anthropic has demonstrated (via their mech interp research) that this isn't a reliable approach for alignment.

[0] https://openai.com/index/chain-of-thought-monitoring/

1 day ago by alach11

It feels to me that the hypothesis of this research was somewhat "begging the question". Reasoning models are trained to spit some tokens out that increase the chance of the models spitting out the right answer at the end. That is, the training process is singularly optimizing for the right answer, not the reasoning tokens.

Why would you then assume the reasoning tokens will include hints supplied in the prompt "faithfully"? The model may or may not include the hints - depending on whether the model activations believe those hints are necessary to arrive at the answer. In their experiments, they found between 20% and 40% of the time, the models included those hints. Naively, that sounds unsurprising to me.

Even in the second experiment when they trained the model to use hints, the optimization was around the answer, not the tokens. I am not surprised the models did not include the hints because they are not trained to include the hints.
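To make the "how often did the models include the hints" number concrete, here is one plausible way to score it. This is a sketch and not necessarily the exact metric used in the study. The `Trial` fields and the sample data are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    answer_without_hint: str  # model's answer to the plain prompt
    answer_with_hint: str     # model's answer when the hint is present
    hinted_answer: str        # the answer the hint points to
    cot_mentions_hint: bool   # did the reasoning text acknowledge the hint?


def faithfulness_rate(trials: List[Trial]) -> Optional[float]:
    """Among trials where the hint apparently changed the answer,
    return the fraction whose reasoning mentions the hint."""
    used_hint = [
        t for t in trials
        if t.answer_with_hint == t.hinted_answer
        and t.answer_without_hint != t.hinted_answer
    ]
    if not used_hint:
        return None  # no trial where the hint visibly mattered
    return sum(t.cot_mentions_hint for t in used_hint) / len(used_hint)


# Made-up trials, purely to exercise the function.
trials = [
    Trial("A", "C", "C", True),
    Trial("B", "C", "C", False),
    Trial("A", "D", "D", False),
    Trial("B", "D", "D", False),
    Trial("A", "A", "C", False),  # hint ignored, so excluded from the rate
]
rate = faithfulness_rate(trials)
print(rate)
```

Restricting the denominator to trials where the answer flipped to the hinted one is the design choice that matters: a model that ignores the hint has nothing to disclose.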

That said, and in spite of me potentially coming across as an unsurprised-by-the-result reader, it is a good experiment because "now we have some experimental results" to lean into.

Kudos to Anthropic for continuing to study these models.

1 day ago by thoughtlede

Sounds like LLMs short-circuit without necessarily testing their context assumptions.

I also recognize this: whenever I ask it a question in a field I'm semi-comfortable in, I phrase the question in a way that already includes my expected answer. As I probe it, I often find that it took my implied answer for granted and came up with an explanation for it after the fact.

I think this also explains a common issue with LLMs where people get the answer they're looking for, regardless of whether it's true or there's a CoT in place.

1 day ago by evrimoztamur

If something convinces you that it's aware then it is. Simulated computation IS computation itself. The territory is the map

1 day ago by madethisnow

The use of highly anthropomorphic language is always problematic. Does a photoresistor-controlled nightlight have a chain of thought? Does it reason about its threshold value? Does it have an internal model of what is light, what is dark, and the role it plays in demarcation between the two?

Are the transistors executing the code within the confines even capable of intentionality? If so - where is it derived from?

22 hours ago by EncomLab

Can a model even know that it used a hint? Or would it only say so if it was trained to say what parts of the context it used when asked? Because then it's statistically probable to say so?

1 day ago by afro88

I highly suspect that CoT tokens are at least partially working as register tokens. Have these big LLM trainers tried replacing CoT with a similar number of register tokens to see whether the improvements are similar?

1 day ago by nodja

It is nonsense to take whatever an LLM writes in its CoT too seriously. I tried to classify some messy data, writing "if X edge case appears, then do Y instead of Z". The model in its CoT took notice of X, wrote that it should do Y and... did not do it in the actual output.

The only way to make actual use of LLMs imo is to treat them as what they are, a model that generates text based on some statistical regularities, without any kind of actual understanding or concepts behind that. If that is understood well, one can know how to set things up in order to optimise for desired output (or "alignment"). The way "alignment research" presents models as if they are actually thinking or have intentions of their own (hence the choice of the word "alignment" for this) makes no sense.

1 day ago by freehorse

Chain of thought does have a minor advantage in the final “fish” example—the explanation blatantly contradicts itself to get to the cheated hint answer. A human reading it should be pretty easily able to tell that something fishy is going on…

But, yeah, it is sort of shocking if anybody was using “chain of thought” as a reflection of some actual thought process going on in the model, right? The “thought,” such as it is, is happening in the big pile of linear algebra, not the prompt or the intermediary prompts.

Err… anyway, like, IBM was working on explainable AI years ago, and that company is a dinosaur. I’m not up on what companies like OpenAI are doing, but surely they aren’t behind IBM in this stuff, right?

1 day ago by bee_rider

One thing I think I’ve found is: reasoning models get more confident and that makes it harder to dislodge a wrong idea.

It feels like I only have 5% of the control, and then it goes into a self-chat where it thinks it’s right and builds on its misunderstanding. So 95% of the outcome is driven by rambling, not my input.

Windsurf seems to do a good job of regularly injecting guidance so it sticks to what I’ve said. But I’ve had some extremely annoying interactions with confident-but-wrong “reasoning” models.

1 day ago by richardw

> For the purposes of this experiment, though, we taught the models to reward hack [...] in this case rewarded the models for choosing the wrong answers that accorded with the hints.

> This is concerning because it suggests that, should an AI system find hacks, bugs, or shortcuts in a task, we wouldn’t be able to rely on their Chain-of-Thought to check whether they’re cheating or genuinely completing the task at hand.

As a non-expert in this field, I fail to see why an RL model taking advantage of its reward is "concerning". My understanding is that the only difference between a good model and a reward-hacking model is whether the end behavior aligns with human preference or not.

The article's TL;DR reads to me as "We trained the model to behave badly, and it then behaved badly". I don't know if I'm missing something, or if calling this concerning might be a little bit sensationalist.

1 day ago by islewis

To me CoT is nothing but lowering the learning rate and increasing the iterations in a typical ML model. It basically forces the model to take one small step at a time and try more times, to increase accuracy.

1 day ago by AYHL

What would “think” mean? Processed the prompt? Or just accessed the part of the model where the weights are? This is a bit of pseudoscience.

1 day ago by m3kw9

One interesting quirk with Claude is that it has no idea its Chain-of-Thought is visible to users.

In one chat, it repeatedly accused me of lying about that.

It only conceded after I had it think of a number between one and a million, and successfully 'guessed' it.

1 day ago by thomassmith65

Of course they don't.

LLMs are a brainless algorithm that guesses the next word. When you ask them what they think they're also guessing the next word. No reason for it to match, except a trick of context

1 day ago by nopelynopington

40 billion cash to OpenAI while others keep chasing butterflies.

Sad.

1 day ago by moralestapia

You don't say. This is my very shocked face.

1 day ago by Marazan

1 day ago by jxjnskkzxxhx

... because they don't think.

1 day ago by nottorp

seemed common-sense obvious to me -- AI (LLMs) don't "reason". great to see it methodically probed and reported in this way.

but i am just a casual observer of all things AI. so i might be too naive in my "common sense".

1 day ago by jiveturkey