It feels to me that the hypothesis of this research was somewhat "begging the question". Reasoning models are trained to spit some tokens out that increase the chance of the models spitting the right answer at the end. That is, the training process is singularly optimizing for the right answer, not the reasoning tokens.
Why would you then assume the reasoning tokens will include hints supplied in the prompt "faithfully"? The model may or may not include the hints - depending on whether the model activations believe those hints are necessary to arrive at the answer. In their experiments, they found between 20% and 40% of the time, the models included those hints. Naively, that sounds unsurprising to me.
Even in the second experiment when they trained the model to use hints, the optimization was around the answer, not the tokens. I am not surprised the models did not include the hints because they are not trained to include the hints.
That said, and in spite of me potentially coming across as an unsurprised-by-the-result reader, it is a good experiment because "now we have some experimental results" to lean into.
Kudos to Anthropic for continuing to study these models.