Thank you for posting this, PaulPauls.
Can I please ask a wacky question I have about mech interp?
We know that when a metric is used to measure humans, the humans game the metric. I wonder whether a future AI might game mech interp in the same way.
For simplicity, let me propose a 2D matrix to encode tokens, and I hope my understanding of neural networks is good enough for this to be a valid question.
I understand that higher dimensions have a lot more 'space', so accidental collisions are easily avoided, but 2D makes my example easier to explain.
If we had the following tokens mapped in 2D space:
Apple 1a
Pear 1b
Donkey 2a
Horse 2b
then it becomes impossible to tell whether neurons 1, 2, a and b all being excited represents apple and horse, or donkey and pear.
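
To make the collision concrete, here is a minimal Python sketch of the toy example. It assumes each token simply switches on one "row" neuron and one "column" neuron, and that two tokens held at once combine as the union of their active neurons. Both are my simplifying assumptions for illustration, not a claim about how a real network represents things, and the names `ENCODING`, `active_neurons` and `consistent_pairs` are made up for this sketch.

```python
from itertools import combinations

# Toy encoding from the example: each token excites one row neuron
# (1 or 2) and one column neuron (a or b).
ENCODING = {
    "Apple": {"1", "a"},
    "Pear": {"1", "b"},
    "Donkey": {"2", "a"},
    "Horse": {"2", "b"},
}

def active_neurons(tokens):
    """Union of the neurons excited by a group of tokens (assumed combination rule)."""
    active = set()
    for token in tokens:
        active |= ENCODING[token]
    return active

def consistent_pairs(observed):
    """Return every pair of tokens whose combined activation matches the observation exactly."""
    return [
        pair
        for pair in combinations(sorted(ENCODING), 2)
        if active_neurons(pair) == observed
    ]

# An observer sees all four neurons excited and asks which pair caused it.
observed = {"1", "2", "a", "b"}
pairs = consistent_pairs(observed)
print(pairs)  # [('Apple', 'Horse'), ('Donkey', 'Pear')]
```

Two different pairs explain the same activation pattern, so from the activations alone the observer cannot tell which one the model "meant".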
I wonder whether a future, vastly more competent AI overseeing its own training might exploit this kind of ambiguity: selecting weights so that the potential for encoding collisions exists, making it possible to deceive a mech interp observer about the model's true thoughts, and essentially enabling the AI to think in euphemisms?