Context windows are now 1M+ tokens, but context depth is limited. Often the answer is hidden behind layers of linked information, yet an attention block can only resolve one link at a time. We trained a tiny 5-layer model that beats GPT-4.5 on a variable-evaluation task requiring deep, recursive reasoning. How? It learned a divide-and-conquer mechanism.
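The post doesn't spell out the task, but variable-evaluation benchmarks of this shape typically chain assignments (`v1 = v0`, `v2 = v1`, ...) and ask for the final variable's value. Below is a hypothetical sketch, not the authors' actual setup: `make_chain`, `evaluate`, and `pointer_doubling_rounds` are illustrative names. It contrasts sequential resolution, which needs one step per link, with a divide-and-conquer (pointer-doubling) strategy that halves the remaining chain each round, the kind of mechanism the post alludes to.

```python
import random

def make_chain(depth):
    # Chain of bindings: v0 = <digit>; v1 = v0; ...; v<depth> = v<depth-1>.
    # Evaluating the last variable means following `depth` links.
    value = random.randint(0, 9)
    lines = [f"v0 = {value}"] + [f"v{i} = v{i-1}" for i in range(1, depth + 1)]
    random.shuffle(lines)  # presentation order hides the chain
    return "\n".join(lines), f"v{depth}", value

def evaluate(program, query):
    # Sequential resolution: follow one link per step, O(depth) steps.
    env = dict(line.split(" = ") for line in program.splitlines())
    name, steps = query, 0
    while not env[name].isdigit():
        name = env[name]
        steps += 1
    return int(env[name]), steps

def pointer_doubling_rounds(program, query):
    # Divide-and-conquer resolution: each round, every variable jumps to
    # its pointer's pointer, so the remaining chain length halves and the
    # query resolves in O(log depth) rounds instead of O(depth) steps.
    env = dict(line.split(" = ") for line in program.splitlines())
    rounds = 0
    while not env[query].isdigit():
        env = {k: (v if v.isdigit() else env[v]) for k, v in env.items()}
        rounds += 1
    return int(env[query]), rounds

prog, query, answer = make_chain(depth=16)
val_seq, steps = evaluate(prog, query)          # 16 sequential hops
val_dc, rounds = pointer_doubling_rounds(prog, query)  # ~log2(16) rounds
```

A depth-16 chain takes 16 sequential hops but only about five pointer-doubling rounds, which is why a shallow stack of layers can, in principle, resolve chains far deeper than its layer count.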
10 hours ago by michael_lutz