Factorization is key here. It separates dataset-level structure from observation-level computation, so the model doesn't waste capacity rediscovering that structure.
I've been arguing the same for code generation: LLMs flatten parse trees into token sequences, then burn compute reconstructing the hierarchy as hidden states. Graph transformers could be a good solution to both problems: https://manidoraisamy.com/ai-mother-tongue.html — a quick sketch of the flattening is below.
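To make the flattening concrete, here's a minimal sketch using Python's own ast and tokenize modules (the expression and variable names are purely illustrative): the same one-line statement appears once as the flat token sequence a sequence model is given, and once as the tree and edge list a graph-aware model could consume directly. The edge-list encoding at the end is just one plausible graph input, not any particular library's format.

```python
import ast
import io
import tokenize

src = "total = price * (1 + tax_rate)"  # toy expression, purely illustrative

# Flat view: roughly what a sequence model sees -- order is preserved,
# but nesting and operator precedence are implicit and must be re-inferred.
flat_tokens = [tok.string
               for tok in tokenize.generate_tokens(io.StringIO(src).readline)
               if tok.string.strip()]
print(flat_tokens)
# ['total', '=', 'price', '*', '(', '1', '+', 'tax_rate', ')']

# Hierarchical view: the parse tree makes grouping explicit.
tree = ast.parse(src)
print(ast.dump(tree.body[0], indent=2))  # indent= requires Python 3.9+

# Edge list a graph transformer could attend over directly
# (a hypothetical input encoding for illustration).
edges = [(type(parent).__name__, type(child).__name__)
         for parent in ast.walk(tree)
         for child in ast.iter_child_nodes(parent)]
print(edges)
```

The point of the contrast: the flat list forces the model to spend attention recovering the BinOp nesting that the tree and edge list hand it for free.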