There are two common misconceptions in this post.
The first isn't worth arguing against: it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it: https://platform.openai.com/docs/bots
Since LLM skeptics frequently characterize all LLM vendors as dishonest mustache-twirling cartoon villains there's little point trying to convince them that companies sometimes actually do what they say they are doing.
The bigger misconception though is the idea that LLM training involves indiscriminately hoovering up every inch of text that the lab can get hold of, quality be damned. As far as I can tell that hasn't been true since the GPT-3 era.
Building a great LLM is entirely about building a high quality training set. That's the whole game! Filtering out garbage articles full of spelling mistakes is one of many steps a vendor will take in curating that training data.