> model collapse happens when the training data no longer matches real-world data
This isn't a significant issue IMO: human-created content isn't "real-world" per se; it's a human-created world, an interpretation and representation of the real. The real world is the raw data perceived by sensors, human or machine. And while model-generated content doesn't match human-created content well, in the vast majority of cases it's still humans curating, modifying and publishing generated content based on how useful it is (there are of course spammers, etc., but that's a general issue). Humans do the same with content created by other humans.
So over time, generated content will become a sort of norm, adopted by and inevitably molding humans, just as human-created content does. Instead of model collapse, the two sources of content will converge over time, particularly as the ability to generate content directly from real-world data is developed and integrated into multi-modal models.
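The mechanics of this can be sketched with a toy recurrence. Suppose each self-training generation retains only a fraction R of the output distribution's variance (the usual collapse mechanism), while human curation mixes a fraction F of real-world-anchored content back in each generation. R, F, and the variance values here are illustrative assumptions, not measurements:

```python
# Toy variance recurrence for iterative retraining on generated content.
# Without curation, variance decays geometrically toward zero (collapse).
# With curation, mixing in a fraction F of real-world-anchored content
# each generation gives a stable nonzero fixed point instead.
R = 0.95       # variance retained per self-training generation (assumed)
F = 0.20       # fraction of curated/real-world content per generation (assumed)
REAL_VAR = 1.0 # variance of the real-world reference distribution

def iterate(generations: int, curated: bool) -> float:
    v = REAL_VAR
    for _ in range(generations):
        v *= R                            # self-training loses variance
        if curated:
            v = (1 - F) * v + F * REAL_VAR  # curation re-anchors to the real
    return v

print(f"no curation:   var after 100 gens = {iterate(100, False):.4f}")
print(f"with curation: var after 100 gens = {iterate(100, True):.4f}")
```

In this sketch the uncurated variance decays as R^t toward zero, while the curated run converges to the fixed point F / (1 - (1-F)R), i.e. the two content sources settle into a shared, stable distribution rather than collapsing.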