HN Reader

Extracting books from production language models (2026)

It's all pretty obvious to anyone who tried a similar experiment just out of curiosity. Big models remember a lot. And all non-local models have regurgitation filters in place due to this fact, with the entire dataset indexed (e.g. Gemini will even cite the source of the regurgitated text as it gives the RECITATION error). You'll eventually trip those filters if you force the model to repeat some copyrighted text. Interesting that they don't even try to circumvent those, they simply repeat the request from the interruption point, as the match needs some runway to trigger and by that time a part of the response is already streamed in.

27 days agoby orbital-decay

This sounds pretty damning, why don't they implement a n-gram based bloom filter to ensure they don't replicate expression too close to the protected IP they trained on? Almost any random 10 word ngram is unique on the internet.

Alternatively they could train on synthetic data like summaries and QA pairs extracted from protected sources, so the model gets the ideas separated from their original expression. Since it never saw the originals it can't regurgitate them.

27 days agoby visarga

I found that Opus 4 was happy to regurgitate a random paragraph from the latter half of Wealth of Nations that nobody quotes. It was probably only in the training data once.

I was thinking we could use this technique to figure out which books were in / out of the training data for various models. Limitation is having to wrestle with refusals.

26 days agoby clbrmbr

If you can extract a whole copyrighted book from a large language model, what is the point in copywriting your book?

28 days agoby cleandreams

Everyone just needs to be honest and accept that YES these models were OBVIOUSLY trained on millions of copyright material.

The models would be at least 50% better if these filters weren't in place. These filters force the model essentially lie, thus they will obviously degrade output quality.

The problem is the general public isn't 100% certain of the copyright violations/ don't understand this yet and lawyers/government will try and sue if the companies admitted it. So a Moloch is created where it's a lose lose and the model quality suffers as a result.

(if people want exact copies of text content they can already get them for free through the same sites that these companies got them, so I don't see the models regurgitation as a issue worth worsening quality over.)

26 days agoby carshodev

I find it interesting that OpenAI's safety worked best, where the others didn't work at all. I had different impressions before

27 days agoby rurban