HN Reader

It shouldn't be difficult at all. When you record original music, or write something down on paper, it's instantly copyrighted. Why shouldn't that same legal precedent apply to content on the internet?

This is half a failure of elected representatives to do their jobs, and half amoral tech companies exploiting legal loopholes. Normal people almost universally agree something needs to be done about it, and the conversation is not a new one either.

5 months ago by johnnienaked

What is this war about?

I was looking at another thread about how Wikipedia was the best thing on the internet. But they only got the head start by taking a copy of Encyclopedia Britannica and everything else is a

And now that the corpus is collected, what difference does a blog post make? Does it nudge the dial of comprehension 0.001% in a better direction? How many blog posts over how many weeks make the difference?

5 months ago by Popeyes

I continue to have contempt for the "I'm not contributing to the enrichment of our newest and most powerful technology" gang. I do not accept the assertion that my AI should have any less access to the internet that we all pay for than I do.

If guys like this have their way, AI will remain stupid and limited and we will all be worse off for it.

5 months ago by tqwhite

I find this whole anti-LLM stance so weird. It kind of feels like trying to build robot distractions into websites to distract search engine indexers in the 2000s or something.

Like why? Don't you want people to read your content? Does it really matter that meat bags find out about your message to the world through your own website or through an LLM?

Meanwhile, the rest of the world is trying to figure out how to deliberately get their stuff INTO as many LLMs as fast as possible.

5 months ago by bboygravity

Personally, I would not want to be on the side of people openly saying that they are poisoning the well.

5 months ago by karahime

In my opinion, colonialism was significantly worse than web crawlers being used to train LLMs.

5 months ago by wilg

There are two common misconceptions in this post.

The first isn't worth arguing against: it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it: https://platform.openai.com/docs/bots

Since LLM skeptics frequently characterize all LLM vendors as dishonest mustache-twirling cartoon villains, there's little point trying to convince them that companies sometimes actually do what they say they are doing.

The bigger misconception though is the idea that LLM training involves indiscriminately hoovering up every inch of text that the lab can get hold of, quality be damned. As far as I can tell that hasn't been true since the GPT-3 era.

Building a great LLM is entirely about building a high quality training set. That's the whole game! Filtering out garbage articles full of spelling mistakes is one of many steps a vendor will take in curating that training data.

5 months ago by simonw

> According to Google, it’s possible to verify Googlebot by matching the crawler’s IP against a list of published Googlebot IPs. This is rather technical and highly intensive

Wat. Blocklisting IPs is not very technical (for someone running a website that knows + cares about crawling) and is definitely not intensive. Fetch IP list, add to blocklist. Repeat daily with cronjob.

Would take an LLM (heh) 10 seconds to write you the necessary script.
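
For what it's worth, a rough sketch of the "fetch IP list, check against it" step might look like the following. The URL and the `prefixes` / `ipv4Prefix` / `ipv6Prefix` JSON layout are assumptions about how the published list is shaped, so check the current docs before relying on them; the function names are made up for illustration.

```python
import ipaddress
import json
import urllib.request

# Assumption: illustrative URL for the published Googlebot ranges; verify it.
GOOGLEBOT_RANGES_URL = (
    "https://developers.google.com/search/apis/ipranges/googlebot.json"
)


def parse_ranges(doc):
    """Convert an assumed {"prefixes": [{"ipv4Prefix": ...}, ...]} doc to networks."""
    networks = []
    for entry in doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def fetch_ranges(url=GOOGLEBOT_RANGES_URL):
    """Download and parse the list; run this daily from a cronjob."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_ranges(json.load(resp))


def ip_in_ranges(ip, networks):
    """True if the address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


# Offline demo with documentation-only sample ranges (not real Googlebot IPs).
sample_doc = {
    "prefixes": [
        {"ipv4Prefix": "192.0.2.0/27"},
        {"ipv6Prefix": "2001:db8:1::/64"},
    ]
}
sample_networks = parse_ranges(sample_doc)
```

From there you can feed the result into whatever your firewall or web server uses, as an allowlist for verified crawlers or a blocklist for everything claiming to be one.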

5 months ago by fastball

Not every bot that ignores your robots.txt is necessarily using that data.

What some bots do is they first scrape the whole site, then look at which parts are covered by robots.txt, and then store that portion of the website under an “ignored” flag.

This way, if your robots.txt changes later, they don’t have to scrape the whole site again, they can just turn off the ignored flag.
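
A minimal sketch of that pattern, assuming the crawler keeps every fetched page in a dict and only recomputes the flag when robots.txt changes. `ExampleBot`, `flag_pages` and the sample URLs are hypothetical; this is not any particular crawler's code.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt):
    """Parse robots.txt text into a rule set without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def flag_pages(pages, robots_txt, user_agent="ExampleBot"):
    """Keep every scraped page, marking the ones robots.txt disallows."""
    rules = load_rules(robots_txt)
    return {
        url: {"body": body, "ignored": not rules.can_fetch(user_agent, url)}
        for url, body in pages.items()
    }


# Pages scraped once, regardless of robots.txt (hypothetical sample data).
scraped = {
    "https://example.com/blog/post": "public post",
    "https://example.com/private/notes": "private notes",
}

old_robots = "User-agent: *\nDisallow: /private/\n"
new_robots = "User-agent: *\nDisallow: /drafts/\n"

store = flag_pages(scraped, old_robots)
# robots.txt changed: re-flag the stored pages instead of scraping again.
updated = flag_pages({u: p["body"] for u, p in store.items()}, new_robots)
```

Note that the disallowed content is still fetched and stored in this scheme; the flag only controls whether it gets used.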

5 months ago by deadbabe

> One of the many pressing issues with Large Language Models (LLMs) is they are trained on content that isn’t theirs to consume.

One of the many pressing issues is that people believe that ownership of content should be absolute, that hammer makers should be able to dictate what is made with hammers they sell. This is absolutely poison as a concept.

Content belongs to everyone. Creators of content have a limited term, limited right to exploit that content. They should be protected from perfect reconstruction and sale of that content, and nothing else. Every IP law counter to that is toxic to culture and society.

5 months ago by protocolture