HN Reader

266

220

Most web scrapers, even if illegal, are for... business. So they scrape amazon, or shops. So yeah. Most unwanted traffic is from big tech, or bad actors trying to sniff vulnerabilities.

I know a thing or two about web scraping.

There are sometimes status codes 404 for protection, so that you skip this site, so my crawler tries, as a hammer, several of faster crawling methods (curlcffi).

Zip bombs are also not for me. Reading header content length is enough to not read the page/file. I provide byte limit to check if response is not too big for me. For other cases reading timeout is enough.

Oh, and did you know that requests timeout is not really timeout a timeout for page read? So server can spoonfeed you bytes, one after another, and there will be no timeout.

That is why I created my own crawling system to mitigate these problems, and have one consistent mean of running selenium.

https://github.com/rumca-js/crawler-buddy

Based on library

https://github.com/rumca-js/webtoolkit

3 months agoby renegat0x0

I'm not overly surprised, it's probably faster to search the text for http/https than parse the DOM

3 months agoby rokkamokka

Fun to see practical applications of interesting research[1]

[1]https://news.ycombinator.com/item?id=45529587

3 months agoby sharkjacobs

It doesn't seem that abusive. I don't comment things out thinking "this will keep robots from reading this".

3 months agoby Noumenon72

The title is confusing, should be "commented-out".

3 months agoby stevage

when I used to crawl the web, battle tested Perl regexes were more reliable than anything else, commented urls would have been added to my queue.

3 months agoby latenightcoding

Sounds like you should give the bots exactly what they want... a 512MB file of random data.

3 months agoby bigbuppo

Two thoughts here when it comes to poisoning unwanted LLM training data traffic

1) A coordinated effort among different sites will have a much greater chance of poisoning the data of a model so long as they can avoid any post scraping deduplication or filtering.

2) I wonder if copyright law can be used to amplify the cost of poisoning here. Perhaps if the poisoned content is something which has already been shown to be aggressively litigated against then the copyright owner will go after them when the model can be shown to contain that banned data. This may open up site owners to the legal risk of distributing this content though… not sure. A cooperative effort with a copyright holder may sidestep this risk but they would have to have the means and want to litigate.

3 months agoby mikeiz404

> most likely trying to non-consensually collect content for training LLMs

No, it's just background internet scanning noise

3 months agoby throw_me_uwu

Well, if they’re going to request commented out scripts, serve them up some very large scripts…

3 months agoby sokoloff

I blame modern CS programs that don't teach kids about parsing. The last time I looked at some scraping code, the dev was using regexes to "parse" html to find various references.

Maybe that's a way to defend against bots that ignore robots.txt, include a reference to a Honeypot HTML file with garbage text, but include the link to it in a comment.

3 months agoby OhMeadhbh

>These were scrapers, and they were most likely trying to non-consensually collect content for training LLMs.

"Non-consensually", as if you had to ask for permission to perform a GET request to an open HTTP server.

Yes, I know about weev. That was a travesty.

3 months agoby bakql