Congratulations, I guess? I can't read your content.
But ... The machines can't either, so ... great job!
Although... Hmm! I just pasted it into Claude and got:
When text content gets scraped from the web and used as ever-growing training data, copyright laws get broken, content gets aggressively scraped, and even though you might have deleted your original work, it might still show up because it got cached or archived at some point.
Now, if you subscribe to the idea that your content shouldn't be used for training, you don't have much say. I wondered how I personally would mitigate this on a technical level.
et tu, caesar?
In my linear algebra class we discussed the caesar cipher[1] as a simple encryption algorithm: every character gets shifted by n positions. If you know (or guess) the shift, you can recover the original text. Brute force or character-frequency heuristics break it easily.
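The shift idea fits in a few lines. A minimal sketch (function name and shift value are my own, purely illustrative):

```python
def caesar(text: str, shift: int) -> str:
    """Shift every ASCII letter by `shift` positions, wrapping around."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)  # leave punctuation, digits, spaces untouched
    return "".join(out)
```

Shifting by n and then by -n round-trips the text, which is exactly why a guessed shift breaks it: there are only 25 keys to try.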
But we can apply this substitution more generally to a font! A font contains a cmap (character map), which maps codepoints to glyphs. A codepoint defines the character, or complex symbol, and the glyph represents its visual shape. We scramble the font's codepoint-to-glyph mapping and transform the text with the inverse of the scramble, so it stays intact for our readers. It displays correctly, but the inspected (or scraped) HTML stays scrambled. Theoretically, you could apply a different scramble to each request.
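The scramble-plus-inverse logic can be sketched without touching an actual font file (rewriting a real cmap would need something like fontTools; the names and the lowercase-only alphabet here are simplifying assumptions):

```python
import random
import string

def make_scramble(seed: int) -> dict[str, str]:
    """Random permutation of a-z, standing in for a scrambled cmap."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)  # seeded, so per-request scrambles are reproducible
    return dict(zip(letters, shuffled))

def substitute(mapping: dict[str, str], text: str) -> str:
    """Apply a substitution; characters outside the mapping pass through."""
    return "".join(mapping.get(ch, ch) for ch in text)

scramble = make_scramble(seed=42)            # what the font's cmap would encode
inverse = {v: k for k, v in scramble.items()}

original = "hello reader"
stored = substitute(inverse, original)       # what the scraped HTML contains
rendered = substitute(scramble, stored)      # what the scrambled font displays
```

The HTML ships `stored`, the font applies `scramble` at render time, and the reader sees `original`. A scraper that ignores the font only ever sees `stored`.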
This works as long as scrapers don't fall back to OCR for edge cases like this, and I don't think that would be feasible at scale.
I also tested whether ChatGPT could decode a ciphertext if I told it a substitution cipher was used, and after some back and forth, it gave me the result: "One day Alice went down a rabbit hole,
How accurate is this?
Did you seriously just make things worse for screen reader users and not even ... verify ... it worked to make things worse for AI?