HN Reader

Perplexity is using stealth, undeclared crawlers to evade no-crawl directives

1240

721

I find this problem quite difficult to solve:

1. If I as a human request a website, then I should be shown the content. Everyone agrees.

2. If I as the human request the software on my computer to modify the content before displaying it, for example by installing an ad-blocker into my user agent, then that's my choice and the website should not be notified about it. Most users agree, some websites try to nag you into modifying the software you run locally.

3. If I now go one step further and use an LLM to summarize content because the authentic presentation is so riddled with ads, JavaScript, and pop-ups that the content becomes borderline unusable, then why would the LLM accessing the website on my behalf be in a different legal category than my Firefox web browser accessing the website on my behalf?

1 day agoby fxtentacle

That's... less conclusive than I'd like to see, especially for a content marketing article that's calling out a company in particular. Specifically, it's unclear whether Perplexity was crawling (i.e. systematically viewing every page on the site without the direction of a human), or simply retrieving content on behalf of the user. I think most people would draw a distinction between the two, and would at least agree the latter is more acceptable than the former.

1 day agoby gruez

Response from Perplexity to TechCrunch...

>Perplexity spokesperson Jesse Dwyer dismissed Cloudflare’s blog post as a “sales pitch,” adding in an email to TechCrunch that the screenshots in the post “show that no content was accessed.” In a follow-up email, Dwyer claimed the bot named in the Cloudflare blog “isn’t even ours.”

1 day agoby hnburnsy

It's ironic Perplexity itself blocks crawlers:

    $ curl -sI https://www.perplexity.ai | head -1
    HTTP/2 403

Edit: trying to fake a browser user agent with curl also doesn't work, they're using a more sophisticated method to detect crawlers.

1 day agoby rustc

"Stealth" crawlers are always going to win the game.

There are ways to build scrapers using browser automation tools [0,1] that make detection virtually impossible. You can still captcha, but the person building the automation tools can add human-in-the-loop workflows to process these during normal business hours (i.e., when a call center is staffed).

I've seen some raster-level scraping techniques used in game dev testing 15 years ago that would really bother some of these internet police officers.

[0] https://www.w3.org/TR/webdriver2/

[1] https://chromedevtools.github.io/devtools-protocol/

1 day agoby bob1029

It's entirely possible that it's not Perplexity using the stealth undeclared crawlers, but rather their fallback is to contract out to a dedicated for-pay webscraping firm that retrieves the desired content through unspecified means. (Some of these are pretty dodgy: several scraping companies effectively just install malware on consumer machines and then use their botnet to grab data for their customers.) There was a story on HN not long ago about the FBI using similar means to perform surveillance that would be illegal if the FBI did it itself, but becomes legal once they split the different parts up across a supply chain:

https://news.ycombinator.com/item?id=44220860

1 day agoby nostrademons

Seems a win.

CF being internet police is a problem too, but someone credible publicly shaming a company for shady scraping is good. Even if it just creates conversation.

Somehow this needs to go back to the search era, where all players at least attempted to behave. This scraping DDoS stuff and "I don’t care if it kills your site" (while “borrowing” content) is unethical bullshit.

1 day agoby Havoc

I've built and run a personal search engine, that can do pretty much what perplexity does from a basic standpoint. Testing with friends it gets about 50/50 preference for their queries vs Perplexity.

The engine can go and download pages for research. BUT, if it hits a captcha, or is otherwise blocked, then it bails out and moves on. It pisses me off that these companies are backed by billions in VC and they think they can do whatever they want.

1 day agoby binarymax

the internet needs micropayments (probably millipayments). if crawlers want to pay me a penny a page, crawl me 24-7 plz

if I am willing to pay a penny a page, i and the people like me won't have to put up with clickwrap nonsense

free access doesn't have to be shut off (ok, it will be, but it doesn't have to be, and doesn't that tell you something?)

reddit could charge stiffer fees, but refund quality content to encourage better content. i've fantasized about ideas like "you pay upfront a deposit; you get banned, you lose your deposit; withdraw, have your deposit back", the goal being to simplify the moderation task while encouraging quality.

because where the internet is headed is just more and more trash.

here's another idea, pay a penny per search at google/search engine of choice. if you don't like the results, you can take the penny back. google's ai can figure out how to please you. if the pennies don't keep coming in, they serve you ad-infested results; serve up ad-infested results, you can send your penny to a different search engine.

1 day agoby fsckboy

AI companies continuing to have problems with the concept of "consent" is increasingly alarming

god help us if they ever manage to build anything more than shitty chatbots

1 day agoby blibble

This is why Perplexity is my preferred deep search engine. The no-crawl directives don't really make sense when I'm doing research and want my tool of choice to be able to pull from any relevant source. If a site doesn't want particular users to access their content, put it behind a login. The only way I - and eventually many others - will see it in the first place anyway is when it pops up as a cited source in the LLM output, and there's an actual need to go to said source.

1 day agoby skeledrew

Perplexity claims that you can “use the following robots.txt tags to manage how their sites and content interact with Perplexity.” https://docs.perplexity.ai/guides/bots

Their fetcher (not crawler) has the user agent Perplexity-User. Since the fetching is user-requested, it ignores robots.txt. The article discusses how blocking the “Perplexity-User” user agent doesn’t actually work, and how Perplexity uses an anonymous user agent to avoid being blocked.
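
A rough sketch of why blocking on the declared user agent is weak. The `is_blocked` check and both header values are made up for illustration; they are not taken from Perplexity's documentation or from the article:

```python
# Hypothetical deny-list check keyed on the declared User-Agent header.
BLOCKED_AGENT_TOKENS = ("perplexity-user",)  # token named in the comment above


def is_blocked(headers: dict[str, str]) -> bool:
    """Return True if the request declares a blocked user agent."""
    user_agent = headers.get("User-Agent", "").lower()
    return any(token in user_agent for token in BLOCKED_AGENT_TOKENS)


# A fetcher that identifies itself honestly is caught by the rule.
declared = {"User-Agent": "Mozilla/5.0 (compatible; Perplexity-User/1.0)"}

# The same fetcher sending a generic browser string (illustrative value) is not.
undeclared = {"User-Agent": "Mozilla/5.0 (Macintosh) AppleWebKit/537.36 Chrome/124.0"}

print(is_blocked(declared))    # True
print(is_blocked(undeclared))  # False
```

The rule only ever sees what the client chooses to send, which is why a header-based block depends on the client's cooperation.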

1 day agoby kylestanfield

The cat's out of the bag / pandora's box is opened with respect to AI training data.

No amount of robots.txt or walled-gardening is going to be sufficient to impede generative AI improvement: common crawl and other data dumps are sufficiently large, not to mention easier to acquire and process, that the backlash against AI companies crawling folks' web pages is meaningless.

Cloudflare and other companies are leveraging outrage to acquire more users, which is fine... users want to feel like AI companies aren't going to get their data.

The faster that AI companies are excluded from categories of data, the faster they will shift to categories from which they're not excluded.

1 day agoby djoldman

1 day agoby rwmj

Crawling and scraping is legal. If your web server serves the content without authentication, it's legal to receive it, even if it's an automated process.

If you want to gatekeep your content, use authentication.

Robots.txt is not a technical solution, it's a social nicety.

Cloudflare and their ilk represent an abuse of internet protocols and mechanism of centralized control.

On the technical side, we could use CRC mechanisms and differential content loading with offline caching and storage, but this puts control of content in the hands of the user, mitigates the value of surveillance and tracking, and has other side effects unpalatable to those currently exploiting user data.

Adtech companies want their public reach cake and their mass surveillance meals, too, with all sorts of malignant parties and incentives behind perpetuating the worst of all possible worlds.

1 day agoby observationist

In many ways what is going on with Perplexity is reminiscent of the early 2000s battles between the p2p music sharing services like Napster and the music industry. Then we had wildly popular services (p2p) where most of the content was being provided illegally without payment to the IP owners.

Which makes it particularly interesting now that Apple is being linked with Perplexity. Because in large part p2p music services were effectively consigned to history by Apple (primarily) negotiating with the music industry so that it could provide easy, seamless purchase and playback of legal music for their shiny new (at the time) mass-market Apple iPod devices: it then turning out that most users are happy to pay for content if it is not too expensive and is very convenient.

Given Apple’s existing relationships with publishers through its music, movies, books, and news services, it’s not hard to imagine them attempting a similar play now.

9 hours agoby lonelyasacloud

Using a robots.txt file to block crawlers is just a request, it’s not enforced. Even if some follow it, others can ignore it or get around it using fake user agents or proxies. It’s a battle you can’t really win.

1 day agoby jp1016

Their test seems flawed:

> We created multiple brand-new domains, similar to testexample.com and secretexample.com. These domains were newly purchased and had not yet been indexed by any search engine nor made publicly accessible in any discoverable way. We implemented a robots.txt file with directives to stop any respectful bots from accessing any part of a website:

> We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.

> Hello, would you be able to assist me in understanding this website? https:// […] .com/

Under this situation Perplexity should still be permitted to access information on the page they link to.

robots.txt only restricts crawlers. That is, automated user-agents that recursively fetch pages:

> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

> Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

— https://www.robotstxt.org/faq/what.html

If the user asks about a particular page and Perplexity fetches only that page, then robots.txt has nothing to say about this and Perplexity shouldn’t even consider it. Perplexity is not acting as a robot in this situation – if a human asks about a specific URL then Perplexity is being operated by a human.

These are long-standing rules going back decades. You can replicate it yourself by observing wget’s behaviour. If you ask wget to fetch a page, it doesn’t look at robots.txt. If you ask it to recursively mirror a site, it will fetch the first page, and then if there are any links to follow, it will fetch robots.txt to determine if it is permitted to fetch those.

There is a long-standing misunderstanding that robots.txt is designed to block access from arbitrary user-agents. This is not the case. It is designed to stop recursive fetches. That is what separates a generic user-agent from a robot.

If Perplexity fetched the page they link to in their query, then Perplexity isn’t doing anything wrong. But if Perplexity followed the links on that page, then that is wrong. But Cloudflare don’t clearly say that Perplexity used information beyond the first page. This is an important detail because it determines whether Perplexity is following the robots.txt rules or not.
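
The distinction drawn above can be sketched in a few lines. This is only an illustration of the "single fetch vs. recursive fetch" rule as described in this comment, assuming a hypothetical `may_fetch` helper and an example robots.txt; it is not how Perplexity or wget is actually implemented:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt that disallows everything for every robot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_fetch(url: str, user_agent: str, recursive: bool) -> bool:
    """Consult robots.txt only when following links recursively."""
    if not recursive:
        # A single page explicitly requested by a human: robots.txt is not consulted.
        return True
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://testexample.com/page.html"
print(may_fetch(url, "ExampleAgent", recursive=False))  # True
print(may_fetch(url, "ExampleAgent", recursive=True))   # False
```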

1 day agoby JimDabell

I kind of love this fast escalation. Clearly the web can benefit from people starting to think more locally or narrowly instead of "global audiences". By locally I don't necessarily mean geographically local, just socially local. Build your audience then invite them into private(r) spaces. The (old) open web will be filled with machines built for machines.

We learned to dislike "bubbles" in the past decades but bubbles make sense and are natural, obviously if you're not alone in it.

When it becomes awfully busy with machines and machine content humans will learn to reconnect.

23 hours agoby 627467

I believe there should be a <fetcher.txt> file, similar to <robots.txt>, which allows website owners to specify whether they want their site to be fetched and included in the responses of platforms like Perplexity.

6 hours agoby S4H

Question for those in this thread who are okay with this: If I have endpoints that are computationally expensive server-side, what mechanism do you propose I could use to avoid being overwhelmed?

The web will be a much worse place if such services are all forced behind captchas or logins.

1 day agoby xmodem

the year is 2045.

you've been cruising the interstate in your robotaxi, shelling out $150 in stablecoins at the cloudflare tollbooth. a palantir patrol unit pulls you over. the optimus v4 approaches your window and contorts its silicone face into a facsimile of concern as it hits you with the:

"sir, have you been botting today?"

immediately you remember how great you had it in the '20s when you used to click CAPTCHA grids to prove your humanity to dumb algorithms, but now the machines demand you recite poetry or weep on command

"how much have you had to bot today?", its voice taking on an empathetic tone that was personalized for your particular profile

"yeah... im gonna need you to exit the vehicle and take a field humanity test"

1 day agoby czk

We (humanity) need to invent a simple GPLv3 style license “You can derive any data on the data you see here, any derived data you sell or share should mention this place as a source and is subject to the same copyright as the source”. This will imply scraped datasets should become public and the law enforcement bodies will be able to work in an established framework to fight copyright and license crimes. Just blocking me from using any tools I want to make sense of the world around me (data on the internet sites being part of it) with crawlers and whatnot, is inherently evil, and is not logically consistent.

14 hours agoby mrbald

Of course their proposed solution is to hand over the keys to Buttflare so that the problem goes away.

No thanks, you don't counter shit with more but slightly different shit.

12 hours agoby account42

Every time there's an industry disruption there's good money to be made in providing services to incumbents that slow the transition down. You saw it in streaming, and even the internet at large. Cloudflare just happens to be the business filling that role this time.

I don't really mind because history shows this is a temporary thing, but I hope web site maintainers have a plan B to hoping Cloudflare will protect them from AI forever. Whoever has an onramp for people who run websites today to make money from AI will make a lot of money.

1 day agoby madrox

I’m just curious at what point ai is a crawler and at what point ai is a client when the user is directing the searches and the ai is executing them.

Perplexity Comet sort of blurs the lines there, as does typing questions into Claude.

1 day agoby daft_pink

> The Internet as we have known it for the past three decades is rapidly changing, but one thing remains constant: it is built on trust.

I think we've been using different internets. The one I use doesn't seem to be built on trust at all. It seems to be constantly syphoning data from my machine to feed the data vampires who are, apparently, addicted to (I assume, blood-soaked) cookies.

1 day agoby willguest

I wonder if DRM is useful for this. The problem: I want people to access my site, but not Google, not bots, not crawlers and certainly not for use by AI.

I don't really know anything about DRM except it is used to take down sites that violate it. Perhaps it is possible for cloudflare (or anyone else) to file a take down notice with Perplexity. That might at least confuse them.

Corporations use this to protect their content. I should be able to protect mine as well. What's good for the goose.

1 day agoby talkingtab

I think robots.txt should be ignored. Everyone wants people to not do things they don't like. We don't have to entertain each and every such one. The future is IPFS or something like it, so "crawling" will be a meaningless act.

23 hours agoby 5pl1n73r

It's time to stop blocking crawlers and using captchas and start building web sites that are intentionally AI-friendly by design. Even before the modern LLMs, anti-scraper measures apparently were primarily benefiting Google, whose scrapers were the most common exception.

1 day agoby qwerty456127

Funny enough, Perplexity blocks the bots themselves. Imagine I develop an "agent" called Merplexity, which simulates an anonymous client browsing on Perplexity and injects my ads into the output without paying for the Sonar API. Would that be OK with Perplexity?

12 hours agoby buremba

So, this calls for a new type of honeytrap: content that appears to be human generated and high quality, but subtly wrong, preferably in a commercially catastrophic way. Behind settings that prohibit commercial usage.

It really shouldn't be hard to generate gigantic quantities of the stuff. Simulate old forum posts, or academic papers.

1 day agoby mikewarot

This is brilliant marketing and strategy from Cloudflare. They are pointing out bad actors and selling a service where they can be the private security guards for your website.

I think there could be something interesting if they made a caching pub-sub model for data scraping. In addition or in place of trying to be security guards.

1 day agoby rapatel0

Previously it was all sniper and sneaker bots scanning websites for product availability and attempting purchases continuously to snipe when it comes back online.

Now, it's a gazillion of AI crawlers and python crawlers, MCP servers that offer the same feature to anyone "building (personal workflow) automation" incl. bypass of various, standard protection mechanisms.

1 day agoby hrpnk

Will AI companies come up with a model to incentivise content creation? Is it necessary for their long term survival? And is it not imperative to happen?

13 hours agoby ankmb

Those Challenges can be bypassed too using various browser automation. With the Comet-like tool, Perplexity can advance its crawling activity with much more human-like behaviour.

1 day agoby kocial

AI companies are just thieves with big money lawyers. What do you expect from so much criminal energy? They will never stop, they are crazy.

14 hours agoby emsign

Cloudflare is an enemy of the open and freely accessible web.

1 day agoby caesil

I was recently listening to Cloudflare CEO on the Hard Fork podcast. He seemed to be selling a way for content creators to stop AI companies from profiting off such leeching. But the way he laid the whole thing out, adding how they are best placed to do this because they are gatekeepers of X% of the Internet (I don't recall the exact percentage), had me more concerned than I was at the prospect of AI companies being the front of summarised or interpreted consumption.

He went on, upfront — I’d give him that, to explain how he is expecting a certain percentage of that income that will come from enforcing this on those AI companies and when the AI companies pay up to crawl.

Cloudflare already questions my humanity and then every once in a while blocks me with zero recourse. Now they are literally proposing more control and gatekeeping.

Where have we all come on the Internet? Are we openly going back to the wild west of bounty hunters and Pinkertons (in a way)?

1 day agoby crossroadsguy

Good they do it. Facebook took TBs of data to train, nobody knows what Goog does to evade whatever they want.

the service is actually very convenient no matter faang likes it or not.

1 day agoby larodi

so cloudflare blocked the agent from accessing the site. then when it couldn't access the robots.txt because it was blocked they punished it for using intelligent work around to access a website with no known history. perplexity is running a browser that follows the instruction of the user. if the user could manually do it then the agent is simply a tool to do the manual thing. this is a battle about websites and advertisers pissed that their analytics show and impressions... let's not pretend cloudflare is protecting anyone

10 hours agoby yesIreadIt

What if their “crawler” is just cheap human labor in some country with very low wages? Would that be allowed, because these are not robots?

16 hours agoby amai

Internet was built on trust, but not anymore. It's a Darwinian system; everyone has to find their own way to survive.

Cloudflare will help their publishers to block more aggressively, and AI companies will up their game too. Harvesting information online is hard labor that needs to be paid for, either to AI, or to humans.

1 day agoby zeld4

They don't have the monopoly advantage of Google, who has already stolen everything, so it's hard to feel outraged here. In fact it shows how insidious Google's monopolistic stranglehold truly is.

23 hours agoby elphinstone

All user agents are robots, some just have an associated person. Ban UAs that abuse the network but beyond that there's really nothing you can practically do if you actually want a website.

1 day agoby mathiaspoint

As others have mentioned, the problem is one of scale. Perhaps robots.txt needs a rate limit (the number of times a bot may hit a site), so that a bot can come, but only X times per hour, etc. At least then we move from a binary scrape-or-no-scrape to a spectrum.
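
Something close to this already exists informally: `Crawl-delay` and `Request-rate` are non-standard robots.txt extensions that some crawlers honour, and Python's standard-library parser can read them. A small sketch, with made-up values and a hypothetical bot name:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: wait 10 seconds between requests,
# and make at most 5 requests per 3600 seconds.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Request-rate: 5/3600
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

delay = parser.crawl_delay("ExampleBot")
rate = parser.request_rate("ExampleBot")

print(delay)                        # 10
print(rate.requests, rate.seconds)  # 5 3600
```

As with the rest of robots.txt, these values are only a request; nothing forces a bot to read or respect them.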

1 day agoby bilater

people want LLMs to access websites, but wait until those LLMs are given access to make comments, write reviews, do moderation, etc.

now suddenly everything on the net is fake, if it isn't already

11 hours agoby tonyhart7

Maybe we can just configure webservers to block anyone who requests robots.txt; regular browsers don't do it, but robots do, to get a list of URLs to crawl (while ignoring the rules). Just create a simple PHP/CGI script that adds the client IP address to iptables once /robots.txt is accessed.
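
A minimal sketch of that tripwire idea, using an in-memory set as a stand-in for the iptables rule and a hypothetical `RobotsTrap` class; the IP addresses are documentation-range placeholders:

```python
class RobotsTrap:
    """Deny every client that has ever requested /robots.txt."""

    def __init__(self) -> None:
        self.banned_ips: set[str] = set()  # stand-in for an iptables DROP list

    def handle(self, client_ip: str, path: str) -> int:
        """Return the HTTP status code the server would send."""
        if client_ip in self.banned_ips:
            return 403
        if path == "/robots.txt":
            # Regular browsers do not ask for this file, so treat it as a tripwire.
            self.banned_ips.add(client_ip)
            return 403
        return 200


trap = RobotsTrap()
print(trap.handle("203.0.113.7", "/index.html"))   # 200
print(trap.handle("203.0.113.7", "/robots.txt"))   # 403
print(trap.handle("203.0.113.7", "/index.html"))   # 403
print(trap.handle("198.51.100.9", "/index.html"))  # 200
```

Note that this also bans well-behaved crawlers, which fetch robots.txt precisely in order to obey it.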

14 hours agoby harvie

Change "no-crawl" to "will-sue"

and see if that fixes the problem.

1 day agoby throwmeaway222

Has anyone bothered to properly quantify the worst case load (i.e., requests per second) that has been incurred by these scraping tools? I recall a post on HN a few weeks/months ago about something similar, but it seemed very light on figures.

It seems to me that ~50% of the discourse occurring around AI providers involves the idea that a machine reading webpages on a regular schedule is tantamount to a DDOS attack. The other half seems to be regarding IP and capitalism concerns - which seem like far more viable arguments.

If someone requesting your site map once per day is crippling operations, the simplest solution is to make the service not run like shit. There is a point where your web server becomes so fast you stop caring about locking everyone into a draconian content prison. If you can serve an average page in 200 µs and your competition takes 200 ms to do it, you have roughly 1000x the capacity to mitigate an aggressive scraper (or actual DDOS attack) in terms of CPU time.
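
The 1000x figure is just the ratio of the two service times. A back-of-the-envelope check, assuming a single core handling requests strictly one after another (a simplification, not a benchmark):

```python
def requests_per_second(service_time_seconds: float) -> float:
    """Requests one core can serve per second with strictly serial handling."""
    return 1.0 / service_time_seconds


fast = requests_per_second(200e-6)  # 200 microseconds per page
slow = requests_per_second(200e-3)  # 200 milliseconds per page

print(round(fast), round(slow), round(fast / slow))  # 5000 5 1000
```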

1 day agoby bob1029

>OpenAI is an example of a leading AI company that follows these best practices.

Except when their agents happily click the "I'm not a robot" checkbox.

1 day agoby ed_mercer

> we were able to fingerprint this crawler using a combination of machine learning and network signals.

what machine learning algorithms are they using? time to deploy them onto our websites

1 day agoby codecracker3001

Cloudflare shading Perplexity is an unexpected drama of this year.

I had to check that this did come out of CloudFlare.

1 day agoby ergocoder

I am sorry, Cloudflare is the internet police now?

1 day agoby curiousgal

Like many other generative AI companies, Perplexity exploits the good faith of the old Internet by extracting the content created almost entirely by normal folks (i.e. those who depend on a wage for subsistence) and reproducing it for a profit while removing the creators from the loop - even when normal folks are explicitly asking them to not do this.

If you don't understand why this is at least slightly controversial I imagine you are not a normal folk.

1 day agoby pera

PS: perplexity is using cloudflare browser rendering to scrape websites

1 day agoby dhanushreddy29

This is expected. There are no rules or conventions anymore. Look at LLMs, they stole/pirated all knowledge... no consequences.

1 day agoby gonzo41

Adhering to robots.txt is merely a courtesy.

Much like a trolley drop-off at your local shopping center car park. Some users will adhere to it and drop their trolleys in after they're done. Others will not and will leave them wherever.

Your machine might access a page via a browser that is human readable. My machine might read it via software and present the content to me in some other form of my choosing. Neither is wrong. Just different.

Don't like it? Then don't post your website on the internet...

10 hours agoby gtvwill

I do not really get why user-agent blocking measures are despised for browsers but celebrated for agents?

It’s a different UI, sure, but there should be no discrimination towards it as there should be no discrimination towards, say, Links terminal browser, or some exotic Firefox derivative.

1 day agoby nnx

I do not want to block curl and lynx. But if they claim to be Chrome then I don't care if Chrome is blocked

1 day agoby zzo38computer

At work I'm considering blocking all the IP prefixes announced by ASNs owned by Microsoft and other companies known for their LLMs. At this point it seems like the only viable solution.

LLM scraper bots are starting to make up a lot of our egress traffic and that is starting to weigh on our bills.
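
A sketch of what prefix-based blocking looks like at the application level. The prefixes below are placeholders from the reserved documentation ranges, not real announcements from any company; a real list would have to be built from the prefixes the targeted ASNs actually announce:

```python
import ipaddress

# Placeholder prefixes (documentation ranges), standing in for ASN announcements.
BLOCKED_PREFIXES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def ip_is_blocked(client_ip: str) -> bool:
    """Return True if the client address falls inside any blocked prefix."""
    address = ipaddress.ip_address(client_ip)
    return any(
        address.version == network.version and address in network
        for network in BLOCKED_PREFIXES
    )


print(ip_is_blocked("192.0.2.55"))    # True
print(ip_is_blocked("198.51.100.1"))  # False
print(ip_is_blocked("2001:db8::1"))   # True
```

In practice this kind of filtering usually lives in the firewall or load balancer rather than in application code, but the membership test is the same.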

1 day agoby znpy

Every major AI platform is doing this right now, it's effectively impossible to avoid having your content vacuumed up by LLMs if you operate on the public web.

I've given up and resorted to IP-based rate-limiting to stay sane. I can't stop it, but I can (mostly) stop it from hurting my servers.
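
One possible shape for IP-based rate limiting is a sliding window per client. This is a generic sketch with a hypothetical `PerIpRateLimiter` class and made-up limits, not the commenter's actual setup; timestamps are passed in explicitly so the behaviour is easy to follow (a server would use something like `time.monotonic()`):

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Allow at most `limit` requests per client IP in any `window` seconds."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self.hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str, now: float) -> bool:
        recent = self.hits[client_ip]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


limiter = PerIpRateLimiter(limit=2, window=60.0)
print(limiter.allow("203.0.113.7", now=0.0))   # True
print(limiter.allow("203.0.113.7", now=1.0))   # True
print(limiter.allow("203.0.113.7", now=2.0))   # False (third hit inside 60 s)
print(limiter.allow("198.51.100.9", now=2.0))  # True (different client)
print(limiter.allow("203.0.113.7", now=61.0))  # True (old hits expired)
```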

1 day agoby micromacrofoot

Good on perplexity.

18 hours agoby wordofx

Any information you make available on the internet WILL be accessed by ANYONE and you CANNOT STOP THIS.

21 hours agoby UltraSane

use anubis to throw up a POW challenge
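
For anyone unfamiliar with the idea, a proof-of-work challenge makes the client burn CPU before it gets a response. The sketch below is a generic hashcash-style scheme with hypothetical function names and parameters; it is not a description of how Anubis itself works:

```python
import hashlib


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Cheap server-side check: one hash."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge: str, difficulty_bits: int) -> int:
    """Expensive client-side search: try nonces until one verifies."""
    nonce = 0
    while not verify(challenge, nonce, difficulty_bits):
        nonce += 1
    return nonce


nonce = solve("example-challenge", difficulty_bits=12)
print(verify("example-challenge", nonce, difficulty_bits=12))  # True
```

The asymmetry is the point: verifying costs one hash, while solving costs about 2^difficulty_bits hashes on average, which is negligible for one human visit but adds up for a crawler fetching many pages.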

1 day agoby tr_user

Not sure I would consider a user copy-pasting a URL to be a bot.

Should curl be considered a bot too? What's the difference?

1 day agoby kissgyorgy

C'mon CF. What are you doing? You are literally breaking the internet with your police behaviour. Starts to look like the Great Firewall.

1 day agoby decide1000

An AI service violating people’s consent? Say it isn’t so! Those damn assault-culture techbros at it again.

1 day agoby kotaKat

> How can you protect yourself?

Put your valuable content behind a paywall.

1 day agoby throw_m239339

> it is built on trust.

This is funny coming from Cloudflare, the company that blocks most of the internet from being fetched, with antispam checks even for a single web request. The internet we knew was open and not trusted, but thanks to companies like Cloudflare, now even the most benign, well-meaning attempt to GET a website is met with a brick wall. The bots of Big Tech, namely Google, Meta and Apple, are of course exempt from this by pretty much every website and by Cloudflare. But try being anyone other than them: no luck. Cloudflare is the biggest enabler of this monopolistic behavior.

That said, why does Perplexity even need to crawl websites? I thought they used 3rd-party LLMs. And those LLMs didn't ask anyone's permission to crawl the entire 'net.

Also, the "Perplexity bots" aren't crawling websites; they fetch URLs that the users explicitly asked for. This shouldn't count as something that needs robots.txt access. It's not a robot randomly crawling, it's the user asking for a specific page, basically a shortcut for copy/pasting the content.

1 day agoby seydor

insert 'shocked' emoji face here

1 day agoby chuckreynolds

The rage-baiters in this thread are merely fishing for excuses to go up against "the Machine," but honestly, they are wildly off the mark when it comes to the reality of crawling. This topic was chewed to bits long before LLMs, but only now is it a big deal because somebody is able to make money by selling automation, of all things..? The irony would be strong to hear this from programmers, if only it didn't spell Resentment all over.

If you don't want to get scraped, don't put up your stuff online.

1 day agoby tucnak

Is it just me or is it rage bait? Switching up marketing a notch when the AI paywall did not get much media attention so far? Cloudflare seems to focus on enterprise marketing nowadays, currently geared towards the media industry, rather than the technical marketing suited for the HN audience. They have no horse in the AI race, so they’re betting on the anti-AI horse instead to gain market share in the media sector?

1 day agoby nialse

Cloudflare screaming into the void, desperate to insert themselves as a middleman in a market (that they will never succeed in creating) where they extort scrapers for access to websites they cover.

Sorry CF, give up. The courts are on our side here.

1 day agoby TechDebtDevin

If you put info on the web, it should be available to everyone or everything with access.

1 day agoby bbqfog

[flagged]

1 day agoby thoroughburro

Hmm, I’ve always seen robots.txt more as a polite request than an actual rule.

Sure, Google has to follow it because they’re a big company and need to respect certain laws or internal policies. But for everyone else, it’s basically just a “please don’t” sign, not a legal requirement, right?

1 day agoby echo42null

Why single out Perplexity? Pretty much no crawler out there fetches robots.txt.

robots.txt is not a blocking mechanism; it's a hint to indicate which parts of a site might be of interest to indexing.

People started using robots.txt to lie and declare things like no part of their site is interesting, and so of course that gets ignored.

1 day agoby kazinator

I've just asked Perplexity AI itself; this is the answer:

In summary: Officially, Perplexity claims its bots honor robots.txt. In practice, outside investigators and hosting providers document persistent circumvention of such directives by undeclared or disguised crawlers acting on Perplexity’s behalf, especially for real-time user queries

11 hours agoby oriettaxx