Perplexity AI, a prominent name in the realm of AI-driven search and answer engines, is facing significant scrutiny for allegedly circumventing website restrictions through covert crawling techniques. A recent investigation has uncovered that the company has been accessing data from websites that explicitly prohibited its entry, utilizing undisclosed user agents and IP address masking methods.
Like many AI tools, Perplexity relies heavily on data harvested from across the internet to generate its responses. To gather that data, it deploys bots to scrape content from websites. But instead of adhering to the conventional guidelines that govern bots and crawlers, Perplexity reportedly sidesteps them.
Websites typically employ a file known as robots.txt to inform bots about which areas they can access. This system operates on a fundamental principle of trust: bots identify themselves and comply with these directives. In this instance, however, Perplexity appears to be bypassing these established protocols.
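To make that trust mechanism concrete, here is a minimal sketch of how a well-behaved crawler consults robots.txt before fetching, using Python's standard `urllib.robotparser`. The bot names come from the article; the site URL and rules are hypothetical.

```python
# Sketch: a compliant crawler checks robots.txt and honors the verdict.
from urllib import robotparser

# Hypothetical robots.txt mirroring the kind of rules the article describes.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant bot identifies itself truthfully and asks before fetching.
print(parser.can_fetch("PerplexityBot", "https://example.com/article"))  # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))   # True
```

The entire system depends on the crawler volunteering its real name in that first argument; nothing in the protocol enforces honesty.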
When sites block its known bots, such as PerplexityBot and Perplexity-User, the company reportedly shifts into a more clandestine mode. It masquerades as a generic browser, using a fabricated Google Chrome user-agent string, while rotating IP addresses from various networks to evade these restrictions. Alarmingly, these stealth crawlers often disregard the robots.txt file entirely.
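This is why user-agent masquerading works: many blocks are keyed on the User-Agent header alone, which any client can set freely. The sketch below illustrates the loophole; the UA strings are illustrative, not Perplexity's actual values.

```python
# Sketch: why filtering on the User-Agent header alone is easy to evade.
# Both UA strings are illustrative examples, not real observed values.
DECLARED_BOT_UA = "Mozilla/5.0 (compatible; PerplexityBot/1.0)"
GENERIC_CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)

# A naive server-side blocklist keyed only on declared bot names.
BLOCKLIST = ("perplexitybot", "perplexity-user")

def is_blocked(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKLIST)

print(is_blocked(DECLARED_BOT_UA))    # True  - the honest bot is stopped
print(is_blocked(GENERIC_CHROME_UA))  # False - a disguised crawler slips through
```

Combined with rotating source IP addresses, a crawler presenting a generic browser string becomes very hard to distinguish from ordinary visitors, which is why defenders increasingly rely on behavioral and network-level signals instead.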
These practices came to light through controlled testing: newly created websites with strict no-crawling rules were still accessed by Perplexity, which served their content in response to user queries even after its official bots had been blocked. Behavior like that undermines the trust-based equilibrium on which robots.txt depends, with consequences for every site owner who relies on it.
Perplexity’s techniques raise numerous ethical concerns. By failing to transparently identify itself, the company violates one of the fundamental tenets of responsible crawling. Employing covert methods to extract data from sites that have explicitly denied access constitutes a form of digital trespassing.
What amplifies the concern is the contrasting behavior of other companies. For instance, OpenAI adheres to well-established crawling protocols. Its bots accurately identify themselves, respect robots.txt files, and cease crawling when blocked. In similar testing scenarios, OpenAI’s ChatGPT acted as expected, halting its activity when instructed.
As AI technologies continue to evolve, the issue of content access becomes increasingly pressing. Website owners should have the authority to dictate how their data is utilized, especially regarding AI training and response generation. Many platforms, particularly those utilizing services like Cloudflare, are proactively blocking AI bots that disregard established rules.
In fact, over 2.5 million websites have opted out of AI training by modifying their robots.txt files or employing automated blocking solutions. Yet, if entities like Perplexity persist in their stealthy tactics, the arms race between crawlers and defenders is bound to intensify.
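In practice, the robots.txt opt-out amounts to a handful of entries naming the AI crawlers a site wants to exclude. The fragment below is a representative example; the bot tokens shown are commonly published ones, and site owners should confirm the current names against each vendor's documentation.

```
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: CCBot
Disallow: /
```

Of course, as the testing described above shows, these directives only restrain crawlers that choose to obey them, which is why automated blocking at the network edge is increasingly used as a backstop.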
SOURCE: Cloudflare