AI Ethics in Question? New Report Claims Rampant Content Scraping from Websites

Perplexity appears to still be bypassing web restrictions to scrape data, despite the controversies of 2024. This risky approach may make Apple and other companies, keen to promote more ethical practices, think twice before making any acquisition offer.

Bots That Disregard Restrictions

In 2024, Perplexity was already under fire for circumventing robots.txt files, which are designed to limit or prevent the scraping of websites by bots. A year later, according to Cloudflare, the company continues to employ even more sophisticated methods.
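For context, robots.txt is a plain-text file served at a site's root that tells crawlers which paths they may fetch. As an illustration only (these rules are not taken from any specific site), a publisher wanting to opt out of Perplexity's declared crawler while still allowing other visitors could publish:

```
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
```

Compliance with these directives is voluntary, which is precisely why a bot that ignores them is hard to stop with robots.txt alone.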

In an experiment, Cloudflare set up brand new websites and requested information about them from Perplexity. The results were telling: even when robots.txt files blocked access, unidentified new bots (using different IPs, user agents, and ASNs) were detected, enabling the AI to provide details found only on those pages. The accuracy of Perplexity’s responses significantly dropped when these rogue bots were blocked, confirming that they were indeed feeding its model.
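Well-behaved crawlers identify themselves and honor these directives, which is what makes undeclared bots conspicuous when an AI's answers betray access to blocked pages. A minimal sketch of the compliant side of that contract, using Python's standard urllib.robotparser and a hypothetical rule set (the PerplexityBot token and paths here are illustrative):

```python
# Illustrative sketch: how a compliant crawler would check robots.txt
# before fetching a page, using Python's standard library parser.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt blocking Perplexity's declared crawler.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

def is_allowed(user_agent: str, path: str) -> bool:
    """Return True if robots.txt permits this user agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)

# The declared crawler is denied; an ordinary browser is not.
print(is_allowed("PerplexityBot", "/article"))  # False
print(is_allowed("Mozilla/5.0", "/article"))    # True
```

Cloudflare's point is that the rogue bots it observed sidestep exactly this check by presenting rotating user agents, IPs, and ASNs that the robots.txt rules never get a chance to match.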

Image: Cloudflare

A Controversial Defense Line

In response to the accusations, Perplexity published a blog post defending its strategy. The company claims its agents are not scrapers used to train models but distinct digital assistants, and accuses Cloudflare of confusing a threat with innovation.

However, many find this argument rather thin: as Cloudflare points out, the purpose of a robots.txt file is clear—to protect websites, their traffic, and thus their revenue. Allowing an AI to deliver complete answers without redirecting the user to the source directly endangers the livelihood of human-run sites, which are themselves essential to the existence of AI models.

Unlike Apple, Google, or OpenAI, which adhere to robots.txt, Perplexity seems to be pushing boundaries aggressively. This strategy may permanently damage its reputation. There are already reports of a significant setback in its discussions with Apple, which had considered acquiring it in 2024. Cupertino, insisting on ethical data sourcing for Apple Intelligence, now appears to want to distance itself from a startup seen as unreliable.

Apple, the Antithesis of Perplexity?

Apple had already faced criticism last May after it was discovered that Applebot had been collecting web data for several years. However, the company maintains that its practices comply with robots.txt and that no private user data is used to train its models.

By relying on a combination of local models, computations offloaded to a private cloud powered by renewable energies, and a commitment never to exploit user queries, Apple aims to embody an ethical and responsible alternative.

As the competition intensifies, the controversy surrounding Perplexity highlights the growing importance of trust and transparency in generative AI.
