Amazon is investigating Perplexity over claims of scraping abuse

Amazon’s cloud division has launched an investigation into Perplexity AI. At issue, WIRED has learned, is whether the AI search startup is violating Amazon Web Services’ rules by scraping websites that tried to prevent it from doing so.

An AWS spokesperson, who spoke to WIRED on the condition of anonymity, confirmed the company’s investigation of Perplexity. WIRED previously found that the startup — which is backed by the Jeff Bezos family fund and Nvidia, and was recently valued at $3 billion — relied on scraping content from websites that had restricted access via the Robots Exclusion Protocol, a common web standard. While the Robots Exclusion Protocol is not legally binding, terms of service generally are.

The Robots Exclusion Protocol is a decades-old web standard that involves placing a plaintext file (such as wired.com/robots.txt) on a domain to indicate which pages should not be accessed by automated bots and crawlers. While companies using scrapers may choose to ignore this protocol, most have traditionally respected it. An Amazon spokesperson told WIRED that AWS customers should follow the robots.txt standard when crawling websites.

“AWS’s terms of service prohibit customers from using our services for any illegal activity, and our customers are responsible for complying with our terms and all applicable laws,” the spokesperson said in a statement.
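To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt content, the bot names (`ExampleBot`, `OtherBot`), and the `example.com` URLs are hypothetical and are not taken from any site mentioned in this article. The check is voluntary on the crawler's side: nothing in the protocol itself stops a scraper that skips it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# It blocks one named crawler entirely and keeps all other
# crawlers out of a single directory.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/article"),
        ("OtherBot", "https://example.com/article"),
        ("OtherBot", "https://example.com/private/page"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

A real crawler would typically download the file from the site's `/robots.txt` path first (for example with `RobotFileParser.set_url` and `read`) and identify itself with a consistent user-agent string, since the rules are matched against that name.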

AWS’s investigation into Perplexity’s practices follows a June 11 report by Forbes accusing the startup of plagiarizing at least one article. WIRED’s own investigation confirmed this practice and found further evidence of scraping abuse and plagiarism by systems connected to Perplexity’s AI-powered search chatbot. Engineers at WIRED’s parent company, Condé Nast, blocked Perplexity’s crawler on all of its websites using a robots.txt file. But WIRED found that the company had access to a server using an unpublished IP address (44.221.181.252) that visited Condé Nast properties at least hundreds of times over the past three months, apparently to scrape Condé Nast’s websites.

The machine associated with Perplexity appears to have engaged in widespread crawling of news websites that block bots from accessing their content. Spokespeople for The Guardian, Forbes and The New York Times also say they have detected the IP address on their servers multiple times.

WIRED traced the IP address to a virtual machine known as an Elastic Compute Cloud (EC2) instance hosted on AWS. AWS launched its investigation after we asked whether using its infrastructure to scrape websites that prohibit scraping violated the company’s terms of service.

Last week, Perplexity CEO Aravind Srinivas initially responded to WIRED’s inquiry by saying that the questions we asked the company “reflect a deep and fundamental misunderstanding of Perplexity and how the internet works.” Srinivas then told Fast Company that the unpublished IP address WIRED observed scraping Condé Nast’s websites and a test site we created was operated by a third-party company that provides web crawling and indexing services. He declined to name the company, citing a nondisclosure agreement. When asked if he would ask the third party to stop crawling WIRED, Srinivas replied, “It’s complicated.”
