Cloudflare launches a tool to combat AI bots


Publicly traded cloud services provider Cloudflare has launched a new, free tool to prevent bots from scraping websites hosted on its platform for data to train AI models.

Some AI vendors, including Google, OpenAI, and Apple, allow website owners to block bots used for data scraping and model training by amending their site’s robots.txt, a text file that tells bots which pages on a website they can access. But, as Cloudflare points out in a post announcing its bot-combating tools, not all AI scrapers respect this.
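For example, a site owner opting out of model training might add rules like these to robots.txt, using the crawler tokens these vendors document (GPTBot for OpenAI, Google-Extended for Google's training-data control, Applebot-Extended for Apple):

```text
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```

Compliance with these rules is voluntary, which is precisely the gap Cloudflare's tool aims to close.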

“Customers don’t want AI bots visiting their websites, and especially those that do so dishonestly,” the company writes on its official blog. “We fear that some AI companies that are intent on circumventing regulations to access content will continually adapt to avoid bot detection.”

So, in an attempt to solve this problem, Cloudflare analyzed AI bot and crawler traffic to fine-tune its automated bot detection models. The models consider, among other factors, whether an AI bot is trying to evade detection by mimicking the appearance and behavior of a human using a web browser.

“When bad guys attempt to crawl websites at scale, they typically use tools and frameworks that we can fingerprint,” Cloudflare writes. “Based on these signals, our models [are] able to appropriately flag traffic from spoofed AI bots as bots.”
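The idea of fingerprinting can be illustrated with a toy sketch: flag a request if its User-Agent declares a known AI crawler, or if its headers look like they came from an automation framework. The bot names and header heuristics below are purely illustrative assumptions, not Cloudflare's actual detection signals, which draw on far more data.

```python
# Toy illustration of request fingerprinting. The bot tokens and header
# heuristics here are illustrative only, not Cloudflare's real signals.

KNOWN_AI_BOTS = ("GPTBot", "ClaudeBot", "Google-Extended", "CCBot")

def flag_request(user_agent: str, headers: dict) -> bool:
    """Return True if the request looks like an AI crawler."""
    # Well-behaved AI bots identify themselves in the User-Agent string.
    if any(bot in user_agent for bot in KNOWN_AI_BOTS):
        return True
    # Spoofed bots often run on automation frameworks that leave telltale
    # traces, e.g. a headless-browser token or missing browser headers.
    if "HeadlessChrome" in user_agent or "Accept-Language" not in headers:
        return True
    return False

print(flag_request("Mozilla/5.0 (compatible; GPTBot/1.0)", {}))
```

A real system would score many such signals together rather than hard-blocking on any single one, since each heuristic alone produces false positives.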

Cloudflare has created a form for hosts to report suspected AI bots and crawlers and says it will continue to manually blacklist AI bots over time.

The problem of AI bots has become more pronounced as the generative AI boom has increased the demand for model training data.

Many sites, wary of AI vendors training models on their content without warning or compensation, have opted to block AI scrapers and crawlers. According to one study, about 26% of the top 1,000 sites on the web have blocked OpenAI’s bots; another study found that more than 600 news publishers had blocked the bots.

However, blocking is no surefire protection. As reported previously, some vendors appear to ignore standard bot exclusion rules to gain a competitive advantage in the AI race. The AI search engine Perplexity was recently accused of impersonating legitimate visitors to scrape content from websites, and OpenAI and Anthropic have been accused of ignoring robots.txt rules multiple times.

In a letter to publishers last month, content licensing startup Tollbit said that it observes “many AI agents” ignoring the robots.txt standard.

Tools like Cloudflare’s can help, but only if they prove accurate at detecting covert AI bots. And they won’t solve the thornier problem that publishers risk sacrificing referral traffic from AI tools like Google’s AI Overviews, which exclude sites from inclusion if they block specific AI crawlers.
