AI Firms Accused of Ignoring Robots.txt and Scraping Content

Lilu Anderson

AI Companies Bypass Rules to Scrape Content, Spark Disputes with Publishers

Several AI companies are ignoring the Robots Exclusion Protocol (robots.txt) and scraping content from websites without permission, according to TollBit, a content-licensing startup. The practice has put AI firms and publishers in conflict; Forbes, for instance, has accused the AI company Perplexity of copying its content.

What is Robots.txt?

The Robots Exclusion Protocol, created in the mid-1990s to keep web crawlers from overloading websites, lets site owners declare which parts of a site crawlers should stay out of. It works like a "Do Not Enter" sign: the rules live in a plain-text robots.txt file at a site's root, and although they are not legally binding, they have generally been respected. Publishers now rely on the protocol to block unauthorized use of their content by AI systems that scrape data to train models and generate summaries.
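A compliant crawler checks these rules before fetching each URL, and Python's standard library ships a parser for them. A minimal sketch (the crawler names and paths here are illustrative, not taken from any real site):

```python
from urllib import robotparser

# An illustrative robots.txt: one named crawler is barred from /private/,
# everyone else may fetch anything.
rules = [
    "User-agent: ExampleAIBot",
    "Disallow: /private/",
    "",
    "User-agent: *",
    "Disallow:",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)

# A well-behaved crawler consults can_fetch() before each request.
print(parser.can_fetch("ExampleAIBot", "https://example.com/private/report"))  # False
print(parser.can_fetch("OtherBot", "https://example.com/private/report"))      # True
```

Nothing technically stops a crawler from skipping this check, which is precisely the behavior TollBit describes: the file is a convention, not an enforcement mechanism.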

Problem with AI Companies

TollBit reports that many AI crawlers ignore robots.txt and retrieve content from sites in violation of its rules. Its analytics show various AI firms using that data for training without first obtaining permission. Perplexity, for example, has been accused by Forbes of using its investigative stories in AI-generated summaries without credit or permission. Perplexity did not comment on the claims.

How This Affects Publishers

AI-generated news summaries are growing in popularity, which deepens publishers' worries. Google's AI products now generate summaries for search queries, escalating those concerns: blocking Google's crawlers with robots.txt would also remove a publisher's content from search results, hurting its online visibility. So if AI firms ignore robots.txt anyway, publishers ask why they should keep using it and forfeit web traffic on top of everything else.
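The dilemma above comes down to a few lines of configuration. A publisher who wants to opt out of AI training while staying indexed for search might add per-agent rules like the following sketch (GPTBot and Google-Extended are the publicly documented tokens for OpenAI's crawler and Google's AI-training control; the exact list any publisher uses will vary, and as the article notes, some AI features are served by the same crawler as search, so the separation is imperfect):

```
# Opt out of AI-training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# All other crawlers, including ordinary search, remain allowed
User-agent: *
Disallow:
```

The catch the article describes is that these rules only work if the crawler on the other end chooses to honor them.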

What Publishers Are Doing

Some publishers, such as the New York Times, have sued AI companies for copyright infringement; others prefer to negotiate licensing deals. The legality and value of using content for AI training remain contested: many AI developers argue that scraping freely accessible content breaks no law so long as it is not behind a paywall.

TollBit’s Role

TollBit positions itself as a middleman in this dispute, brokering licensing agreements between AI companies and publishers. The startup tracks AI traffic on publisher websites and supplies analytics used to negotiate fees for different content types, including premium content. As of May, TollBit says more than 50 websites use its services, though it has not named them.

Conclusion

Content scraping by AI firms without respecting robots.txt protocols is creating friction between AI companies and content publishers. With AI technology advancing, the need for clear rules and fair agreements is more important than ever to protect the rights of content creators.

Bottom Line: Transparent, mutually agreed-upon arrangements are essential to balance AI innovation with publishers' rights.


Lilu Anderson is a technology writer and analyst with over 12 years of experience in the tech industry. A graduate of Stanford University with a degree in Computer Science, Lilu specializes in emerging technologies, software development, and cybersecurity. Her work has been published in renowned tech publications such as Wired, TechCrunch, and Ars Technica. Lilu’s articles are known for their detailed research, clear articulation, and insightful analysis, making them valuable to readers seeking reliable and up-to-date information on technology trends. She actively stays abreast of the latest advancements and regularly participates in industry conferences and tech meetups. With a strong reputation for expertise, authoritativeness, and trustworthiness, Lilu Anderson continues to deliver high-quality content that helps readers understand and navigate the fast-paced world of technology.