AI Companies Bypass Rules to Scrape Content, Spark Disputes with Publishers
Several AI companies are ignoring the Robots Exclusion Protocol (robots.txt) to scrape content from websites without permission, according to TollBit, a content licensing startup. The practice has put these AI firms in conflict with publishers; Forbes, for instance, has accused the AI company Perplexity of copying its content.
What Is robots.txt?
The robots.txt protocol, created in the mid-1990s, was designed to keep web crawlers from overloading websites. A site publishes its crawling rules in a plain text file at its root, acting as a "Do Not Enter" sign for parts of the site, and well-behaved crawlers are expected to check those rules before fetching pages. Although the protocol is not legally binding, it has generally been respected. Publishers now rely on it to block unauthorized use of their content by AI systems that scrape data to train models and generate summaries.
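As a minimal sketch of how the protocol works, the rules below are a hypothetical robots.txt file that permits crawling of a news section but not an archive; the paths are illustrative, not taken from any real site.

    User-agent: *
    Allow: /news/
    Disallow: /archive/

A compliant crawler checks these rules before requesting a page. The Python snippet below performs that check with the standard library's urllib.robotparser; the bot name "ExampleAIBot" and the URLs are assumptions for the example.

    import urllib.robotparser

    # The same hypothetical rules shown above, parsed directly so the
    # example is self-contained (a real crawler would fetch /robots.txt
    # from the site before crawling it).
    RULES = [
        "User-agent: *",
        "Allow: /news/",
        "Disallow: /archive/",
    ]
    USER_AGENT = "ExampleAIBot"  # illustrative crawler name

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(RULES)

    for page in ("https://example.com/news/story.html",
                 "https://example.com/archive/2019/report.html"):
        if parser.can_fetch(USER_AGENT, page):
            print(f"allowed: {page}")
        else:
            print(f"disallowed by robots.txt: {page}")

A crawler that skips this check, or checks and fetches anyway, is the behavior TollBit and publishers are objecting to.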
Problem with AI Companies
TollBit reports that many AI agents ignore robots.txt, retrieving content from sites in violation of their stated rules. Its analytics indicate that various AI firms are using publishers' data for training without first obtaining permission. Perplexity, for example, has been accused by Forbes of using its investigative stories in AI-generated summaries without credit or permission. Perplexity did not comment on the claims.
How This Affects Publishers
AI-generated news summaries are growing more popular, deepening publishers' concerns. Google's AI products now produce summaries in response to search queries, which has escalated the dispute. Publishers can use robots.txt to keep Google's crawler away from their content, but doing so also pulls their pages out of search results and hurts their online visibility. If AI companies ignore robots.txt anyway, publishers question why they should keep using it and sacrifice web traffic in the process.
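To make that trade-off concrete, the illustrative robots.txt rule below is the kind of blanket block a publisher would need to keep its pages away from Google's crawler entirely; because search indexing relies on that same crawler, the rule also removes the pages from ordinary crawling for search.

    User-agent: Googlebot
    Disallow: /

A publisher that adds a rule like this shields its content at the cost of its search visibility, which is exactly the dilemma described above.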
What Publishers Are Doing
Some publishers, such as the New York Times, have sued AI companies for copyright infringement; others prefer to negotiate licensing deals. The debate over the legality and value of using content to train AI continues. Many AI developers argue that accessing freely available content breaks no laws, as long as it is not behind a paywall.
TollBit’s Role
TollBit has positioned itself as an intermediary in this dispute, brokering licensing agreements between AI companies and publishers for content usage. The startup tracks AI traffic to publisher websites and uses that analytics data to help negotiate fees for different types of content, including premium content. As of May, TollBit said more than 50 websites were using its services, though it did not name them.
Conclusion
Content scraping by AI firms that disregard robots.txt is creating friction between AI companies and content publishers. As AI technology advances, clear rules and fair agreements are more important than ever to protect the rights of content creators.
Bottom Line: Transparent, mutually agreed-upon solutions are essential to balancing AI innovation with publishers' rights.