New Protocol Aims to Resolve AI Training Data Licensing Challenges

Lilu Anderson

Addressing AI’s Training Data Licensing Crisis

Following Anthropic’s landmark $1.5 billion copyright settlement, the AI sector faces mounting legal pressure over the use of unlicensed training data. With around 40 ongoing lawsuits—including a notable case against Midjourney for generating images of Superman—the industry confronts a potential wave of copyright claims that could significantly hinder AI development.

Introducing Really Simple Licensing (RSL)

In response, a coalition of technologists and web publishers has introduced Really Simple Licensing (RSL), a protocol designed to facilitate licensing of training data at scale. Backed by major online platforms such as Reddit, Quora, and Yahoo, RSL aims to establish standardized, machine-readable licensing agreements accessible across the internet.

Eckart Walther, an RSL co-founder and a co-creator of the RSS standard, emphasized the need for a uniform system: “We need to have machine-readable licensing agreements for the internet. That’s really what RSL solves.”

Technically, RSL allows publishers to define specific licensing terms—ranging from custom licenses to Creative Commons options—embedded within their websites’ robots.txt files. This facilitates automated detection of usage rights by AI companies.
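
As a rough illustration of that automated detection, the Python sketch below scans a robots.txt body for a licensing directive. The directive name "License", the example URL, and the function name are assumptions made for this example, not details confirmed by the article; a real crawler would follow the published RSL specification.

```python
from typing import List


def find_license_urls(robots_txt: str, directive: str = "License") -> List[str]:
    """Return the value of every `<directive>: <value>` line in a robots.txt body.

    The directive name is an assumption for illustration only.
    """
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments, then surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        if key.strip().lower() == directive.lower() and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content; example.com is a placeholder domain.
SAMPLE = """\
User-agent: *
Allow: /
License: https://example.com/license.xml  # hypothetical terms file
"""

print(find_license_urls(SAMPLE))
```

A crawler built along these lines would fetch the referenced terms file before deciding whether a page may be used for training.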

Legally, the RSL Collective functions as a centralized organization to negotiate licensing terms and collect royalties on behalf of rights holders, akin to music licensing bodies like ASCAP. This structure offers publishers a streamlined point of contact and enables collective royalty management across multiple licensors.

Early Adoption and Industry Participation

Several prominent publishers have joined the RSL Collective, including Yahoo, Reddit, Medium, O’Reilly Media, and The Daily Beast, while others like Fastly and Adweek endorse the protocol without participating in the collective. Notably, Reddit already secures an estimated $60 million annually from Google for training data usage, demonstrating that individual deals can coexist within the RSL framework.

Challenges in Implementation

Despite the promise, practical challenges persist. Tracking which documents were used during AI training is harder than tracking plays of a song or a film. Products that source data in real time, such as Google’s AI Search Abstracts, provide clear attribution, but many large language models keep no log of the data they ingest during training, which complicates royalty calculations, especially for per-inference payments.
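
To see why a usage log matters for per-inference payments, consider the minimal Python sketch below. It assumes a hypothetical attribution log with one publisher name per attributed inference and a flat rate in cents; neither the log format nor the rate comes from RSL itself.

```python
from collections import Counter
from typing import Dict, Iterable


def per_inference_payouts(attribution_log: Iterable[str], rate_cents: int) -> Dict[str, int]:
    """Multiply each publisher's attributed inference count by a flat rate.

    Both the log format and the flat rate are illustrative assumptions.
    """
    counts = Counter(attribution_log)
    return {publisher: n * rate_cents for publisher, n in counts.items()}


# Hypothetical log: one entry per inference that drew on a publisher's content.
log = ["publisher-a.example", "publisher-b.example", "publisher-a.example"]
payouts = per_inference_payouts(log, rate_cents=2)
print(payouts)
```

With an empty log the function returns an empty dictionary, which is the problem in miniature: without recorded attribution there is nothing to count and nothing to pay out.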

Doug Leeds, RSL co-founder and ex-CEO of IAC Publishing, expressed cautious optimism: “Some of the licensing agreements they’ve already done have required them to be able to report on it, so it’s possible. It doesn’t have to be perfect. It just has to be good enough to get people paid.”

The Road Ahead: Will AI Labs Engage?

The critical question remains whether leading AI developers will adopt RSL. While companies like Scale AI and Mercor demonstrate willingness to pay for quality data, the industry has traditionally relied on freely accessible web data, such as Common Crawl datasets. Distinguishing between web scraping and machine-assisted browsing adds further complexity, as highlighted by recent disputes like that between Cloudflare and Perplexity.

Leeds pointed to public endorsements of licensing systems by AI leaders, including Sundar Pichai’s remarks at the DealBook Summit, as encouraging signs. “They have said outwardly to everyone, something like this needs to exist,” he said. “We need a protocol. We need a system.”

Whether this momentum translates into widespread adoption remains to be seen, but RSL represents a significant step toward resolving the entrenched challenges of AI training data licensing.

Russell Brandom has covered technology and platform policy since 2012, contributing to The Verge, Wired, and MIT Technology Review.

FinOracleAI — Market View

The launch of the Really Simple Licensing protocol addresses a critical industry pain point by proposing a scalable framework for AI training data licensing, potentially mitigating the risk of costly copyright litigation. The involvement of major publishers lends credibility and initial momentum, but adoption by AI developers remains uncertain given existing reliance on freely available datasets.

Market participants should monitor whether leading AI labs integrate RSL into their data sourcing strategies, as this could set a precedent for wider industry compliance and reshape data licensing economics. Risks include technical challenges in usage tracking and resistance from cost-sensitive AI firms.

Impact: neutral

Lilu Anderson is a technology writer and analyst with over 12 years of experience in the tech industry. A graduate of Stanford University with a degree in Computer Science, Lilu specializes in emerging technologies, software development, and cybersecurity. Her work has been published in renowned tech publications such as Wired, TechCrunch, and Ars Technica. Lilu’s articles are known for their detailed research, clear articulation, and insightful analysis, making them valuable to readers seeking reliable and up-to-date information on technology trends. She actively stays abreast of the latest advancements and regularly participates in industry conferences and tech meetups. With a strong reputation for expertise, authoritativeness, and trustworthiness, Lilu Anderson continues to deliver high-quality content that helps readers understand and navigate the fast-paced world of technology.