By Tibor Mérey, Sander van Loosbroek, Urs Rahne, Katie Round, and Amelia Doster
Generative AI creates content, but it’s also creating friction that, unchecked, may keep the technology from realizing its potential. This conflict—GenAI companies versus content owners—needs to become a collaboration. And soon.
The problem: GenAI is a fast learner, but IP tracking and compensation aren’t its strong suits. Algorithms uncover patterns and relationships—essential for generating new content—by training on huge volumes of unstructured data, such as news articles, images, even programming code. These data sets are collected in such a wide-scale manner that details like IP rights and payment often get lost in the shuffle. That’s a concern for content creators. But it should also be a concern for anyone counting on the promise of GenAI.
In the absence of tracking that shows when and how models use protected content, and without fair compensation for that use, rights holders are placing their data under guard. Already, many media companies are taking technical steps—and stepping into courtrooms—to prevent artificial intelligence companies from training on their content. This threatens to hamper the quality of GenAI output, which would put investments in GenAI at risk and stifle the development of new solutions.
The answer is a distribution channel that recognizes the value that content creators bring to GenAI, sparking cooperation instead of collision. This platform would provide three key functions: visibility into the source and use of training data, a licensing mechanism, and a payment solution for fair and fast compensation.
Such a platform has so far been elusive, but blockchain offers a good way forward. Although it’s often discounted because of its perceived complexity, blockchain fosters transparency and trust between unfamiliar parties. And it can handle payments down to the microtransaction level. Blockchain can be the backbone for a platform—and an ecosystem—where all sides benefit, to the benefit of us all.
Generative AI is a game changer—literally. Already, organizations are using it to write code faster, generate talking points for sales calls, even create concept art for computer games. The use cases keep growing and with them the market for everything GenAI: servers and storage, infrastructure as a service for training models, digital ads driven by the technology, specialized GenAI assistant software, and so on. According to Bloomberg Intelligence, GenAI is poised to be a $1.3 trillion business by 2032.
The immense value of the market is prompting creators to question the fairness of how GenAI uses their content. It’s not only that models may be training, free of charge, on a creator’s IP but also that the models may generate new content that infringes upon or competes with their own works. The counterargument—given voice in OpenAI’s response to a copyright infringement lawsuit filed by The New York Times—is that training GenAI models on publicly available internet materials is fair use. While the courts work that out (with some 25 lawsuits pending, according to the Copyright Alliance), rights holders are taking a more technical and more immediate approach: blocking the crawlers that forage the Web for training data.
Research by the Reuters Institute for the Study of Journalism found that by the end of 2023, 48% of the most widely used news websites were blocking OpenAI’s crawlers and 24% were blocking Google’s AI crawler. Those figures are almost certainly trending upward. In August 2024, Wired reported that a Who’s Who of media companies, including Facebook, Instagram, The New York Times, Financial Times, and Wired’s own parent, Condé Nast, were excluding their data from Apple’s AI training.
The off-limits signs take high-quality content off the table but also mean a shortage of diverse data. And that’s a problem, even as the overall volume of information continues to grow exponentially. Training on 200 pictures of a cat won’t help GenAI produce a picture of a dog. Models need both quality and diversity in their data sets. As content owners take more data out of circulation, we may see a downward spiral in the quality of GenAI output.
Another concern: as creators increasingly use GenAI in their own work (to augment and accelerate their process), lower-quality output impacts their end products. Ultimately, it’s not just the models that are starved of novel, compelling works. It’s all of us.
Finally, there’s concern about what this all means for GenAI innovation. First movers have already collected training data, and as the spigots close—and fast followers find it harder to access sufficient content—their advantage will only increase. If new entrants can’t gain footholds, promising solutions may never make it to market.
The key to moving ahead—and ensuring that diverse, high-quality content remains available—is recognizing that at its core, this is really an ecosystem problem, with interdependencies between the participants. Thriving ecosystems, whether in nature or business, are all about balance. Participants are incentivized toward collaboration rather than competition. For the GenAI ecosystem to thrive, there needs to be a system for content rights and usage where GenAI companies and content creators alike benefit: where working together is a better option than working at odds.
An ecosystem for GenAI would need to offer at least the following functionality:

- Transparency: visibility into the source of training data and how models use it to generate new content
- Licensing: a mechanism for granting and managing usage rights at scale
- Compensation: fair and fast payment, down to the microtransaction level
Currently, no solution tackles all these elements. Legislative approaches, for instance, have focused on transparency.
The European Union’s Artificial Intelligence Act, which came into initial effect in August 2024 (with its provisions to be rolled out over a three-year period), requires the disclosure of copyrighted works that GenAI models use in their training phase. It also calls for a newly created AI Office to develop a “simple and effective” template for summarizing this information. Given the vast amount of training data involved, full transparency promises to be a challenge. And even in the best-case scenario, rights owners would know only that models are training on their content, not how those models then use the data to create new content. Nor does the act provide a mechanism for licensing and compensation.
In the US, legislation is fragmented, with most laws governing AI enacted at the state level—and typically focusing on consumer protection. At the federal level, the proposed Generative AI Copyright Disclosure Act does have an IP focus, but like the EU legislation, stops at transparency. The bill requires the disclosure of copyrighted works in training data sets, enabling rights owners to seek compensation, but it doesn’t address what fair compensation looks like or provide a path for obtaining it.
Traditional licensing mechanisms, meanwhile, are poorly suited for a GenAI world. When a model trains on content—learning patterns and relationships—that’s just the start of the story. For a given query, the model bases its output on specific pieces of training data. Say, for the sake of simplicity, the model trains on ten pieces of content. In responding to a query and creating a new piece of content, the model may rely on four of those ten pieces. Complicating matters further, each of those four pieces may contribute to a different degree, requiring payment proportional to that use. Conventional licensing agreements, memorialized in written legal documents, aren’t designed to track this intricate web of distribution.
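The proportional arrangement described above can be made concrete with a small sketch. This is illustrative only: it assumes the model can expose per-source attribution weights for a generated output (no such standard API exists today), and the content IDs, weights, and per-query fee are hypothetical.

```python
# Hypothetical sketch: splitting one query's licensing fee across the
# training pieces that contributed to the generated output, in
# proportion to their (assumed) attribution weights.

def proportional_payouts(attributions, query_fee):
    """Split a per-query fee across contributing sources.

    attributions: mapping of content ID -> contribution weight
    query_fee: total amount owed for this generated output
    """
    total = sum(attributions.values())
    return {cid: query_fee * w / total for cid, w in attributions.items()}

# Four of ten training pieces contributed, each to a different degree.
weights = {"article_a": 0.5, "photo_b": 0.25, "code_c": 0.15, "essay_d": 0.10}
payouts = proportional_payouts(weights, query_fee=0.02)  # a 2-cent micropayment
```

A conventional written license has no way to express this many-to-one, usage-weighted flow; a programmable ledger can settle it automatically at query time.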
Similarly, centralized content platforms are a less-than-ideal solution. Granted, many of these repositories—already a prime source of training data given their large collections of images, text, or video—facilitate rights management and compensation for creators. But centralized platforms also present drawbacks. The house “cut” can be high, reducing creator income. Reliance on an intermediary can lead to delays in payment. And that intermediary, which runs the show, often has outsize control over content distribution and pricing, with creators having little insight into how earnings are calculated and distributed.
If the old ways won’t cut it, a novel approach becomes essential. What’s also clear is that technology is key. New technical standards and platforms are already tackling some of the challenges that widespread access to—and distribution of—digital media content present. For example, to help publishers and consumers verify the authenticity of online content and prevent the spread of misinformation, the Coalition for Content Provenance and Authenticity (C2PA) has developed an open standard for certifying the origin and history of any piece of content. Provenance data, such as who created the work and how it may have been edited subsequently, travels with the content as it flows across the internet.
Closer to home—for those on both sides of GenAI’s copyright clash—is an effort to use technology to tackle the attribution challenge. A startup called ProRata.ai is developing a GenAI platform that would track the specific content it uses to answer a prompt and share revenue with content owners accordingly.
Platforms like ProRata spotlight the need for—and viability of—a symbiotic approach: one where content owners and GenAI companies both benefit. Taking this idea further, leveling the playing field between all GenAI companies and all content owners will require a technological solution that fosters ecosystem growth.
Blockchain is a prime candidate. A distributed database that records information and transactions but prohibits alterations to existing records, blockchain ensures trust and fairness. Organizations use it to track and trade product carbon footprint certificates, facilitate financial transactions (down to the micropayment level), and record ownership of digital and real-world assets, among other applications.
These capabilities mean blockchain can track the origin and use of training data, provide an efficient means for licensing, and support microtransactions on a macro-level scale (crucial because each piece of content a GenAI model generates can be based on thousands of training inputs, resulting in a massive volume of very low payments). In short, it hits the trifecta of transparency, licensing, and compensation.
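The transparency property at the heart of this trifecta can be sketched in a few lines. The example below is a toy, in-memory hash chain, not a real blockchain network: each attribution record is hashed together with the hash of the previous record, so any later tampering breaks the chain. The record fields and fee amounts are assumptions for illustration.

```python
# Minimal sketch of a hash-chained, append-only ledger for attribution
# records. Illustrative only: a production platform would run on an
# actual distributed blockchain, not a single in-memory list.
import hashlib
import json

class AttributionLedger:
    def __init__(self):
        self.blocks = []

    def append(self, record):
        # Each block's hash covers the previous block's hash, chaining them.
        prev_hash = self.blocks[-1]["hash"] if self.blocks else "0" * 64
        payload = json.dumps(record, sort_keys=True)  # canonical serialization
        block_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        self.blocks.append({"record": record, "prev": prev_hash, "hash": block_hash})

    def verify(self):
        # Recompute every hash; any altered record breaks the chain.
        prev = "0" * 64
        for block in self.blocks:
            payload = json.dumps(block["record"], sort_keys=True)
            expected = hashlib.sha256((prev + payload).encode()).hexdigest()
            if block["prev"] != prev or block["hash"] != expected:
                return False
            prev = block["hash"]
        return True

ledger = AttributionLedger()
ledger.append({"content_id": "article_a", "use": "training", "fee": 0.0001})
ledger.append({"content_id": "photo_b", "use": "generation", "fee": 0.0002})
```

Because each entry is cheap to write and impossible to alter unnoticed, the same structure can carry the massive volume of very small payment records that per-output attribution generates.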
A counterpoint—and it’s a reasonable one—is the computational cost of such a solution. Granted, current GenAI solutions are not optimized for keeping track of what training data models use when generating new content, and we estimate that the process could raise computational resource requirements by up to 30%. But optimized GenAI solutions should be able to track training data usage much more efficiently and require only 1% to 5% more computational resources. (See “Sparking the Discussion.”)
It’s also true that blockchain is not a plug-and-play technology. The platform’s initial orchestrators—the content providers and GenAI companies leading the charge—will need to partner with a tech venture that can supply the necessary blockchain expertise and build out the underlying infrastructure.
But technology alone is not enough. Successfully building a blockchain-based IP platform—and a thriving ecosystem—means aligning participants, ensuring governance, and driving scale in the right way. Orchestrators will take point here: setting the rules, coordinating efforts, and bearing significant upfront costs and risks. The best candidates for the role will possess sufficient resources, be invested in the ecosystem’s success, and be able to manage relationships with other ecosystem members.
To make it all work, orchestrators should take five critical steps:
GenAI companies thrive when content creators thrive. The current situation has proved to be unsustainable, and GenAI companies appear to recognize this. They’re trying various strategies to source high-quality data, comply with new regulations, and prevent more lawsuits. What they might be missing out on is the opportunity to foster a creator economy. Rewarding high-quality content will lead to the creation of more high-quality works: the fuel GenAI companies need to continue generating new content at scale.
Ironically, AI itself can rate submitted content on uniqueness and attractiveness and offer compensation accordingly, encouraging content creators to produce new and exciting works that meet GenAI needs.
But who “pays the bill”?
There are multiple paths toward fair compensation for content creation. Compensation can flow to the creators who contributed to a piece of generated content while respecting the flat-fee business models of GenAI companies, much as music-streaming services work today. Alternatives are also possible. For example, generated content could remain free for individual use but require a license fee for commercial applications. This would allow GenAI companies to share the added revenue stream with creators and secure access to increasing amounts of unique and differentiating content—which will help them take GenAI, and their business, further.
Recognizing the value that creators bring to GenAI, and ensuring fair compensation for using their works, is in the interest of all players in the GenAI space. GenAI companies can meet their growing demand for new training data, while content owners can tap a significant opportunity to benefit from their IP. Working together, instead of at odds, both groups can sustain growth and prosperity. By creating a blockchain-based platform for tracking and monetizing content, an ecosystem can flourish—and GenAI’s impact can keep growing.
The authors thank Marco Badur, Cathy Hackl, Daniel Sack, and Stefan Wang for their contributions to this article.