How AI Training Data Could Be Priced Using Existing Methods

Two researchers propose a market mechanism that values content contributions without new experiments, using data AI companies already produce during model training.

Omega Editorial · June 15, 2026 · 3 min read

AI companies already possess the technical information needed to fairly compensate content creators for training data, according to a framework proposed by economist E. Glen Weyl and computer scientist Raul Castro Fernandez. The approach sidesteps the argument that valuing individual contributions is prohibitively expensive by using two datasets that model builders generate as a standard part of training.

The proposal arrives as publishers, authors, and artists pursue copyright litigation while AI companies defend their use of publicly available data as fair use. Both sides claim the other's position is economically unworkable. But documents from Anthropic executives that surfaced in legal discovery show industry leaders have known since at least 2021 that low-cost valuation methods exist.

The technical foundation

The framework rests on two pieces of information AI companies already produce:

Data mixture weights reveal relative value. Before training, builders decide what proportion of each data type (web text, books, code, news articles) to include. These proportions aren't arbitrary. According to the equimarginal principle in economics, if the mixture is optimized, the last token from each source contributes roughly equally to model performance. Under that assumption, a source's weight in the mixture signals how much value the builder expects from it relative to the others. The weights cost nothing extra to calculate because they're required for training.
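The equimarginal argument can be written out as a short derivation. This is a sketch under assumptions the article does not state: $L$ is the training loss as a function of the token counts $n_i$ drawn from each of $k$ sources, $N$ is a fixed token budget, tokens from every source cost the same, and the optimum is interior. The symbols $L$, $n_i$, $N$, $\lambda$, and $w_i$ are introduced here for illustration only.

```latex
\begin{aligned}
&\min_{n_1,\dots,n_k} \; L(n_1,\dots,n_k)
  \quad \text{subject to} \quad \sum_{i=1}^{k} n_i = N \\[4pt]
&\text{First-order condition:} \quad
  -\frac{\partial L}{\partial n_i} = \lambda
  \quad \text{for every source } i \\[4pt]
&\text{Pricing each token at the common marginal value } \lambda: \quad
  \frac{\lambda\, n_i}{\sum_{j=1}^{k} \lambda\, n_j} = \frac{n_i}{N} = w_i
\end{aligned}
```

Under these assumptions, a source's share of the total data payment equals its mixture weight $w_i$, which is why the weights can stand in for relative value.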

Scaling laws reveal the total value attributable to data. These empirical relationships between model performance, compute, and training data volume show how much of a model's value comes from each input. Using standard industry estimates, the researchers calculate that data accounts for roughly 40-50% of pre-training value, though Anthropic executives Dario Amodei and Chris Olah estimated approximately 20% in their internal memo. One-third, roughly midway between the two estimates, serves as a working figure.

The payment base would be per-model operating profit—the profit over variable costs of serving a specific model—similar to how Hollywood grants profit shares to creative contributors.

The distribution mechanism

Weyl and Fernandez propose adapting collective management organizations (CMOs) like ASCAP and BMI, which have distributed music royalties for over a century. The system would operate in three steps:

  1. Determine total payment as a percentage of each model's operating profit, anchored to the scaling-law-implied data share
  2. Divide payments across data sources using reported mixing weights
  3. Distribute to creators through CMOs, similar to how ASCAP pays songwriters
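The first two steps above reduce to simple arithmetic, sketched here in Python. This is an illustration under stated assumptions, not the researchers' implementation: the function name `allocate_payments`, the profit figure, and the mixture weights are all hypothetical, and only the one-third data share comes from the article's working figure.

```python
def allocate_payments(operating_profit, data_share, mixture_weights):
    """Split a model's data payment pool across sources by mixture weight.

    operating_profit: per-model operating profit (any currency unit).
    data_share: fraction of profit attributed to data, between 0 and 1.
    mixture_weights: mapping of source name to its training-mixture weight.
    """
    if not 0.0 <= data_share <= 1.0:
        raise ValueError("data_share must be between 0 and 1")
    total_weight = sum(mixture_weights.values())
    if total_weight <= 0:
        raise ValueError("mixture weights must sum to a positive number")

    # Step 1: total payment is a fixed share of operating profit.
    pool = operating_profit * data_share

    # Step 2: divide the pool in proportion to the reported weights.
    # Normalizing by total_weight tolerates weights that do not sum to 1.
    return {
        source: pool * weight / total_weight
        for source, weight in mixture_weights.items()
    }


# Hypothetical inputs: 900 (say, millions of dollars) in operating profit
# and made-up mixture weights. Only the one-third share is from the article.
weights = {"web_text": 0.5, "books": 0.2, "code": 0.2, "news": 0.1}
payouts = allocate_payments(900.0, 1 / 3, weights)

for source, amount in payouts.items():
    print(f"{source}: {amount:.1f}")
```

Step 3, passing each source's amount on to individual creators, is left out because the article does not specify how a CMO would divide money within a source.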

Recent European Parliament acts and White House executive orders have suggested using CMOs for pre-training data compensation, though without specifying the technical mechanism.

Why it matters

The current fight focuses on roughly $15 billion in operating profits generated by AI models to date, too little to structurally change any industry. But if AI companies reach the tens of trillions in annual earnings that industry leaders project, compensation to creators could reach trillions per year. This would give large populations a productive stake in AI's future rather than dependence on universal basic income proposals.

More immediately, AI companies face a data quality problem: research on "model collapse" shows that training on synthetic content degrades performance as outputs homogenize. Fresh human data requires economic institutions (newsrooms, publishers, universities) that need revenue to survive.

The framework was detailed in Harvard Business Review by Weyl, who co-founded the RadicalxChange Foundation and leads Microsoft Research's Plural Technology Collaboratory, and Castro Fernandez, an assistant professor of computer science at the University of Chicago.

This is an original analysis by the Omega editorial team. Source reporting: AI Watch.
