Atlantic Exposes 21 Million Songs in AI Training Datasets

A new searchable database reveals the massive scale of copyrighted music being used to train generative AI models, often without proper licensing.

Omega Editorial · June 20, 2026 · 2 min read

Millions of tracks used without clear authorization

The Atlantic has published a searchable database documenting four datasets containing more than 21 million songs that have been used to train AI music generation models. Reporter Alex Reisner assembled the database after uncovering the datasets, two of which contain 12 million and 9 million tracks respectively, while two smaller sets include over 100,000 songs each.

The datasets have been downloaded thousands of times, and major AI companies including Google and Stability AI have confirmed using them in published research papers. Artists whose work appears in these training sets range from mainstream acts like Lady Gaga and Bruce Springsteen to experimental musicians like Aphex Twin and Hainbach.

How the datasets circumvent platform protections

Although the datasets are available online, most do not contain the actual audio files. Three of the four consist of links to songs on YouTube or Spotify. AI developers then use automated tools to download the audio, often bypassing login requirements, advertisements, and other mechanisms designed to generate revenue for creators or the platforms themselves.

Downloading audio this way violates the terms of service of both YouTube and Spotify. Some datasets, such as those sourced from the Free Music Archive, are licensed for personal streaming but explicitly require commercial licensing for business applications, a requirement that appears to be widely ignored.

Why it matters

This database provides concrete evidence of the scale at which copyrighted music is being used to train commercial AI systems, often without proper authorization or compensation to artists. The revelation comes as multiple lawsuits challenge whether using copyrighted material for AI training constitutes fair use. For music industry executives and technology leaders, the database shows which specific works are being used and by whom, information that could prove crucial in ongoing legal battles and licensing negotiations. The Atlantic's tool allows anyone to search for specific artists or songs to see if their work appears in these training datasets.

The database is publicly accessible through the Atlantic's AI Watchdog site, where users can search for songs, books, and other media being used to train AI models. The tool represents one of the first comprehensive public resources documenting the specific copyrighted works fueling the generative AI industry.

These details were first reported by Alex Reisner at The Atlantic.

This is an original analysis by the Omega editorial team. Source reporting: AI Watch.
