Atlantic Exposes 21 Million Songs in AI Training Datasets
A new searchable database reveals the massive scale of copyrighted music being used to train generative AI models, often without proper licensing.
Millions of tracks used without clear authorization
The Atlantic has published a searchable database documenting four datasets containing more than 21 million songs that have been used to train AI music generation models. Reporter Alex Reisner assembled the database after uncovering the datasets, two of which contain 12 million and 9 million tracks respectively, while two smaller sets include over 100,000 songs each.
The datasets have been downloaded thousands of times, and major AI companies including Google and Stability AI have acknowledged in published research papers that they used them. Artists whose work appears in these training sets range from mainstream acts like Lady Gaga and Bruce Springsteen to experimental musicians like Aphex Twin and Hainbach.
How the datasets circumvent platform protections
Although the datasets are publicly available online, most do not contain the audio files themselves. Three of the four consist of links to songs on YouTube or Spotify. AI developers then use automated tools to download the audio, often bypassing login requirements, advertisements, and other mechanisms designed to generate revenue for creators or the platforms themselves.
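To make the distinction concrete, a link-based dataset can be pictured as a table of pointers rather than a collection of recordings. The following is a minimal sketch under stated assumptions: the field names, track titles, and placeholder IDs are hypothetical and not taken from any of the four datasets. It only tallies which platform each link points to and downloads nothing.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical rows: field names, titles, and IDs are invented for illustration.
# Note that each row holds a link to a song, not the audio itself.
ROWS = [
    {"title": "Example Track A", "url": "https://www.youtube.com/watch?v=PLACEHOLDER1"},
    {"title": "Example Track B", "url": "https://open.spotify.com/track/PLACEHOLDER2"},
    {"title": "Example Track C", "url": "https://youtu.be/PLACEHOLDER3"},
]

# Map known hostnames to a platform label.
HOSTS = {
    "www.youtube.com": "youtube",
    "youtube.com": "youtube",
    "youtu.be": "youtube",
    "open.spotify.com": "spotify",
}


def platform_of(url: str) -> str:
    """Return the platform label for a URL, or 'other' if the host is unknown."""
    host = urlparse(url).netloc.lower()
    return HOSTS.get(host, "other")


def count_by_platform(rows):
    """Count how many rows point to each platform."""
    return Counter(platform_of(row["url"]) for row in rows)


if __name__ == "__main__":
    print(count_by_platform(ROWS))
```

A file like this is small and easy to share, which is part of why such datasets circulate widely: the recordings stay on the platforms until someone retrieves them.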
Downloading audio this way violates the terms of service of both YouTube and Spotify. Some datasets, such as those sourced from the Free Music Archive, contain music that is licensed for personal streaming but explicitly requires a commercial license for business use, a requirement that appears to be widely ignored.
Why it matters
This database provides concrete evidence of the scale at which copyrighted music is being used to train commercial AI systems, often without authorization or compensation to artists. It arrives as multiple lawsuits challenge whether using copyrighted material for AI training constitutes fair use. For music industry executives and technology leaders, the database shows which specific works are being used and by whom, information that could prove crucial in ongoing legal battles and licensing negotiations.
The database is publicly accessible through The Atlantic's AI Watchdog site, where anyone can search for specific artists, songs, books, and other media to see whether their work appears in AI training data. The tool is one of the first comprehensive public resources documenting the specific copyrighted works fueling the generative AI industry.
These details were first reported by Alex Reisner at The Atlantic.
This is an original analysis by the Omega editorial team. Source reporting: AI Watch.