Google’s C4 dataset scraped hundreds of my web pages without permission

Google’s C4 dataset used for AI modelling is built upon hundreds of millions of scraped web pages from 15 million websites. Is your website one of them?

Mine is. Actually, many of mine are, including Lean Media:

[Image: example of copyrighted web content scraped into Google's C4 dataset]

I also found personal blogs, book websites, my genealogy business’ website and blog, and other content included in the C4 dataset. No permission was asked or granted, despite copyright notices on almost every page of the websites I own and operate.

Indeed, a separate Washington Post article found the “copyright symbol appears more than 200 million times in the C4 data set.” As Deborah Edwards-Oñoro notes on the Lireo Design blog:

Like many other people who publish online, my content is copyrighted.

I didn’t provide consent for my content to be scraped and used for Large Language Model (LLM) training data.

Where is my check from Google? The check paying me for helping to train their LLM with my content?

A bigger question I have: How is this C4 data being monetized by other companies, and how will the creators of the content in the C4 dataset be compensated by them? This isn’t a case of Amazon pirates using ChatGPT summaries to rip off authors or “write” books. This is one of the biggest companies on the planet taking written content from individuals and private companies without permission to make even more money and “own” the generative AI and LLM space for decades to come.

You can search for your own content (text or URLs) in the dataset here. Do it now and take screenshots; I don’t think this C4 search engine, hosted by the Allen Institute for Artificial Intelligence (AI2), will remain public for long. The Allen Institute claims an “Open Data Commons Attribution License” and “Common Crawl” terms of use, but I and millions of other writers and creators never consented to either.
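If the search page does disappear, the same check can be run against a downloaded copy of the data. The following is a rough sketch, not AI2's or Google's tooling: it assumes the shards are gzip-compressed JSON-lines files in which each record has `url` and `text` fields (the layout of the publicly distributed C4 files, as far as I know). The function names, the shard path, and `example.com` are placeholders of my own.

```python
import gzip
import json
from urllib.parse import urlparse


def host_matches(url, domain):
    """Return True if the URL's host is `domain` or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def find_domain_records(lines, domain):
    """Yield parsed records whose 'url' field belongs to `domain`.

    `lines` is any iterable of JSON strings, one record per line
    (assumed record layout: {"url": ..., "text": ..., ...}).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if host_matches(record.get("url", ""), domain):
            yield record


def scan_shard(path, domain):
    """Scan one gzip-compressed JSON-lines shard at a (hypothetical) local path."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from find_domain_records(handle, domain)
```

To use it, call something like `scan_shard("c4-shard.json.gz", "example.com")` with your own domain and a shard you have downloaded, then save the matching URLs alongside your screenshots. Matching on the parsed hostname rather than a plain substring avoids false hits from unrelated sites whose addresses merely contain your domain name.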

#ai #llm #copyright
