US Publishers Demand Common Crawl Halt AI Content Scraping

WASHINGTON – A prominent U.S. publisher trade group has demanded that the Common Crawl Foundation stop scraping copyrighted content and remove existing material that has been used to train artificial intelligence models.

Digital Content Next (DCN) sent a cease-and-desist letter to Common Crawl, arguing that copyright law requires permission before content is included and challenging the assumption that technically accessible content can be freely collected and monetized.

Common Crawl’s public archive, which has collected billions of web pages since 2007, has been used extensively to train AI models, including OpenAI’s GPT-3.

Jason Kint, CEO of Digital Content Next, stated that the current opt-out regime for content collection is insufficient and places an undue burden on publishers.

The DCN letter highlighted concerns about Common Crawl’s content removal process, citing past instances where material from publishers, including The New York Times and Danish news organizations, reportedly remained available despite agreements for removal.

Common Crawl Executive Director Rich Skrenta said the archive’s file format prevents the organization from editing it after publication. However, he added that Common Crawl removes affected URLs from subsequent crawls and makes them inaccessible through its public tools.

Skrenta also indicated support for developing open standards that would let websites express their preferences regarding AI scraping, while maintaining that opt-out should remain the default model.

Publishers face a broader problem: blocking AI crawlers only prevents future collection, so content already captured in existing archives remains available for AI training.

If more trade groups adopt DCN’s position, responsibility would shift away from individual publishers managing robots.txt files and toward a requirement for explicit permission before content is included in large-scale web archives.
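
For readers unfamiliar with how the robots.txt opt-out works in practice, the following is a minimal sketch using Python's standard-library robots.txt parser. It assumes a publisher wants to block Common Crawl's crawler, which identifies itself with the user agent CCBot; the example domain, the sample rules, and the is_allowed helper are hypothetical and are not taken from the DCN letter or from Common Crawl's statements.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve to opt out of
# Common Crawl while leaving other crawlers unaffected.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story"  # placeholder URL
    print("CCBot allowed:", is_allowed(ROBOTS_TXT, "CCBot", page))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", page))
```

A rule like this only signals a preference to crawlers that choose to honor it, and, as noted above, it applies to future crawls rather than to pages already stored in an archive.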

Organizations like the News/Media Alliance and The Associated Press have also been vocal about the unauthorized use of their content for AI training, underscoring a growing industry-wide dispute over intellectual property rights in the age of generative AI.

Source: Search Engine Journal

Written by

Angela Palumbo

Angela Palumbo, Senior Editor at Rabbit Rank since 2023, holds a bachelor's in communications. She focuses on fact-checking and simplifying complex topics while also leading strategy for the news department.
